@pbreit: if I'm not mistaken, the "parse()" function in the DAL handles a lot: 
basically it reads just about any value returned from the database and converts 
it to the correct type (blob, integer, double, datetime, ..., list:*, just to 
name a few), handles virtualfields, and prepares the Rows object, building the 
Storage() objects. The whole resultset is fetched, "parsed()" line by line, 
column by column, and returned as a single list. Even if those operations were 
kept to a minimum, building a simple list of 1 million records takes some time 
(and memory).
Obviously the first shortcut is to use limitby. The second is selecting only 
the needed columns (a best practice anyway, to limit the amount of data going 
over the wire). No other shortcuts.
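
Just for reference, something like this covers both shortcuts (the db connection 
string and the "person" table are made up for the example):

    from gluon.dal import DAL, Field

    db = DAL('sqlite://storage.sqlite')        # assumption: any supported backend
    db.define_table('person', Field('name'))   # hypothetical table

    # shortcut 1 + 2 combined: only the needed columns, only the first 1000 rows
    rows = db(db.person.id > 0).select(
        db.person.id,
        db.person.name,
        limitby=(0, 1000),   # translated to LIMIT ... OFFSET ... by the adapter
    )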

@all: given that a million records are hardly ever displayed, SQLTABLEd, 
SQLFORM.gridded, pickled, jsoned, etc., I think the "need" here is having a 
Storage() object instead of a tuple, so values can be accessed as row.id 
instead of row[0], etc., as in the current executesql.
I see two roads here (to be considered as 1 AND 2, and also as 1 OR 2), but I 
don't know the implications of having to deal with SQLFORM.grid, for instance.
1) skip parse() and return a quasi-Rows object "a la executesql", just turning 
the list of tuples into a list of row objects, skipping virtual fields, 
references, etc. (a rough sketch follows after this list)
2) expose the returned values as a generator, and optionally as a forward-only 
iterator "a la web.py's iterbetter()" 
(https://github.com/webpy/webpy/blob/master/web/db.py#L646), i.e. you can fetch 
the results and cycle through them only once
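
To make 1) concrete, a rough sketch (not DAL code, just an illustration; the 
helper name and cursor handling are my own assumptions) of wrapping raw DB-API 
tuples in gluon.storage.Storage:

    from gluon.storage import Storage

    def quasi_rows(cursor):
        """Hypothetical helper: wrap raw DB-API tuples as Storage objects,
        skipping parse(), virtual fields and reference handling entirely."""
        colnames = [d[0] for d in cursor.description]   # DB-API 2.0 column metadata
        return [Storage(zip(colnames, record)) for record in cursor.fetchall()]

    # usage sketch: rows = quasi_rows(some_dbapi_cursor)
    # then rows[0].id instead of rows[0][0]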

Downsides of method 1):
- some types might not be converted at all, references would be unavailable 
(but for a million records you should have done a join anyway), no 
virtualfields
- building a list of a million records still requires time
- for large datasets, having to work on them only "after" the whole list is 
prepared can take some time
Downsides of method 2):
- it may work as expected only if the underlying db driver supports it
- if you need to cycle through the resultset a second time, another query is 
required, or you have to come up with your own memoization (i.e. prepare an 
empty list and append the results you're interested in cycling through again); 
see the sketch below
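
A bare-bones sketch of what 2) could look like, loosely modelled on web.py's 
iterbetter (names are mine, error handling omitted):

    from gluon.storage import Storage

    class ForwardRows(object):
        """Hypothetical forward-only wrapper: each record is yielded once as a
        Storage object; a second full pass means re-running the query or
        caching the rows you care about yourself."""

        def __init__(self, cursor):
            self.cursor = cursor
            self.colnames = [d[0] for d in cursor.description]

        def __iter__(self):
            record = self.cursor.fetchone()
            while record is not None:
                yield Storage(zip(self.colnames, record))
                record = self.cursor.fetchone()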

I used web.py a lot, and even for millions of records it is very fast at 
returning results.
NB: the "time cut" here is the time that passes from "the db is ready to return 
results" to "the queryset is done and prepared for me to work on it".
The assumption in 2) is that you rarely need random access to a million records 
- i.e. myrows[31289], myrows[129459] - and what you can do in 2 loops over the 
same dataset is usually "accomplishable" (does that word even exist? :D) in a 
single one.
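
To illustrate that last point (row.amount and the threshold are invented, and 
ForwardRows is the hypothetical wrapper sketched above): two passes, one to sum 
and one to filter, usually collapse into a single forward pass:

    total = 0
    expensive = []
    for row in ForwardRows(cursor):   # single forward pass over the resultset
        total += row.amount           # what loop #1 would have done
        if row.amount > 1000:         # what loop #2 would have done
            expensive.append(row)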
