>
>
> Technically, it would be
> db.Country(Name=countryName).City(db.City.Name == cityName).select().first()
>
>
>
Thanks for correcting me.
 

> The way you have coded it, it is actually 6 round trips -- there are 2 
> selects per function -- one for the country, and a second for the city. But 
> that's not how you would do it in web2py anyway. Instead, you would issue 
> no selects and instead do it with just a single update -- so a total of 2 
> round trips to the db (i.e., 2 updates) :
>
> def a_child_was_born_in(countryName, cityName):
>     query = (db.City.Name == cityName) & (db.City.Country.belongs(db.
> Country.Name == countryName))
>     db(query).update(Population=db.City.Population + 1)
>
>
Obviously I could have done it this way, but I wanted an example that I 
could use to illustrate the differences.
Your suggestion is a circumstantial optimization of an example that is 
meant to show something else.
It is a testament to my poor choice of example more than to any weakness of 
the ORM.
This is before the optimization process has even begun. Let's see where we 
go from here.
 

> So, in order to do the update, we do not first have to query the database 
> to retrieve the record. This is actually an advantage over the ORM, which 
> requires that you first retrieve the record before updating it.
>

That is not true. I could have done it otherwise - this is just for the 
sake of the example.
There is nothing architecturally preventing an ORM from issuing an update 
without a select.
It's an implementation detail, not an architectural one. You may have 
encountered ORMs that can't support that, but that doesn't mean the 
problem is in the architecture.
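For instance, SQLAlchemy's own ORM layer already exposes a bulk 
Query.update() that emits a single UPDATE with no prior SELECT. A minimal 
sketch, assuming City and Country are mapped classes with Name and 
Population columns and a City.country_id foreign key (those attribute names 
are my assumption, not from the thread); the country criterion goes through 
a subquery because Query.update() does not accept joined criteria directly:

def a_child_was_born_in(session, countryName, cityName):
    # Subquery for the matching country ids -- it is embedded in the
    # UPDATE's WHERE clause rather than fetched separately.
    country_ids = session.query(Country.id)\
                         .filter(Country.Name == countryName)

    # Query.update() emits one UPDATE statement; no SELECT is issued and
    # no row objects are materialized in the session.
    session.query(City)\
           .filter(City.Name == cityName)\
           .filter(City.country_id.in_(country_ids))\
           .update({"Population": City.Population + 1},
                   synchronize_session=False)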
 

> The ORM will issue two queries to get the record if lazy loading is used, 
> or one if eager loading, a join, or a subquery is used.
>

Correct. Again, for the sake of this example.
 

> Furthermore, because web2py doesn't need to retrieve the records, it also 
> has a processing and memory advantage over the ORM, which must create the 
> record object, add it to the session, and hold it in memory.
>

That again is a circumstantial issue pertaining to this example. In most 
cases there would be more reuse of the objects, so creating them would be 
beneficial. Also, I don't see how web2py's DAL is any different - all of 
the objects you use in the query/update have to be created for you to use 
them...
In fact, in an ORM, only the objects needed for the query may be created 
for each transaction, but because of how web2py executes, you actually have 
to create the entire schema of objects from scratch on every request, so I 
can't see how web2py would create fewer objects - it would actually create 
more...
 

>  
>
>> Now, lets say we want to optimize that, so we do a "Lazy" version of 
>> those functions.
>>
>
> There's not much to optimize here. If you don't know ahead of time that 
> you will be making two updates to the same record (which may possibly 
> negate each other),
>

No. My point here was that a lazy query might be beneficial in some cases, 
so it could be integrated into the functions, at least optionally, and be 
chosen by the caller based on the circumstances. It also has nothing to do 
with anticipating the override-back-to-default. You are conflating two 
different mechanisms that are at play here.
Making it lazy is a mechanism enabled by the "Identity Map" in this case, 
as these are two namespaces that need to reuse the same object. Stick to 
the original example if you want to judge the line of thought - obviously 
it would break down once you optimize it the way you have, but that's 
irrelevant to the example at hand.
 

> I think the minimum number of db hits is two. You could retrieve the 
> record twice, defer the first update, recognize that the second update 
> cancels the first, and then make no update -- which is still 2 hits (well, 
> 1 hit if you cache the query).
>

Again, you are completely changing the example in order to optimize the 
use case differently.
Obviously that is possible, but it is still irrelevant, as the example is a 
mere means to an end of explaining something else. All you are saying here 
is that it is a bad example, and that may be so, but it has nothing to do 
with alternative optimization approaches.
 

> Or you could just make the 2 updates (as above). In any case, I believe 
> the ORM actually requires a minimum of 4 hits (see below), so web2py is 
> still doing a lot better.
>

Wrong. (see below)
 

>
> Assuming this is SQLA, I don't think that's quite the right syntax -- it 
> appears you are creating object instances rather than issuing queries. I 
> believe it should be something like this:
>
>
This isn't SQLA - it isn't anything at this point - just a suggestion.
 

> def a_child_was_born_in(countryName, cityName):
>     city = session.query(City).join(Country)\
>            .filter(Country.Name == countryName)\
>            .filter(City.Name == cityName).first()
>     city.Population += 1
>
> The above does a join and therefore gets it down to a single query for the 
> select. Otherwise, you could just query for the country, then access the 
> "City" attribute, which would lazily issue a second query (though only when 
> the first function is called).
>

Exactly. This would be how it would be done in SQLA. A single query with a 
join.
 

>
> The syntactic difference is small, but the semantic implication is 
>> profound.
>>
>
> Yes, but not quite in the way you think.
>

We shall see in a minute.
 

>  
>
>> The automatic cache-mechanism in the ORM will detect that we 
>> are querying the same record, and so would not query the database in the 
>> second function - just return the same object already in memory.
>>
>
> Again, assuming this is SQLA, that's not how it works. SQLA does not cache 
> queries -- when you run a query, it doesn't know what record will be 
> retrieved, so it doesn't know whether it already has the associated object 
> in the session. Hence, it will re-run the query both times. (The exception 
> to this is when you use .get() to fetch a record by primary key, which we 
> are not doing here.)
>

See, here is where you got it all wrong, and where everything after that 
breaks down.
I'm not sure how SQLA does this - I can look it up - but I did not assume 
this to be SQLA.
But I think you are wrong even in that case. An ORM layer should be smart 
enough to know that if you issue the same query twice within a transaction, 
it already has what it needs. It does not need to cache the query itself to 
do that. The way it works is that for each object, there are attributes 
that have been asked for by a query. So, say, in this case, in the second 
query (in the second function), the country object for "France" will 
already exist in memory, so it would be reused, even if the original query 
was a join. The ORM should be smart enough to generate the "France" object 
even though it is not queried explicitly. It might not even have a 
primary key in place, because the join operation did not return it, but it 
would still exist as an object in memory, with its "Name" attribute 
populated with "France" and its "City" attribute populated by some kind of 
sequence holding the "Paris" City object (again, without even a primary-key 
attribute populated), which in turn would have a "Population" attribute 
with a value of 2000001. So what would happen is that the ORM object graph 
would be traversed based on the query parameters (again, using MY proposed 
syntax, NOT the SQLA one you pointed out), and for each object already 
existing there that matches the filter, it would traverse inwards without 
issuing a query - eventually getting to the "Paris" city object. The EXACT 
SAME INSTANCE (!) would be returned, and the value of its "Population" 
attribute would be knocked back by one.
Now granted, for this to work, the "Identity Map" should be flexible enough 
to identify the already-existing "Paris" city object even though it is not 
asked for by primary key, since it is the only one in existence in this 
case and it matches the filter criteria.
So to summarize: there would be a single select at the first function call, 
then a lazy update, then a reuse of the city object in the second call, 
then another lazy update, and then a unit-of-work check that would compare 
the attributes with their original values, resulting in zero pending 
updates - so all in all, a single database operation, as I've described.
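
To restate that sequence as code - purely illustrative, using my proposed 
syntax rather than any existing API, with the second function name invented 
just for the walk-through:

def a_child_was_born_in(countryName, cityName):
    # First call: one SELECT (a join); the "France" and "Paris" objects
    # are built and registered in the identity map.
    city = db.Country(Name=countryName).City(db.City.Name == cityName)\
             .select().first()
    city.Population += 1    # lazy update - nothing sent to the db yet

def a_child_has_died_in(countryName, cityName):
    # Second call: the identity map recognizes the same filter criteria
    # and hands back the SAME "Paris" instance - no query is issued.
    city = db.Country(Name=countryName).City(db.City.Name == cityName)\
             .select().first()
    city.Population -= 1    # another lazy update

# At commit time the unit-of-work compares current values against the
# originals, finds Population back where it started, and issues zero
# UPDATEs - a single SELECT in total.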
 

>  
>
>> But an ORM can have an "Identity Mapper", that would make sure they the 
>> same object would be returned,
>> It would be bound to two different name-spaces, but it would be the same 
>> object.
>> Now we could implement a "Truely" lazy update. The increment that is done 
>> in the first function, would be reflected in the second one, because the 
>> same object would be returned,
>>
>
> Another problem here. Whenever you execute a new query, SQLA flushes the 
> pending changes. So, when you run the query in the second function, it will 
> first issue the update to the database from the first change. Once it has 
> done that, it will ultimately also have to issue the update from the second 
> function (though perhaps at some later time) in order to have the correct 
> value in the database.
>

SQLA would not necessarily flush these objects in this case, because this 
is all within the same transaction. The ACID guarantees (isolation, in 
particular) assure it that whatever it changed internally does not have to 
be flushed before the next query, since nothing else could have changed the 
data within the transaction the database opened on its end.
Granted, the default behavior of a session in SQLA is to do a flush() 
before each query (called "autoflushing"), but this can be disabled if you 
want lazy updates:

Flushing <http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#flushing>

When the Session 
<http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#sqlalchemy.orm.session.Session> 
is used with its default configuration, the flush step is nearly always 
done transparently. Specifically, the flush occurs before any individual 
Query 
<http://docs.sqlalchemy.org/en/rel_0_8/orm/query.html#sqlalchemy.orm.query.Query> 
is issued, as well as within the commit() 
<http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#sqlalchemy.orm.session.Session.commit> 
call before the transaction is committed. It also occurs before a SAVEPOINT 
is issued when begin_nested() 
<http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#sqlalchemy.orm.session.Session.begin_nested> 
is used.

Regardless of the autoflush setting, a flush can always be forced by 
issuing flush() 
<http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#sqlalchemy.orm.session.Session.flush>:

session.flush()

The "flush-on-Query" aspect of the behavior can be disabled by constructing 
sessionmaker 
<http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#sqlalchemy.orm.session.sessionmaker> 
with the flag autoflush=False:

Session = sessionmaker(autoflush=False)

Additionally, autoflush can be temporarily disabled by setting the 
autoflush flag at any time:

mysession = Session()
mysession.autoflush = False

Some autoflush-disable recipes are available at DisableAutoFlush 
<http://www.sqlalchemy.org/trac/wiki/UsageRecipes/DisableAutoflush>.
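
Applied to the example at hand, a minimal sketch of what that would look 
like in SQLA - assuming an existing engine and the City/Country mapping 
from the snippet you posted above, with the second function name again 
invented for illustration:

from sqlalchemy.orm import sessionmaker

# Disable flush-on-Query so the pending Population change is not pushed
# to the database before the second query runs.
Session = sessionmaker(bind=engine, autoflush=False)
session = Session()

def a_child_was_born_in(countryName, cityName):
    city = session.query(City).join(Country)\
                  .filter(Country.Name == countryName)\
                  .filter(City.Name == cityName).first()
    city.Population += 1    # stays pending in the session

def a_child_has_died_in(countryName, cityName):
    # With autoflush disabled, running this query does not flush the
    # pending Population change first; the identity map hands back the
    # same City instance, with its in-memory (already incremented) value
    # intact rather than overwritten.
    city = session.query(City).join(Country)\
                  .filter(Country.Name == countryName)\
                  .filter(City.Name == cityName).first()
    city.Population -= 1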
 

>
> So, I believe we have a minimum of 4 database hits with the ORM (5 if you 
> lazy load the cities when running the initial query) versus 2 hits with 
> web2py. 
>

Nope. We have a single database hit with an ORM.
 

> We also have more processing and memory usage with the ORM.
>

Again, since the ORM does not have to re-"execute" the entire 
database-schema object creation on each request, you've got it 
backwards... 
 

>
> To summarize:
>
>    - The ORM doesn't do direct updates to the database, so it must first 
>    select the records before updating them, unlike web2py, which can issue a 
>    direct update.
>
I am sure that even using SQLA it is possible to do a direct update - if 
not using the ORM layer, then surely using the Core. And such a feature 
could very easily be added to the ORM layer, even if not as a built-in then 
as an extra method on your ORM class. Again, SQLA, just as an example, does 
not restrict direct use of the Core, which lets you do everything you can 
do with the DAL, and then some...
And I am suggesting developing the same kind of layered design, so you 
could still do this via the DAL once you have an ORM on top of it. Though I 
wouldn't do that explicitly - I would build a custom method on my ORM class 
that would do it using the DAL for me (a rough sketch follows below).
But this entire fiasco is completely irrelevant to the example presented, 
as I used an (arguably poorly chosen) example just to explain other 
capabilities.
And as you saw, even using my example, it could still be faster and more 
efficient than the DAL by itself.
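
A rough, hypothetical sketch of that layering - none of these names exist 
in web2py today, and the DAL query is simply the one from earlier in this 
thread:

class City(ORMBase):   # hypothetical ORM-mapped class sitting on the DAL
    @classmethod
    def increment_population(cls, countryName, cityName):
        # Drop down to the underlying DAL for a direct, select-free
        # UPDATE, reusing the web2py snippet quoted above.
        query = (db.City.Name == cityName) & \
                (db.City.Country.belongs(db.Country.Name == countryName))
        db(query).update(Population=db.City.Population + 1)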

>
>    - The ORM doesn't cache queries, 
>
That's irrelevant, since it is not necessary. 

>
>    - so it must re-run the query twice, even though the record in 
>    question is already in the session after the first query.
>
Only if you use auto-flush (in SQLA's case).
 

>
>    - The ORM flushes pending changes before each query, so the first 
>    update goes to the database before the second query is run, ultimately 
>    necessitating both updates.
>
(see above)
Also, come to think of it, I am not sure that flushing invalidates the 
objects that were cached - I think only a commit() does that. So even if 
you leave auto-flush on, and the first update is sent to the database when 
the runtime gets to the second query, it may STILL not follow that the 
second query has to go to the database - the cached objects may still be 
"valid", even though the flush() pushed the update into the open 
transaction on the database. 
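
In SQLA terms the relevant knob is expire_on_commit - a sketch, assuming an 
existing engine:

from sqlalchemy.orm import sessionmaker

# flush() sends the pending INSERT/UPDATE statements but does not expire
# objects already loaded in the session; by default it is commit() that
# marks them expired (expire_on_commit=True), forcing a reload on the next
# attribute access. Setting it to False keeps the loaded state usable
# even across commits.
Session = sessionmaker(bind=engine, expire_on_commit=False)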
 

>
>    - The ORM must create the record object and manage various operations 
>    in the session, using more memory and processing time than web2py.
>
That may be so, but it does not automatically mean that it would be less 
efficient overall, as it has some optimization capabilities that the DAL 
does not have.
 

>
> One thing to keep in mind. Although the web2py DAL does not implement an 
> ORM design pattern, it is still object-oriented. We have Table, Field, Set, 
> Rows, Row, etc. These objects have various methods and attributes that 
> enable quite a bit of functionality and flexibility. I think most of the 
> behaviors you like so much in the SQLA ORM could in principle be 
> implemented within the DAL's objects (e.g., optional eager loading of 
> records within relationship attributes, or persistence upon first access 
> when lazily loaded). It may be worth thinking more about those ideas rather 
> than pushing the idea of a full-blown ORM layer.
>

That is definitely an option.
So long as all the stateful goodies are in place to support such features.

>
> Anthony
>
