Re: [sqlalchemy] require explicit FROMs

Mike Bayer Sun, 25 Jun 2017 18:59:16 -0700


On 06/25/2017 12:15 PM, Dave Vitek wrote:

Mike,
Thanks for the thorough feedback. It isn't surprising that there's a lotof code out there relying on the current behavior, but internalsqlalchemy reliance is certainly problematic. It sounds difficult towalk back the existing behavior at this point and I understand theapprehension about even adding a warning at this point.

The internals by and large don't rely on implicit FROM that much in anycase, since the internals need to do very complex transformations on SQLfragments and queries that are completely arbitrary, and the FROM addsinformation to the query that is otherwise often lost during thesetransformations. There are of course places where implicit FROM doescome into play, and these areas currently work without issue. Butoverall these use cases are not at all like the end-user use case Ithink you're referring towards.

Implicit FROM is not a feature that works incorrectly, it's a featurethat works perfectly, but delays the symptoms of other particular bugsfrom appearing as quickly as we'd like. I'd like to keep the focus hereon solving the problem of inadvertent bugs, and not on removal of thefeature itself which is extremely useful, intuitive, and widely relied upon.

Perhaps it could be an opt-in warning for projects that subscribe to the"explicit from" requirement. People who contribute to multiple projectsmight still be annoyed that sqlalchemy complains differently from oneproject to the next, but I think there's plenty of precedent forprojects having rules about what subset of library and language featuresthey allow. Once a project has been bitten by this sort of bug one toomany times, they might look into enabling the warning.However, in my experience warnings are ignored unless they trigger testfailures.

I actually have not had much in the way of complaints about this issueoccurring in the wild. I see it a lot with the ORM features describedabove but that's due to internal bugs within intricate systems that Ifix myself. It's actually a great idea to set the warnings filter sothat this does occur. py.test is aggressively moving towards makingwarnings visible and making it easy to raise them also.


Furthermore, I am worried in part about these bugs being

security bugs, in which case I might prefer an exception even in aproduction setting. Then again, I might not, since most of our implicitFROMs that weren't recently introduced were producing intended behavior.
With respect to detecting implicit FROMs by noticing slow queries orunexpected behavior, that works OK with some queries, but it still takessignificantly more human time than detecting things more explicitly andearlier in the pipeline.
There are cases where an under-constrained query runs faster, and fairlyspecific inputs would be required for a human to notice the badbehavior. Specifically, I think under-constrained EXISTS clausesfrequently go undetected for some time. Some query that is supposed tocheck whether a particular row in a table has some relationship ends upchecking whether any row in the table has some relationship. To makematters worse, it isn't entirely unusual for these two behaviors to beindistinguishable for typical data sets (perhaps because all the rowsbehave the same, or the tables are frequently singletons). In my worstnightmare, the intent might be to check if Alice has permission to do Xto Y, but the query ends up asking whether Alice has permission to do Xto anything, or whether anyone has permission to do X to Y.

There's lots of ways to write SQL queries that don't SELECT from thecorrect rows, or do get the correct rows but perform poorly (the mostcommon is that indexes are missed), or a combination of both. Overall,you can't get away with only checking early or only checking late in thepipeline, there need to be good unit tests that cover every query usedagainst decent test data, and there needs to be continuous analysis ofperformance and SQL emitted in the running application. I know peopleare doing this because that's how they alert me when the ORM ismis-interpreting a query.

Back to the issue at hand, I still maintain that providing a "workstotally differently" mode is a recipe for headaches and complexity foreveryone, adding more codepaths that add to the long term testingburden, diluting the usefulness of "implicit FROM" rather than fixingit, confusing users with a choice they shouldn't have to make, and atthe end of the day providing "detects cartesian products" logic to onlya subset of applications that opt in to this alternate mode and writetheir code in a totally new way.

Alternatively, logic that does do something like what I mentioned in myother email, that is a "linter" that can scan a select() for cartesianproducts would have the advantages of: 1. it can be "opt in" perenvironment, e.g. development and not production, so that any codebasecan use it 2. it will find errors of the "forgot to add a JOIN" varietyno matter how it occurred, whether it be user error at the Core or ORMlevel or if an ORM internal produced it 3. it would have no impact onexisting applications or internals as it would be a self-containedroutine that only runs if enabled. 4. it can be released as something"experimental" so that people can just play with it for a year or sohaving it only emit harmless warnings by default when turned on, untilwe find all the problems.

One disadvantage is that it adds a performance hit. This is mitigatedby the fact that it is optional and can be enabled only in development /tests. It also would be part of the statement compilation step so wouldbe part of what we cache when statement caching is used, which in 1.2 isused more prevalently than before within the internals and is alsoused by end-user code now in the form of the BakedQuery system.

Another disadvantage is that it is reinventing part of what the queryplanner is sort of doing already, but as you mentioned, it's more of adirect structural problem that is not as directly indicated within anEXPLAIN, and at least moves up the detection of the issue.

Finally, it may be problematic that it is not always correct, more inthe case of emitting false positives, or even that there are somequeries for which it can't work correctly. It would need to be seen howdifficult this is.


Here is how it would work.

We are looking for "orphaned FROM" clauses, or even "split FROMclauses". It's a graph problem (since SQL is an expression tree, allthese problems are just graph problems). Given a query like this:


select([a.something, b.something]).\
    select_from(b).join(c, a.c_id == c.id).\
    select_from(d).\
    where(
        and_(
             a.id == b.id,
             a.data == 10,
             d.data == 5,
             e.data = 10,
             d.e_data > some_func(e.thing)
        )
    )

we then create a graph of the selectables that come from our FROM list,
which are a, b.join(c), d, and e:

a --- b.join(c)   (joins on a.c_id == c.id)
a --- b.join(c)   (a.id == b.id)
d --- e           (d.e_data > some_func(e.thing))

then the graph shows that we have two disconnected graphs, the "abc"graph and the "de" graph; that fails the lint.


Another situation would be dealing with correlation:

subq = select([b.something]).where(b.id == a.id).correlate(a).as_scalar()

stmt = select([a.something, subq])

the graph for subq would be "a --- b", even though "a" is not in theFROM list, we know we are correlating to it at compile time so that isgiven as an adequate "link".


the graph for "stmt" would be just "a" because subq is not in the FROM list.

This algorithm can of course be written and used right now within a@compiles recipe as I stated before.

Keep in mind if you want to stick with the approach you have, you canstill just do that within @compiles. But this graph-linter could be aneat thing, e.g. a mini-query planner that runs on the SQL expressiontree. It could even look for Index() objects in the schema and try toanticipate table scans. Linters are popular, and this would likely bethe first of its kind. If it's really something people want to work on,it would likely be popular.

The compile hook looks great and I will take a look at the test failurescaused by reliance on these features in the ORM, if only to see whetherwe use or foresee using those features.
- Dave

On 6/24/2017 5:33 PM, Mike Bayer wrote:
On 06/24/2017 10:53 AM, Dave Vitek wrote:
Hi all,

I'll start with an example.  Assume A and B are Table objects:

 >>> print select([A.id], from_obj=[A], whereclause=(B.field == 123))
SELECT A.id FROM A, B WHERE B.id = 123
As a convenience, sqlalchemy has added B to the FROM clause on theassumption that I meant to do this.
However, in my experience, this behavior has hidden many bugs.
So yes, the implicit FROM behavior is for a very long time theultimate symptom of many other bugs, probably moreso in the ORM itselfwithin routines such as eager loading and joined table inheritance. Alot of ORM bugs manifest as this behavior. However, as often as thisis the clear sign of a bug, it is actually not a bug in a whole bunchof other cases when the "FROM comma list" is what's intended.
 It isn't
a feature we intentionally rely on. Usually, someone forgot a joinand the query sqlalchemy invents is drastically underconstrained.These bugs sometimes go unnoticed for a long time. I want to raisean exception when this happens instead of letting the computer take aguess at what the human meant.
So to continue from my previous statement that "comma in the FROMclause" is more often than not not what was intended, there is a largeclass of exceptions to this, which is all those times when this *is*what's intended.
Within the ORM, the most common place the FROM clause w/ comma is seenis in the lazy load with a "secondary" table. In modern versions,the two tables are added as explicit FROM, so your patch still allowsthis behavior to work, though in older versions it would have failed:
from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


atob = Table(
    'atob', Base.metadata,
    Column('aid', ForeignKey('a.id'), primary_key=True),
    Column('bid', ForeignKey('b.id'), primary_key=True)
)


class A(Base):
    __tablename__ = 'a'
    id = Column(Integer, primary_key=True)
    bs = relationship("B", secondary=atob)


class B(Base):
    __tablename__ = 'b'
    id = Column(Integer, primary_key=True)

e = create_engine("sqlite://", echo=True)
Base.metadata.create_all(e)
s = Session(e)
s.add(A(bs=[B(), B()]))
s.commit()
a1 = s.query(A).first()
print a1.bs


lazyload at the end is:

SELECT b.id AS b_id
FROM b, atob
WHERE ? = atob.aid AND b.id = atob.bid
So this works with your patch because that routine is using explicitselect_from(). However, there are a lot of other codepaths in the ORMthat still rely on implicit FROM and will fail with this change.
I have modified _get_display_froms from selectable.py to return fewerFROM entities.
I am pleased that you have the motivation to dig in and work on thiscode, and you've produced a quality patch that has taken the time tounderstand how this works. So I think we can work here to possiblycome up with a behavioral adjustment though it's not certain yet.
So far, this seems to be working.  Can you think of any
pitfalls with this change?
It's always a good idea to run the tests to see. TheREADME.unittests.rst file has lots of information on how to do thisbut for a deep change like this it's as easy as "python setup.pytest". There are lots of cases that break, many because you've askedthat the feature be "global opt in" and that isn't yet implemented,but also many that would need to be fixed even if the "global opt in"is enabled.
I've pushed up your patch to gerrit here:

https://gerrit.sqlalchemy.org/#/c/441/
The builds there are both the main SQLAlchemy test suite as well as aselection of test suites from the Openstack project, and in this caseboth fail:
https://jenkins.sqlalchemy.org/job/sqlalchemy_gerrit/962/

https://jenkins.sqlalchemy.org/job/openstack_gerrit/824/
(note that the above links are only good for a few days as the buildsget automatically truncated).
It's not surprising that the SQLAlchemy suite has failures, though ithas a lot, but even the Openstack suite has failures and that is muchmore surprising, because Openstack does not use the fancy mappingpatterns that we see failing in the main SQLAlchemy tests.
Within the SQLAlchemy suite, we have failures in the aaa_profilingsuite, which is testing that the number of Python calls in a select()is limited to a known value; for these particular tests the guidelinesare pretty strict, so the change adds a small performance overhead,for example:
Adjusted function call count 145 not within 5.0% of expected 157,platform 2.7_postgresql_psycopg2_dbapiunicode_cextensions
That's not a deal breaker as those tests are intended just to alertwhen the performance profile of something changes. Additionally,you've stated you would want this to be "opt-in", so one would beopting in to that small performance hit in any case.
The patch causes many failures just in the Core tutorial, meaningthere are a bunch of basic SELECT examples that have been there foryears that don't work under strict mode:
OperationalError: (sqlite3.OperationalError) no such column:users.fullname [SQL: u'SELECT users.fullname || ? ||addresses.email_address AS title \nWHERE users.id = addresses.user_idAND users.name BETWEEN ? AND ? AND (addresses.email_address LIKE ? ORaddresses.email_address LIKE ?)'] [parameters: (', ', 'm', 'z','%@aol.com', '%@msn.com')]
But those are end-user queries and if they had opted in to thealternate behavior, they'd not be writing them that way, OK.
Then, lots of ORM-inheritance kinds of tests, CTE tests, and othersbreak with this change, and in many cases, the Core / ORM itself isrelying on implicit FROM taking place in an internal sense, so theapplication that "globally opts in" here would still fail on all theseother cases, unless fixed. Many if not all of these breakages may befixable by reworking all the various logic to no longer make use ofthe implicit FROM logic, however that would be lots of complexity anddestabilization to make that happen.
One example is, using the Company/Person/Engineer set of models thatare similar to those introduced athttp://docs.sqlalchemy.org/en/latest/orm/inheritance.html#joined-table-inheritance,the following query fails:
    sess.query(Person)
                    .with_polymorphic(
                        Engineer,
                        people.outerjoin(engineers))
                    .order_by(Person.person_id)
The above query names only one entity, Person, and does not outwardlybreak any rules about adding additional FROM objects implicitly,however the logic that triggers when we go to load additional columnsthat have to do with Manager and Boss objects (another form of lazyload) renders an invalid SELECT:
OperationalError: (sqlite3.OperationalError) no such column:managers.manager_name [SQL: u'SELECT managers.manager_name ASmanagers_manager_name, managers.status AS managers_status,boss.golf_swing AS boss_golf_swing \nWHERE ? = managers.person_id ANDmanagers.person_id = boss.boss_id'] [parameters: (3,)]
The query it wants to emit is:
SELECT managers.status AS managers_status, boss.golf_swing ASboss_golf_swing, managers.manager_name AS managers_manager_name
FROM managers, boss
WHERE ? = managers.person_id AND managers.person_id = boss.boss_id
So fixing cases like that means a lot of new work and testing, not tomention that all of these cases must be tested in *two* entirelydifferent behavioral contracts, making this a much greater maintenanceand testing burden. Any time we fix a new bug in the area of SELECTrendering in Core or ORM, we will often have to write twice as manytests to make sure they work in "alternate FROM mode" as well.
Over in the Openstack suite, here's a Keystone test that fails:
keystone.tests.unit.test_backend_sql.SqlIdentity.test_list_users_with_unique_id_and_idp_id
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no suchcolumn: federated_user.user_id [SQL: u'SELECT user.enabled ASuser_enabled, user.id AS user_id, user.domain_id AS user_domain_id,user.extra AS user_extra, user.default_project_id ASuser_default_project_id, user.created_at AS user_created_at,user.last_active_at AS user_last_active_at \nFROM user LEFT OUTER JOINlocal_user ON user.id = local_user.user_id AND user.domain_id =local_user.domain_id \nWHERE user.id = federated_user.user_id ANDfederated_user.unique_id = ? AND federated_user.idp_id = ?'][parameters: ('9a5a4599df20426c9e9e6c86c8e36d3b', 'ORG_IDP')]
There's seven more in that variety in Keystone, looking at the querythey are testing it is:
  query = session.query(model.User).outerjoin(model.LocalUser)
  query = query.filter(model.User.id == model.FederatedUser.user_id)
  query = self._update_query_with_federated_statements(hints, query)
  user_refs = sql.filter_limit_query(model.User, query, hints)
  return [identity_base.filter_user(x.to_dict()) for x in user_refs]
We can guess that they are probably making use of the implicit FROMfeature within _update_query_with_federated_statements orfilter_limit_query(). Curious, I dug into Keystone to see what querythey intend to render; I thought hey, perhaps it would be funny ifthey actually were getting a cartesian product here and your featurehas "found" it. Although, if that were the case, already I thinkthis points more to the kind of feature I'd prefer, which is thatinstead of it just emitting the wrong SQL, it emits the SQL exactly asit does now but emits a warning; that way, everyone gets to know abouttheir poor FROM listing and exactly why it is occurring, withouthaving to "globally opt in" to anything:
SELECT user.enabled AS user_enabled, user.id AS user_id,user.domain_id AS user_domain_id, user.extra AS user_extra,user.default_project_id AS user_default_project_id, user.created_at ASuser_created_at, user.last_active_at AS user_last_active_atFROM federated_user, user LEFT OUTER JOIN local_user ON user.id =local_user.user_id AND user.domain_id = local_user.domain_idWHERE user.id = federated_user.user_id AND federated_user.unique_id =? AND federated_user.idp_id = ?
So above, we can see they are in fact using the dreaded comma in theFROM and even in an unusual way, "FROM table1, table2 JOIN table3"which is very odd for SQL, however they are setting up the join fromfederated_user to the user table properly, and they know what they aredoing.
So let's think of "global opt in" from that perspective. Openstackapplication are enormous, sprawling apps that are developed andmaintained by dozens of people who often haven't met each other. Byrequiring that the feature is only available as "global opt in",Keystone in order to use this feature would need to do the same workwe did earlier; that they'd need to locate these queries and add inthe explicit FROM clause, or otherwise alter their logic that istaking advantage of the implicit FROM to simplify their code.Additionally, developers who work on Keystone would observe that forsome reason they don't understand, SQLAlchemy now behaves completelydifferently than it does when they work on another Openstack app likeNova that does not use this feature.
What would more likely happen is that Keystone just wouldn't make thischange; "global opt in" is too coarse grained to make migrationstowards using this feature simple and risk-free. That would lead toa balkanized chunk of very intricate code in the middle of Core thatdoesn't get used often, which is a recipe for heavy maintenanceburdens, when the behavior of FROM logic itself needs to be maintainedand now it must be maintained across two very different behavioralcontracts, one of which is almost never used.
The original use case you had is, "the current behavior leads tosubtle bugs". Why don't we re-think how to fix that directly, which Iwould propose is most easily grantable to the entire SQLAlchemyecosystem at once by using a warning.
The warning would be, if someone does this:

stmt = select([t1.c.a]).select_from(t1).where(t2.c.b == 5)
that is, they put explicit FROM but then suggested a second implicitFROM, and therefore are mixing the approaches, would emit a warning,and that's it.
For your case, you'd see these warnings in your logs, and I wouldargue this is sufficient for you to prevent the thing you're concernedabout,
which is the mixing of explicit / implicit FROM.
But I'm not even sure I'm comfortable with a warning here because aswe see in Keystone, they are using this feature successfully. Theyshouldn't have to see a warning if they like how it works now; thereis of course the warnings filter but I've observed people are rightlystill annoyed about warnings that are perceived as undeserved.
So far I've spent a lot of time showing my thought process in greatdetail here because I hope to illustrate that yes, the FROM comma-listthing is a problem, but the use is too much a core of SQLAlchemy'sbeing for it to really be changed, noting that I'd love to find a wayto improve this situation. An alternative use contract isdefinitely an option, but as given as a patch to Core I fear thisproduces a so-called "balkanized block" that greatly adds tomaintenance burden. I've had to remove lots of "pet features" over theyears that I was too liberal in accepting up front. Usually, if wecan really detect a pattern that "smells" of an error, and there's away to explicitly prevent it, we can make a warning in that case, buteven in that case I'm not sure people would appreciate being pushedaway from "implicit from"; it's a handy feature when used correctlyand simplifies code.
A more elaborate detection would be, look for implicit FROM, then lookinside the WHERE clause and detect that in fact the multiple FROMentries aren't being related to each other. That logic would be verycomplicated as well as a performance hit, and would often beincorrect. We would essentially be reinventing a small fragment ofwhat the database's query planner already does.
We can instead just ask the database using EXPLAIN to analyze thequery, or even just turn on slow query logging. This is pretty muchthe answer I'd normally have given in this area in general; yourdevelopers may write an inefficient SQL statement whether or notSQLAlchemy makes one particular class of them more likely, and toreally look for errors of this class, you should be looking at SQLlogs, running EXPLAIN on queries you see, and setting up slow querylogging on your development / CI servers. There are otherquery-planner error cases besides this one, such as missing indexes orother table-scan scenarios, gratuitous use of OFFSET, etc. There'slots of things that can make a query slow. Generalized safeguards inCI would guard against this one very specific class of error in any case.
After all that, what I can suggest for you is that since you arelooking to "raise an error", that is actually easy to do here usingevents or compiles. All you need to do is compare the calculatedFROM list to the explicit FROM list:
from sqlalchemy import table, column, select
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql import Select
from sqlalchemy import exc


@compiles(Select)
def _strict_froms(stmt, compiler, **kw):
    actual_froms = stmt.froms

    # TODO: tune this as desired
    if stmt._from_obj:
        # TODO: we can give you non-underscored access to
        # _from_obj
        legal_froms = stmt._from_obj
    else:
        legal_froms = set([
            col.table for col in stmt.inner_columns
        ])
    if len(legal_froms) != len(actual_froms):
        raise exc.CompileError("SELECT has implicit froms")

    return compiler.visit_select(stmt, **kw)


t1 = table('t1', column('a'))
t2 = table('t2', column('b'))


# works
print select([t1]).where(t2.c.b == 5).select_from(t1).select_from(t2)

# raises
print select([t1]).where(t2.c.b == 5)
The issue of "FROM with comma" *is* a problem I'd like to continue toexplore solutions towards. I've long considered various rules in_display_froms but none have ever gotten through the various issuesI've detailed above. I hope this helps to clarify the complexityand depth of this situation.
I've attached a copy of the patch in case you want to take a look. Ifthere is interest in merging the change, it would need to be madeopt-in (how is this best done globally--not in a per-query fashion?).
- Dave

--

SQLAlchemy -The Python SQL Toolkit and Object Relational Mapper


http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.

---You received this message because you are subscribed to the Google Groups "sqlalchemy" group.

To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.

Re: [sqlalchemy] require explicit FROMs

Reply via email to