On 06/25/2017 12:15 PM, Dave Vitek wrote:
Mike,

Thanks for the thorough feedback. It isn't surprising that there's a lot of code out there relying on the current behavior, but internal sqlalchemy reliance is certainly problematic. It sounds difficult to walk back the existing behavior at this point and I understand the apprehension about even adding a warning at this point.


The internals by and large don't rely on implicit FROM that much in any case, since the internals need to do very complex transformations on SQL fragments and queries that are completely arbitrary, and the FROM adds information to the query that is otherwise often lost during these transformations. There are of course places where implicit FROM does come into play, and these areas currently work without issue. But overall these use cases are not at all like the end-user use case I think you're referring towards.

Implicit FROM is not a feature that works incorrectly, it's a feature that works perfectly, but delays the symptoms of other particular bugs from appearing as quickly as we'd like. I'd like to keep the focus here on solving the problem of inadvertent bugs, and not on removal of the feature itself which is extremely useful, intuitive, and widely relied upon.



Perhaps it could be an opt-in warning for projects that subscribe to the "explicit from" requirement. People who contribute to multiple projects might still be annoyed that sqlalchemy complains differently from one project to the next, but I think there's plenty of precedent for projects having rules about what subset of library and language features they allow. Once a project has been bitten by this sort of bug one too many times, they might look into enabling the warning. However, in my experience warnings are ignored unless they trigger test failures.

I actually have not had much in the way of complaints about this issue occurring in the wild. I see it a lot with the ORM features described above but that's due to internal bugs within intricate systems that I fix myself. It's actually a great idea to set the warnings filter so that this does occur. py.test is aggressively moving towards making warnings visible and making it easy to raise them also.

Furthermore, I am worried in part about these bugs being
security bugs, in which case I might prefer an exception even in a production setting. Then again, I might not, since most of our implicit FROMs that weren't recently introduced were producing intended behavior.

With respect to detecting implicit FROMs by noticing slow queries or unexpected behavior, that works OK with some queries, but it still takes significantly more human time than detecting things more explicitly and earlier in the pipeline.

There are cases where an under-constrained query runs faster, and fairly specific inputs would be required for a human to notice the bad behavior. Specifically, I think under-constrained EXISTS clauses frequently go undetected for some time. Some query that is supposed to check whether a particular row in a table has some relationship ends up checking whether any row in the table has some relationship. To make matters worse, it isn't entirely unusual for these two behaviors to be indistinguishable for typical data sets (perhaps because all the rows behave the same, or the tables are frequently singletons). In my worst nightmare, the intent might be to check if Alice has permission to do X to Y, but the query ends up asking whether Alice has permission to do X to anything, or whether anyone has permission to do X to Y.

There's lots of ways to write SQL queries that don't SELECT from the correct rows, or do get the correct rows but perform poorly (the most common is that indexes are missed), or a combination of both. Overall, you can't get away with only checking early or only checking late in the pipeline, there need to be good unit tests that cover every query used against decent test data, and there needs to be continuous analysis of performance and SQL emitted in the running application. I know people are doing this because that's how they alert me when the ORM is mis-interpreting a query.

Back to the issue at hand, I still maintain that providing a "works totally differently" mode is a recipe for headaches and complexity for everyone, adding more codepaths that add to the long term testing burden, diluting the usefulness of "implicit FROM" rather than fixing it, confusing users with a choice they shouldn't have to make, and at the end of the day providing "detects cartesian products" logic to only a subset of applications that opt in to this alternate mode and write their code in a totally new way.

Alternatively, logic that does do something like what I mentioned in my other email, that is a "linter" that can scan a select() for cartesian products would have the advantages of: 1. it can be "opt in" per environment, e.g. development and not production, so that any codebase can use it 2. it will find errors of the "forgot to add a JOIN" variety no matter how it occurred, whether it be user error at the Core or ORM level or if an ORM internal produced it 3. it would have no impact on existing applications or internals as it would be a self-contained routine that only runs if enabled. 4. it can be released as something "experimental" so that people can just play with it for a year or so having it only emit harmless warnings by default when turned on, until we find all the problems.

One disadvantage is that it adds a performance hit. This is mitigated by the fact that it is optional and can be enabled only in development / tests. It also would be part of the statement compilation step so would be part of what we cache when statement caching is used, which in 1.2 is used more prevalently than before within the internals and is also used by end-user code now in the form of the BakedQuery system.

Another disadvantage is that it is reinventing part of what the query planner is sort of doing already, but as you mentioned, it's more of a direct structural problem that is not as directly indicated within an EXPLAIN, and at least moves up the detection of the issue.

Finally, it may be problematic that it is not always correct, more in the case of emitting false positives, or even that there are some queries for which it can't work correctly. It would need to be seen how difficult this is.

Here is how it would work.

We are looking for "orphaned FROM" clauses, or even "split FROM clauses". It's a graph problem (since SQL is an expression tree, all these problems are just graph problems). Given a query like this:

select([a.something, b.something]).\
    select_from(b).join(c, a.c_id == c.id).\
    select_from(d).\
    where(
        and_(
             a.id == b.id,
             a.data == 10,
             d.data == 5,
             e.data = 10,
             d.e_data > some_func(e.thing)
        )
    )

we then create a graph of the selectables that come from our FROM list,
which are a, b.join(c), d, and e:

a --- b.join(c)   (joins on a.c_id == c.id)
a --- b.join(c)   (a.id == b.id)
d --- e           (d.e_data > some_func(e.thing))

then the graph shows that we have two disconnected graphs, the "abc" graph and the "de" graph; that fails the lint.

Another situation would be dealing with correlation:

subq = select([b.something]).where(b.id == a.id).correlate(a).as_scalar()

stmt = select([a.something, subq])

the graph for subq would be "a --- b", even though "a" is not in the FROM list, we know we are correlating to it at compile time so that is given as an adequate "link".

the graph for "stmt" would be just "a" because subq is not in the FROM list.

This algorithm can of course be written and used right now within a @compiles recipe as I stated before.

Keep in mind if you want to stick with the approach you have, you can still just do that within @compiles. But this graph-linter could be a neat thing, e.g. a mini-query planner that runs on the SQL expression tree. It could even look for Index() objects in the schema and try to anticipate table scans. Linters are popular, and this would likely be the first of its kind. If it's really something people want to work on, it would likely be popular.




The compile hook looks great and I will take a look at the test failures caused by reliance on these features in the ORM, if only to see whether we use or foresee using those features.

- Dave

On 6/24/2017 5:33 PM, Mike Bayer wrote:


On 06/24/2017 10:53 AM, Dave Vitek wrote:
Hi all,

I'll start with an example.  Assume A and B are Table objects:

 >>> print select([A.id], from_obj=[A], whereclause=(B.field == 123))
SELECT A.id FROM A, B WHERE B.id = 123

As a convenience, sqlalchemy has added B to the FROM clause on the assumption that I meant to do this.

However, in my experience, this behavior has hidden many bugs.

So yes, the implicit FROM behavior is for a very long time the ultimate symptom of many other bugs, probably moreso in the ORM itself within routines such as eager loading and joined table inheritance. A lot of ORM bugs manifest as this behavior. However, as often as this is the clear sign of a bug, it is actually not a bug in a whole bunch of other cases when the "FROM comma list" is what's intended.

 It isn't
a feature we intentionally rely on. Usually, someone forgot a join and the query sqlalchemy invents is drastically underconstrained. These bugs sometimes go unnoticed for a long time. I want to raise an exception when this happens instead of letting the computer take a guess at what the human meant.

So to continue from my previous statement that "comma in the FROM clause" is more often than not not what was intended, there is a large class of exceptions to this, which is all those times when this *is* what's intended.

Within the ORM, the most common place the FROM clause w/ comma is seen is in the lazy load with a "secondary" table. In modern versions, the two tables are added as explicit FROM, so your patch still allows this behavior to work, though in older versions it would have failed:

from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


atob = Table(
    'atob', Base.metadata,
    Column('aid', ForeignKey('a.id'), primary_key=True),
    Column('bid', ForeignKey('b.id'), primary_key=True)
)


class A(Base):
    __tablename__ = 'a'
    id = Column(Integer, primary_key=True)
    bs = relationship("B", secondary=atob)


class B(Base):
    __tablename__ = 'b'
    id = Column(Integer, primary_key=True)

e = create_engine("sqlite://", echo=True)
Base.metadata.create_all(e)
s = Session(e)
s.add(A(bs=[B(), B()]))
s.commit()
a1 = s.query(A).first()
print a1.bs


lazyload at the end is:

SELECT b.id AS b_id
FROM b, atob
WHERE ? = atob.aid AND b.id = atob.bid

So this works with your patch because that routine is using explicit select_from(). However, there are a lot of other codepaths in the ORM that still rely on implicit FROM and will fail with this change.


I have modified _get_display_froms from selectable.py to return fewer FROM entities.

I am pleased that you have the motivation to dig in and work on this code, and you've produced a quality patch that has taken the time to understand how this works. So I think we can work here to possibly come up with a behavioral adjustment though it's not certain yet.


So far, this seems to be working.  Can you think of any
pitfalls with this change?

It's always a good idea to run the tests to see. The README.unittests.rst file has lots of information on how to do this but for a deep change like this it's as easy as "python setup.py test". There are lots of cases that break, many because you've asked that the feature be "global opt in" and that isn't yet implemented, but also many that would need to be fixed even if the "global opt in" is enabled.

I've pushed up your patch to gerrit here:

https://gerrit.sqlalchemy.org/#/c/441/

The builds there are both the main SQLAlchemy test suite as well as a selection of test suites from the Openstack project, and in this case both fail:

https://jenkins.sqlalchemy.org/job/sqlalchemy_gerrit/962/

https://jenkins.sqlalchemy.org/job/openstack_gerrit/824/

(note that the above links are only good for a few days as the builds get automatically truncated).

It's not surprising that the SQLAlchemy suite has failures, though it has a lot, but even the Openstack suite has failures and that is much more surprising, because Openstack does not use the fancy mapping patterns that we see failing in the main SQLAlchemy tests.

Within the SQLAlchemy suite, we have failures in the aaa_profiling suite, which is testing that the number of Python calls in a select() is limited to a known value; for these particular tests the guidelines are pretty strict, so the change adds a small performance overhead, for example:

Adjusted function call count 145 not within 5.0% of expected 157, platform 2.7_postgresql_psycopg2_dbapiunicode_cextensions

That's not a deal breaker as those tests are intended just to alert when the performance profile of something changes. Additionally, you've stated you would want this to be "opt-in", so one would be opting in to that small performance hit in any case.

The patch causes many failures just in the Core tutorial, meaning there are a bunch of basic SELECT examples that have been there for years that don't work under strict mode:

OperationalError: (sqlite3.OperationalError) no such column: users.fullname [SQL: u'SELECT users.fullname || ? || addresses.email_address AS title \nWHERE users.id = addresses.user_id AND users.name BETWEEN ? AND ? AND (addresses.email_address LIKE ? OR addresses.email_address LIKE ?)'] [parameters: (', ', 'm', 'z', '%@aol.com', '%@msn.com')]

But those are end-user queries and if they had opted in to the alternate behavior, they'd not be writing them that way, OK.

Then, lots of ORM-inheritance kinds of tests, CTE tests, and others break with this change, and in many cases, the Core / ORM itself is relying on implicit FROM taking place in an internal sense, so the application that "globally opts in" here would still fail on all these other cases, unless fixed. Many if not all of these breakages may be fixable by reworking all the various logic to no longer make use of the implicit FROM logic, however that would be lots of complexity and destabilization to make that happen.

One example is, using the Company/Person/Engineer set of models that are similar to those introduced at http://docs.sqlalchemy.org/en/latest/orm/inheritance.html#joined-table-inheritance, the following query fails:

    sess.query(Person)
                    .with_polymorphic(
                        Engineer,
                        people.outerjoin(engineers))
                    .order_by(Person.person_id)

The above query names only one entity, Person, and does not outwardly break any rules about adding additional FROM objects implicitly, however the logic that triggers when we go to load additional columns that have to do with Manager and Boss objects (another form of lazy load) renders an invalid SELECT:

OperationalError: (sqlite3.OperationalError) no such column: managers.manager_name [SQL: u'SELECT managers.manager_name AS managers_manager_name, managers.status AS managers_status, boss.golf_swing AS boss_golf_swing \nWHERE ? = managers.person_id AND managers.person_id = boss.boss_id'] [parameters: (3,)]

The query it wants to emit is:

SELECT managers.status AS managers_status, boss.golf_swing AS boss_golf_swing, managers.manager_name AS managers_manager_name
FROM managers, boss
WHERE ? = managers.person_id AND managers.person_id = boss.boss_id

So fixing cases like that means a lot of new work and testing, not to mention that all of these cases must be tested in *two* entirely different behavioral contracts, making this a much greater maintenance and testing burden. Any time we fix a new bug in the area of SELECT rendering in Core or ORM, we will often have to write twice as many tests to make sure they work in "alternate FROM mode" as well.

Over in the Openstack suite, here's a Keystone test that fails:

keystone.tests.unit.test_backend_sql.SqlIdentity.test_list_users_with_unique_id_and_idp_id

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such column: federated_user.user_id [SQL: u'SELECT user.enabled AS user_enabled, user.id AS user_id, user.domain_id AS user_domain_id, user.extra AS user_extra, user.default_project_id AS user_default_project_id, user.created_at AS user_created_at, user.last_active_at AS user_last_active_at \nFROM user LEFT OUTER JOIN local_user ON user.id = local_user.user_id AND user.domain_id = local_user.domain_id \nWHERE user.id = federated_user.user_id AND federated_user.unique_id = ? AND federated_user.idp_id = ?'] [parameters: ('9a5a4599df20426c9e9e6c86c8e36d3b', 'ORG_IDP')]

There's seven more in that variety in Keystone, looking at the query they are testing it is:

  query = session.query(model.User).outerjoin(model.LocalUser)
  query = query.filter(model.User.id == model.FederatedUser.user_id)
  query = self._update_query_with_federated_statements(hints, query)
  user_refs = sql.filter_limit_query(model.User, query, hints)
  return [identity_base.filter_user(x.to_dict()) for x in user_refs]

We can guess that they are probably making use of the implicit FROM feature within _update_query_with_federated_statements or filter_limit_query(). Curious, I dug into Keystone to see what query they intend to render; I thought hey, perhaps it would be funny if they actually were getting a cartesian product here and your feature has "found" it. Although, if that were the case, already I think this points more to the kind of feature I'd prefer, which is that instead of it just emitting the wrong SQL, it emits the SQL exactly as it does now but emits a warning; that way, everyone gets to know about their poor FROM listing and exactly why it is occurring, without having to "globally opt in" to anything:

SELECT user.enabled AS user_enabled, user.id AS user_id, user.domain_id AS user_domain_id, user.extra AS user_extra, user.default_project_id AS user_default_project_id, user.created_at AS user_created_at, user.last_active_at AS user_last_active_at FROM federated_user, user LEFT OUTER JOIN local_user ON user.id = local_user.user_id AND user.domain_id = local_user.domain_id WHERE user.id = federated_user.user_id AND federated_user.unique_id = ? AND federated_user.idp_id = ?

So above, we can see they are in fact using the dreaded comma in the FROM and even in an unusual way, "FROM table1, table2 JOIN table3" which is very odd for SQL, however they are setting up the join from federated_user to the user table properly, and they know what they are doing.

So let's think of "global opt in" from that perspective. Openstack application are enormous, sprawling apps that are developed and maintained by dozens of people who often haven't met each other. By requiring that the feature is only available as "global opt in", Keystone in order to use this feature would need to do the same work we did earlier; that they'd need to locate these queries and add in the explicit FROM clause, or otherwise alter their logic that is taking advantage of the implicit FROM to simplify their code. Additionally, developers who work on Keystone would observe that for some reason they don't understand, SQLAlchemy now behaves completely differently than it does when they work on another Openstack app like Nova that does not use this feature.

What would more likely happen is that Keystone just wouldn't make this change; "global opt in" is too coarse grained to make migrations towards using this feature simple and risk-free. That would lead to a balkanized chunk of very intricate code in the middle of Core that doesn't get used often, which is a recipe for heavy maintenance burdens, when the behavior of FROM logic itself needs to be maintained and now it must be maintained across two very different behavioral contracts, one of which is almost never used.

The original use case you had is, "the current behavior leads to subtle bugs". Why don't we re-think how to fix that directly, which I would propose is most easily grantable to the entire SQLAlchemy ecosystem at once by using a warning.

The warning would be, if someone does this:

stmt = select([t1.c.a]).select_from(t1).where(t2.c.b == 5)

that is, they put explicit FROM but then suggested a second implicit FROM, and therefore are mixing the approaches, would emit a warning, and that's it.

For your case, you'd see these warnings in your logs, and I would argue this is sufficient for you to prevent the thing you're concerned about,
which is the mixing of explicit / implicit FROM.

But I'm not even sure I'm comfortable with a warning here because as we see in Keystone, they are using this feature successfully. They shouldn't have to see a warning if they like how it works now; there is of course the warnings filter but I've observed people are rightly still annoyed about warnings that are perceived as undeserved.

So far I've spent a lot of time showing my thought process in great detail here because I hope to illustrate that yes, the FROM comma-list thing is a problem, but the use is too much a core of SQLAlchemy's being for it to really be changed, noting that I'd love to find a way to improve this situation. An alternative use contract is definitely an option, but as given as a patch to Core I fear this produces a so-called "balkanized block" that greatly adds to maintenance burden. I've had to remove lots of "pet features" over the years that I was too liberal in accepting up front. Usually, if we can really detect a pattern that "smells" of an error, and there's a way to explicitly prevent it, we can make a warning in that case, but even in that case I'm not sure people would appreciate being pushed away from "implicit from"; it's a handy feature when used correctly and simplifies code.

A more elaborate detection would be, look for implicit FROM, then look inside the WHERE clause and detect that in fact the multiple FROM entries aren't being related to each other. That logic would be very complicated as well as a performance hit, and would often be incorrect. We would essentially be reinventing a small fragment of what the database's query planner already does.

We can instead just ask the database using EXPLAIN to analyze the query, or even just turn on slow query logging. This is pretty much the answer I'd normally have given in this area in general; your developers may write an inefficient SQL statement whether or not SQLAlchemy makes one particular class of them more likely, and to really look for errors of this class, you should be looking at SQL logs, running EXPLAIN on queries you see, and setting up slow query logging on your development / CI servers. There are other query-planner error cases besides this one, such as missing indexes or other table-scan scenarios, gratuitous use of OFFSET, etc. There's lots of things that can make a query slow. Generalized safeguards in CI would guard against this one very specific class of error in any case.

After all that, what I can suggest for you is that since you are looking to "raise an error", that is actually easy to do here using events or compiles. All you need to do is compare the calculated FROM list to the explicit FROM list:

from sqlalchemy import table, column, select
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql import Select
from sqlalchemy import exc


@compiles(Select)
def _strict_froms(stmt, compiler, **kw):
    actual_froms = stmt.froms

    # TODO: tune this as desired
    if stmt._from_obj:
        # TODO: we can give you non-underscored access to
        # _from_obj
        legal_froms = stmt._from_obj
    else:
        legal_froms = set([
            col.table for col in stmt.inner_columns
        ])
    if len(legal_froms) != len(actual_froms):
        raise exc.CompileError("SELECT has implicit froms")

    return compiler.visit_select(stmt, **kw)


t1 = table('t1', column('a'))
t2 = table('t2', column('b'))


# works
print select([t1]).where(t2.c.b == 5).select_from(t1).select_from(t2)

# raises
print select([t1]).where(t2.c.b == 5)

The issue of "FROM with comma" *is* a problem I'd like to continue to explore solutions towards. I've long considered various rules in _display_froms but none have ever gotten through the various issues I've detailed above. I hope this helps to clarify the complexity and depth of this situation.





I've attached a copy of the patch in case you want to take a look. If there is interest in merging the change, it would need to be made opt-in (how is this best done globally--not in a per-query fashion?).

- Dave





--
SQLAlchemy - The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.

Reply via email to