Re: [sqlalchemy] Disable `session_identity_map` for `_instance_processor`

Mike Bayer Wed, 22 Nov 2017 14:57:45 -0800

I've added an issue to track this feature targeted for the moment at
1.3 (note that's for many months from now), however 1.3 has a lot in
it right now:


https://bitbucket.org/zzzeek/sqlalchemy/issues/4137/sharding-token-extension-in-identity-key



On Wed, Nov 22, 2017 at 4:20 PM, Mike Bayer <mike...@zzzcomputing.com> wrote:
> On Wed, Nov 22, 2017 at 4:56 AM, Антонио Антуан <a.ch....@gmail.com> wrote:
>> Glad to see that you remember my messages :)
>>
>> I've dived into `loading` module and I see that currently it is really
>> complicated to store additional data into pkey for each instance.
>>
>> Unfortunately, suggested solutions not good for my project.
>> Also, I think that `shard` meaning in my case is not the same as usual.
>>
>> I want to describe structure of out project, maybe it can help.
>>
>> Here is definition of our databases structure:
>> http://joxi.ru/nAyJVvGiXMv0Dr.
>> We got master db and several geo databases. Catalogs like `users`, `groups`,
>> `offers` and other are replicating to geo databases, so that data is always
>> the same.
>> But also we have tables like `clicks` and `leads`. Each app instance
>> contains the data about them in database, related to its geo:
>> europe-instance into europe-db, usa-instance into usa-database and so on.
>> Periodically master-app pulls clicks and leads to master-database. Synced
>> objects always have different ids into master- and get-db, so it is ok.
>>
>> But one time project owner came and said: "I need SAAS".
>> We see, that in current structure it's very hard (and really ugly) to
>> implement saas-solution. Amount of `Base*`, `Session*`, `Order*` and other
>> models will be multiplied with tenants amount.
>>
>> I discovered that I can override `get_bind` with another logic and it was
>> great: we can remove several `Base` classes, several Sessions and several
>> `Orders`.
>>
>> Mechanism looks like this:
>> - we use one instance on each geo for all tenants.
>> - we create separated databases for each tenant: this will be multiplied
>> with tenants amount: http://joxi.ru/nAyJVvGiXMv0Dr.
>> - we detect `tenant_id` using `request.host` (we use flask): each domain
>> binds with particular tenant;
>> - we store `tenant_id` into global storage.
>> - we use stored `tenant_id` into `Session.get_bind`:
>> https://gist.github.com/aCLr/9f578a2eeb225e7d65099ffb49aa8f3a
>> - into flask `teardown_request` we clear `tenant_id` storage and call
>> `Session.remove()`
>> - if we need to read from another get, just write `query =
>> query.set_shard_id(GEO)` `tenant_id`
>>
>> For celery we use this:
>> https://gist.github.com/aCLr/d8c5ac38956947da092375b2f89d7b50
>> Clear to.
>>
>> All this leads us only to pros, without cons: any developer has no need to
>> think about database chosing, just write code like there is only one
>> database. If you need to read from another geo-database, just call
>> `query.set_shard(GEO)`, tenant will be appended automatically to it.
>>
>> Problems begin when we tried to test non-flask and non-celery scripts, like
>> cron tasks: we may want to query several tenant-databases during one
>> SQLA-transaction, somethins like in my first example:
>> https://gist.github.com/aCLr/ff9462b634031ee6bccbead8d913c41f
>> (`assert_got_correct_objects_with_remove` and
>> `assert_got_cached_objects_without_remove`). The result you know.
>>
>>
>> During writing this message, I found out, that we need only one additional
>> data for primary key: `connection.bind.url`. I see, that SQLA already have
>> it inside `_instance_processor`, so it exists inside `_instance`. I think,
>> that `identity_key` should be constructed (in my case) with this code:
>> https://gist.github.com/aCLr/746f92dedb4d303a49033c0db22beced. Clear, don't
>> you think so?
>
> that's where something needs to happen but SQLAlchemy can't do this in
> such a way that is hardcoded to exactly your particular use case.   A
> Query can have lots of entities in it, and if you're doing sharding a
> single result set can refer to any number of shard identifiers within
> not just a single result set but within a single row; they might have
> come from dozens of different databases at once.  This could only be
> suited with a very open plugin point that is carefully architected,
> tested, and documented and I don't have the resources to envision this
> for a short-term use case.
>
>> Problems begin when we tried to test non-flask and non-celery scripts, like
>> cron tasks: we may want to query several tenant-databases during one
>> SQLA-transaction,
>
> I don't understand why you can't use independent sessions for this,
> because when you query two different databases, you are using two
> independent transactions in any case;  they are only coordinated if
> one is using two-phase transactions which I doubt is the case here
> (while SQLAlchemy put a lot of work into making that possible, I don't
> think anyone has ever used that feature).
>
> Instead of:
>
> Session.query(Foo).set_shard(id)
>
> you say:
>
> sessions.get(shard_id).query(Foo)
>
>
>
>
>
>>
>>
>>
>> вт, 21 нояб. 2017 г. в 19:15, Mike Bayer <mike...@zzzcomputing.com>:
>>>
>>> I've looked to see how hard it would be to allow "supplemental"
>>> attributes to form part of the mapper's primary key tuple, and it
>>> would be pretty hard.   The "easy" part is getting the mapper to set
>>> itself up with some extra attributes that can deliver some kind of
>>> supplemental value to the identity key.  the harder part is then in
>>> loading.py where we get new rows from the DB and need this
>>> value...which means some whole new kind of system would need to
>>> deliver this for any arbitrary part of the result set given a mapping
>>> and the selectable we're looking at (keep in mind a Query can have
>>> lots of the same mapping in a single row with aliases).   This would
>>> be very complicated to implement and test.   I am not seeing any quick
>>> way to suit this use case, which has also not ever been requested
>>> before.
>>>
>>>
>>>
>>> On Tue, Nov 21, 2017 at 10:12 AM, Mike Bayer <mike...@zzzcomputing.com>
>>> wrote:
>>> > On Tue, Nov 21, 2017 at 7:39 AM, Антонио Антуан <a.ch....@gmail.com>
>>> > wrote:
>>> >> Hi guys.
>>> >>
>>> >> I got this code example:
>>> >> https://gist.github.com/aCLr/ff9462b634031ee6bccbead8d913c41f.
>>> >>
>>> >> Here I make custom `Session` and custom `Query`. As you see, `Session`
>>> >> has
>>> >> several binds.
>>> >>
>>> >> Also, you can see that there are two functions:
>>> >> `assert_got_correct_objects_with_remove` and
>>> >> `assert_got_cached_objects_without_remove`.
>>> >>
>>> >> The first checks that we got correct results if `Session.remove`
>>> >> called.
>>> >> The second checks, that we got incorrect results if `Session.remove`
>>> >> not
>>> >> called.
>>> >>
>>> >> I understand, that behavior is correct: we don't remove session - so,
>>> >> we got
>>> >> same result from "cache-like"
>>> >> `sqlalchemy.orm.loading._instance_processor.session_identity_map`.
>>> >>
>>> >> I want to avoid that mechanism and don't want to use
>>> >> `session_identity_map`
>>> >> for different binds. In ideal, bind should be used as part of key for
>>> >> `session_identity_map`, but I guess, that it is not possible.
>>> >> Another way, acceptable for me: disable this mechanism. But I do not
>>> >> found
>>> >> ways to achieve this.
>>> >> And the third option: construct instances manually. Looks like I should
>>> >> copy
>>> >> code from `loading` module and add that method to `CustomSession`:
>>> >
>>> >
>>> > there's really no reason at all to use a "ShardedSession" if you have
>>> > overlapping primary key spaces from each of your binds.   I'm not sure
>>> > if I mentioned this at the beginning of the emails regarding this
>>> > project but I hope that I mentioned just using separate Session
>>> > objects is vastly simpler for non-intricate sharding cases, such as
>>> > where you always know which shard you care about and you don't care
>>> > about any of the others for a certain operation.     The point of
>>> > ShardedSession is so that objects pulled from multiple databases can
>>> > be intermingled in the same query and in the same transaction - which
>>> > by definition means they have unique primary keys.   If that's not
>>> > what you're doing here I don't see what advantage all this complexity
>>> > is getting you.
>>> >
>>> > If you're still convinced you need to be using a monolithic
>>> > ShardedSession then there needs to be some kind of translation of data
>>> > such that the mapper sees unique primary keys across the shards, or
>>> > unique classes.
>>> >
>>> > I've tried to think of ways to do this without too much difficulty but
>>> > none of them are really worth the complexity and hackiness it would
>>> > require.   The absolutely quickest and most well-supported, no hacks
>>> > required way would be to properly create a composite primary key on
>>> > your classes, where the second column is your shard id:
>>> >
>>> > class A(Base):
>>> >     __tablename__ = 'a'
>>> >
>>> >     id = Column(Integer, primary_key=True)
>>> >     shard_id = Column(Integer, primary_key=True)
>>> >
>>> > I tried to see if the "shard_id" column can be some kind of expression
>>> > that is not a Column on the Table but the mapper() is not set up to
>>> > support this unless you mapped the whole class to a select()
>>> > construct, which would make for too-complicated SQL, and you'd still
>>> > need to intercept this select() using events to put the right shard id
>>> > in.  Another is to create a custom column that renders in a special
>>> > way, but again you need to create events to intercept it in every case
>>> > to put the right shard id in, and/or remove it from things like
>>> > insert() statements.
>>> >
>>> > by far your two best solutions are: 1. use separate Session objects
>>> > per shard  2. make sure your data actually has shard-specific primary
>>> > keys
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >>
>>> >> def instances(self, cursor, __context=None):
>>> >>     context = __context
>>> >>     if context is None:
>>> >>         context = QueryContext(self)
>>> >>     return self._custom_instances(self, cursor, context)
>>> >>
>>> >>
>>> >>
>>> >> def custom_instances(query, cursor, context):
>>> >>      """copied from `loading.instances` code with disabled
>>> >> `session_identity_map`"""
>>> >>
>>> >>
>>> >>
>>> >> The third way is the most ugly and I want to avoid it.
>>> >>
>>> >> Could you help me with my hard choice and, maybe, suggest any other
>>> >> ways and
>>> >> options? :)
>>> >>
>>> >> Thank you.
>>> >>
>>> >> --
>>> >> SQLAlchemy -
>>> >> The Python SQL Toolkit and Object Relational Mapper
>>> >>
>>> >> http://www.sqlalchemy.org/
>>> >>
>>> >> To post example code, please provide an MCVE: Minimal, Complete, and
>>> >> Verifiable Example. See http://stackoverflow.com/help/mcve for a full
>>> >> description.
>>> >> ---
>>> >> You received this message because you are subscribed to the Google
>>> >> Groups
>>> >> "sqlalchemy" group.
>>> >> To unsubscribe from this group and stop receiving emails from it, send
>>> >> an
>>> >> email to sqlalchemy+unsubscr...@googlegroups.com.
>>> >> To post to this group, send email to sqlalchemy@googlegroups.com.
>>> >> Visit this group at https://groups.google.com/group/sqlalchemy.
>>> >> For more options, visit https://groups.google.com/d/optout.
>>>
>>> --
>>> SQLAlchemy -
>>> The Python SQL Toolkit and Object Relational Mapper
>>>
>>> http://www.sqlalchemy.org/
>>>
>>> To post example code, please provide an MCVE: Minimal, Complete, and
>>> Verifiable Example.  See  http://stackoverflow.com/help/mcve for a full
>>> description.
>>> ---
>>> You received this message because you are subscribed to the Google Groups
>>> "sqlalchemy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an
>>> email to sqlalchemy+unsubscr...@googlegroups.com.
>>> To post to this group, send email to sqlalchemy@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/sqlalchemy.
>>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
>>
>> Антон
>>
>> --
>> SQLAlchemy -
>> The Python SQL Toolkit and Object Relational Mapper
>>
>> http://www.sqlalchemy.org/
>>
>> To post example code, please provide an MCVE: Minimal, Complete, and
>> Verifiable Example. See http://stackoverflow.com/help/mcve for a full
>> description.
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "sqlalchemy" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to sqlalchemy+unsubscr...@googlegroups.com.
>> To post to this group, send email to sqlalchemy@googlegroups.com.
>> Visit this group at https://groups.google.com/group/sqlalchemy.
>> For more options, visit https://groups.google.com/d/optout.

-- 
SQLAlchemy - 
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.

Re: [sqlalchemy] Disable `session_identity_map` for `_instance_processor`

Reply via email to