[sqlalchemy] Re: Question about ShardedQuery Gevent parallelization

Carson Ip Wed, 22 Aug 2018 03:06:23 -0700

Sorry for the messed up word wrap. I find this Google Groups web editor 
hard to use.


On Wednesday, August 22, 2018 at 6:03:07 PM UTC+8, Carson Ip wrote:
>
> This is my first post here.
>
> Software / Library versions: (tho unrelated)
> sqlalchemy version: 1.0.19
> db: MySQL 5.6
> db driver: mysqlclient 1.3.7
>
> Background:
> I have 6 databases with a total of hundreds of shards. I realize when I 
> add more database hosts (and shards), my ShardedSession which queries all 
> the shards becomes slower. I looked into ShardedQuery and understand that 
> things run in serial. I am trying to add some gevent magic to parallelize 
> the queries. Please excuse my application design which did not allow me to 
> get the shard id by looking at the query target.
>
> My proposal:
>
> from gevent.pool import Group
> class GeventShardedQuery(ShardedQuery):
>     def _execute_and_instances(self, context):
>         def iter_for_shard(shard_id):
>             context.attributes['shard_id'] = shard_id
>             result = self._connection_from_session(
>                 mapper=self._mapper_zero(),
>                 shard_id=shard_id).execute(
>                 context.statement,
>                 self._params)
>             return self.instances(result, context)
>
>         if self._shard_id is not None:
>             return iter_for_shard(self._shard_id)
>         else:
>             g = Group()
>             values = [g.spawn(iter_for_shard, shard_id) for shard_id in 
> self.query_chooser(self)]
>             g.join()
>             partial = []
>             for v in values:
>                 partial.extend(v.value)
>
>             # if some kind of in memory 'sorting'
>             # were done, this is where it would happen
>             return iter(partial)
>
>
> Observation:
> In my benchmark, it will significantly (20-30% gain) speed up some 
> queries. I also see some cross-shard queries with no gain (that one is with 
> eager-loading).
>
> Here comes my questions:
> 1. Is it a good idea to actually parallelizing the queries? Will it mess 
> with some of the internals (e.g. sqlalchemy query context, db driver)? 
> Although I'm quite sure mysqlclient is ok with this.
> 2. 
> context.attributes['shard_id'] = shard_id
> assigning context.attributes['shard_id'] like this no longer works because 
> it will be overwritten by another greenlet because of the parallelization. 
> I suppose it is not a big deal to me because I don't need it anyway. But is 
> there an easy way to workaround this?
> 3. Is there any other problems with my code? Now I will create a 
> ShardedSession with the query_cls set as GeventShardedQuery.
>
> Thanks in advance!
>
>

-- 
SQLAlchemy - 
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.

[sqlalchemy] Re: Question about ShardedQuery Gevent parallelization

Reply via email to