Hi Mike,

Thanks for your detailed reply and your work on SQLAlchemy. While 
researching this issue I had already read your blog post about asyncio; 
it was very insightful.

Let me briefly describe my benchmark setup:
2 MySQL database nodes, 90 shards (databases) in total.


On Thursday, August 23, 2018 at 12:04:16 AM UTC+8, Mike Bayer wrote:
>
> On Wed, Aug 22, 2018 at 6:03 AM, Carson Ip <carso...@gmail.com> wrote: 
> > This is my first post here. 
> > 
> > Software / Library versions: (tho unrelated) 
> > sqlalchemy version: 1.0.19 
> > db: MySQL 5.6 
> > db driver: mysqlclient 1.3.7 
> > 
> > Background: 
> > I have 6 databases with a total of hundreds of shards. I realize when I 
> > add more database hosts (and shards), my ShardedSession which queries all 
> > the shards becomes slower. I looked into ShardedQuery and understand that 
> > things run in serial. I am trying to add some gevent magic to parallelize 
> > the queries. 
>
> First step.  Did you do Python profiling that shows that waiting for 
> the databases to return results is where you actually have problems? 
> Here's a recipe for code profiling: 
> http://docs.sqlalchemy.org/en/latest/faq/performance.html#code-profiling. 
>    I see that you claim a 20-30% speedup in some cases however you 
> will likely get the same or better speedup parallelizing with threads, 
> and you will not have the danger of CPU-bound greenlets running which 
> in the example you've given, you are definitely at risk of having. 
> Specifically with MySQL if one of your connections is making a fresh 
> MySQL connection to the database and another greenlet takes over 
> running that often-CPU-expensive instances() call, the MySQL server 
> will reject your connection for taking too long to respond in the 
> authentication phase and you'll get an error (which at my job took me 
> weeks to figure out).   However, that issue is if you are actually 
> using non-blocking IO.  Since you are using the native mysqlclient 
> driver, I believe gevent's monkeypatch of that runs it in a 
> thread-pool anyway, the real non-blocking version is when you use 
> pymysql instead.  So it's unlikely you are actually using any 
> non-blocking IO anyway, you're just using threads with a lot of gevent 
> complexity around it. 
>
>

When I profile my code with a ShardedSession that queries all 90 shards, 
15% of the time is spent in gevent's wait function and 12% in MySQL's 
rollback (due to connection close). Everything else is insignificant 
(~1% each). This confirms my assumption that most of the time is spent 
waiting on the sequential queries across the shards.
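
For reference, I profiled it along the lines of the code-profiling recipe 
from the FAQ you linked (a minimal sketch; sharded_session, MyModel, field 
and target stand in for my real session and mapped class):

import contextlib
import cProfile
import pstats

@contextlib.contextmanager
def profiled(top=20):
    # Adapted from the code-profiling recipe in the SQLAlchemy
    # performance FAQ.
    pr = cProfile.Profile()
    pr.enable()
    yield
    pr.disable()
    pstats.Stats(pr).sort_stats("cumulative").print_stats(top)

with profiled():
    sharded_session.query(MyModel.id).filter(MyModel.field == target).all()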

When I parallelize the shard queries (by modifying _execute_and_instances), 
that bottleneck goes away. Of course, after reading your detailed 
explanation, I now realize this is NOT the way to go.

I naively assumed that parallelizing things would reduce the wait. Thanks 
to your response I am now making progress: I am trying to stay out of the 
internals and do the parallelization at a higher level, as sketched below.
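
Concretely, this is the direction I am taking now (a rough sketch, 
assuming one Session per shard and a plain thread pool; session_for_shard 
is a hypothetical factory, and MyModel, field, target and all_shard_ids 
stand in for my real objects):

from concurrent.futures import ThreadPoolExecutor

def query_shard(shard_id):
    # One independent Session per shard, so no shared Session or
    # QueryContext state is touched concurrently.
    session = session_for_shard(shard_id)  # hypothetical factory
    try:
        return session.query(MyModel.id).filter(MyModel.field == target).all()
    finally:
        session.close()

# A bounded thread pool instead of one greenlet per shard.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = []
    for rows in pool.map(query_shard, all_shard_ids):
        results.extend(rows)

With a bounded pool I can also tune how many shards are hit at once, 
which is the chunking you suggested.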


Let me show you my benchmarks. The query is a very simple 
"SELECT id FROM table WHERE field = target"; all times are in seconds:

Global (cross-shard ShardedSession): 0.195
Global with baked query: 0.189
Parallel (1 session per shard, shard-parallelized): 0.170
Parallel with baked query: 0.160
Chunked with baked (1 session per shard, parallelized per MySQL node): 0.129
Raw Parallel (mysqldb directly, shard-parallelized): 0.102
Raw Serial (mysqldb, sequential): 0.173

Observations:

1. The query is only run once, so "global with baked" yields an 
insignificant gain: the baked query is generated and then used only once. 
For the same reason, "parallel with baked" beats "parallel" alone, since 
there the baked query is reused across all the shards (see the sketch 
after these observations).

2. As you suggested, the chunked variant gives the best result with 
SQLAlchemy. My guess is that it neither bombards MySQL with all the 
queries at once nor pushes Python too hard at any one time.

3. The mysqldb benchmarks are there as a baseline.
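
For completeness, here is roughly what my chunked-with-baked variant 
looks like (a sketch under assumptions: session_for_shard is the same 
hypothetical factory as above, shards_by_node maps each MySQL node to 
its shard ids, and the shared bakery is what lets the compiled query be 
reused across shards):

from concurrent.futures import ThreadPoolExecutor
from sqlalchemy import bindparam
from sqlalchemy.ext import baked

bakery = baked.bakery()

def query_shard_baked(shard_id):
    session = session_for_shard(shard_id)  # hypothetical factory
    try:
        # The query is compiled once, cached by the bakery, and the
        # cached form is reused for every shard.
        bq = bakery(lambda s: s.query(MyModel.id))
        bq += lambda q: q.filter(MyModel.field == bindparam("target"))
        return bq(session).params(target=target).all()
    finally:
        session.close()

def run_node(shard_ids):
    # Each worker walks its node's shards serially, so a node only
    # ever sees one of my queries at a time.
    return [row for sid in shard_ids for row in query_shard_baked(sid)]

# One worker per MySQL node (2 in my case).
with ThreadPoolExecutor(max_workers=len(shards_by_node)) as pool:
    results = []
    for rows in pool.map(run_node, shards_by_node.values()):
        results.extend(rows)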

Off topic:
In the application, I also added a cache for the shard_id of the 
requested object. The numbers do look much better with it, but on a 
cache miss the cross-shard query is still necessary. The idea is roughly 
the following.
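
(A minimal sketch; the cache is a plain in-process dict here, and 
find_across_shards stands in for the expensive cross-shard lookup.)

shard_cache = {}  # object key -> shard_id

def get_object(key):
    shard_id = shard_cache.get(key)
    if shard_id is not None:
        # Cache hit: query only the one shard that owns the object.
        session = session_for_shard(shard_id)
        try:
            return session.query(MyModel).get(key)
        finally:
            session.close()
    # Cache miss: fall back to the cross-shard query, then remember
    # where the object lives.
    obj, shard_id = find_across_shards(key)  # hypothetical lookup
    if obj is not None:
        shard_cache[key] = shard_id
    return obj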

> As far as non-blocking IO (which is the thing you are hoping to take 
> advantage of when using gevent), I have been writing for years now, 
> non-blocking IO does not help with concurrent relational database 
> access in the general, CRUD-style case compared to using threads. 
> Python is way more CPU bound than you'd like, and your non-blocking 
> sockets will simply be waiting for your CPU to get around to attending 
> to them.  See my post at 
> http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/ 
> with benchmarks, and in particular look at the "runsnakerun" images 
> that show just how much work Python's interpreter does around a 
> relatively small bit of actually waiting for the database. 
>
> Whether you use thread or gevent, running the results into a single 
> session using instances() is not thread/greenlet safe once you've 
> finished getting the database result, so you would need to mutex 
> around that part. 
>
> Overall, if you were to do profiling, if your queries are scanning 
> across dozens of shards, you are definitely spending lots of time 
> compiling queries.  In recent months, users have been getting by far 
> the best performance boosts by reducing their CPU load using the baked 
> query system (
> http://docs.sqlalchemy.org/en/latest/orm/extensions/baked.html). 
>    It depends very much on exactly what you are doing and if you can 
> indeed copy the same query for all your shards without textual change 
> but I'm pretty sure if you look in your code profiling you will see a 
> lot of the time spent in query compilation which the baked system can 
> bring down to nearly zero. 
>
>
>
> > Please excuse my application design which did not allow me to get 
> > the shard id by looking at the query target. 
> > 
> > My proposal: 
> > 
> > from gevent.pool import Group 
> > from sqlalchemy.ext.horizontal_shard import ShardedQuery 
> > class GeventShardedQuery(ShardedQuery): 
> >     def _execute_and_instances(self, context): 
> >         def iter_for_shard(shard_id): 
> >             context.attributes['shard_id'] = shard_id 
> >             result = self._connection_from_session( 
> >                 mapper=self._mapper_zero(), 
> >                 shard_id=shard_id).execute( 
> >                 context.statement, 
> >                 self._params) 
> >             return self.instances(result, context) 
> > 
> >         if self._shard_id is not None: 
> >             return iter_for_shard(self._shard_id) 
> >         else: 
> >             g = Group() 
> >             values = [g.spawn(iter_for_shard, shard_id) for shard_id in 
> > self.query_chooser(self)] 
> >             g.join() 
> >             partial = [] 
> >             for v in values: 
> >                 partial.extend(v.value) 
> > 
> >             # if some kind of in memory 'sorting' 
> >             # were done, this is where it would happen 
> >             return iter(partial) 
> > 
> > 
> > Observation: 
> > In my benchmark, it will significantly (20-30% gain) speed up some 
> > queries. I also see some cross-shard queries with no gain (that one is 
> > with eager-loading). 
> > 
> > Here comes my questions: 
> > 1. Is it a good idea to actually parallelize the queries? Will it mess 
> > with some of the internals (e.g. sqlalchemy query context, db driver)? 
>
> I'm not sure if the MySQL drivers are OK with using multiple cursors 
> concurrently like that (e.g. you are querying concurrently from a 
> single DB connection). I'm pretty sure pymysql isn't; mysqlclient 
> might be OK with it, I haven't tested in a while. 
>
> The next part, "instances()", that's not safe at all to run 
> concurrently.    It might be "working" in the greenlet if the results 
> are totally fetched up front and there is no more IO happening however 
> eager loads will run more IO.   The state of the Session is highly 
> modified by instances() and I can't guarantee concurrent 
> threads/greenlets running that will not create problems.   You'll want 
> to put a mutex around that part that is scoped to the Session itself. 
>
>
> > Although I'm quite sure mysqlclient is ok with this. 
> > 2. 
> > context.attributes['shard_id'] = shard_id 
> > assigning context.attributes['shard_id'] like this no longer works 
> > because it will be overwritten by another greenlet due to the 
> > parallelization. I suppose it is not a big deal to me because I don't 
> > need it anyway. But is there an easy way to work around this? 
>
> all of the work involving the Context needs to be mutexed.   A 
> possible workaround would be to use multiple QueryContext objects, 
> read into multiple Session objects, then merge the results back into a 
> single Session.  But I think you'd find this adds a lot more overhead 
> than is desirable. 
>
>
> > 3. Is there any other problems with my code? Now I will create a 
> > ShardedSession with the query_cls set as GeventShardedQuery. 
>
> I usually am very nervous about the typical code we see with gevent, 
> [gevent.spawn(thing) for index in possibly_enormous_list_of_things]. 
>   For example, if that list is hundreds of elements long (you said you 
> had 100's of shards) do you want that many queries all in play at once 
> (and if you had to use individual DB connections, would you want to 
> spin up 100's of DB connections too), or would it be better to run 
> queries in chunks.      Anytime I do concurrency work, I always do 
> things in chunks, which requires a little bit more programming 
> overhead.  But it makes for a runtime result that is a lot more 
> predictable. 
>
> I just wrote on SO about this yesterday in response to one of the 
> many, many "gevent is faster than threads here's proof!" posts that 
> are inevitably always incorrect: 
>
> https://stackoverflow.com/a/51932442/34549 
>
> in that post, the simple approach of batching into a pool of threads 
> is illustrated. 
>
> As far as your code I would really need to evaluate a runnable proof 
> of concept and also have some idea of how many shards you are actually 
> querying at once and how you are setting up for those shards (e.g. do 
> you have many hundreds of create_engine() calls?) 
>
>
>

And yes, I do have one create_engine() call per shard. You would say my 
DBA would not be happy, but I am my own DBA, so it's fine. Each engine has 
a pool of a few connections, which adds up to hundreds of connections in 
total. I wish I could create one engine per node instead of per shard and 
switch between shards while working; I once hit an OOM in MySQL due to the 
huge number of connections. My setup looks roughly like the sketch below.

Off topic: I discovered a bug in BakedQuery when it is used with a 
ShardedSession first and then with a regular Session; it crashes. Let me 
create a proper bug report and probably a PR later.

Once again, thanks a lot for the insights. This is all quite new to me.

>
>
> > 
> > Thanks in advance! 
> > 
