On Wed, Aug 22, 2018 at 6:03 AM, Carson Ip <carsonip...@gmail.com> wrote:

> This is my first post here.
>
> Software / Library versions: (tho unrelated)
> sqlalchemy version: 1.0.19
> db: MySQL 5.6
> db driver: mysqlclient 1.3.7
>
> Background:
> I have 6 databases with a total of hundreds of shards. I realize when I add
> more database hosts (and shards), my ShardedSession which queries all the
> shards becomes slower. I looked into ShardedQuery and understand that things
> run in serial. I am trying to add some gevent magic to parallelize the
> queries.
First step: did you do Python profiling that shows that waiting for the databases to return results is where you actually have a problem? Here's a recipe for code profiling: http://docs.sqlalchemy.org/en/latest/faq/performance.html#code-profiling

I see that you claim a 20-30% speedup in some cases; however, you will likely get the same or better speedup parallelizing with threads, and you will not have the danger of CPU-bound greenlets running, which in the example you've given you are definitely at risk of having. Specifically with MySQL: if one of your connections is making a fresh MySQL connection to the database and another greenlet takes over running that often-CPU-expensive instances() call, the MySQL server will reject your connection for taking too long to respond in the authentication phase and you'll get an error (an issue which at my job took me weeks to figure out). However, that issue only arises if you are actually using non-blocking IO. Since you are using the native mysqlclient driver, I believe gevent's monkeypatch runs it in a thread pool anyway; the real non-blocking version is when you use pymysql instead. So it's unlikely you are actually using any non-blocking IO at all; you're just using threads with a lot of gevent complexity around them.

As far as non-blocking IO goes (which is the thing you are hoping to take advantage of when using gevent), I have been writing for years now that non-blocking IO does not help with concurrent relational database access in the general, CRUD-style case compared to using threads. Python is way more CPU-bound than you'd like, and your non-blocking sockets will simply be waiting for your CPU to get around to attending to them. See my post at http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/ with benchmarks, and in particular look at the "runsnakerun" images that show just how much work Python's interpreter does around a relatively small bit of actually waiting for the database.
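The FAQ recipe above boils down to roughly the following; note that `run_queries` here is just a hypothetical stand-in, you'd swap in your actual cross-shard query so you can see whether the time goes to query compilation or to the driver's socket reads:

```python
import cProfile
import io
import pstats

def run_queries():
    # stand-in for the real work: profile your actual ShardedSession
    # query here to see where the time actually goes
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
result = run_queries()
profiler.disable()

# dump the top entries sorted by cumulative time; time spent in
# SQL compilation (vs. waiting on the driver) is what the baked
# query system or parallelism would respectively address
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(15)
report = stream.getvalue()
print(report.splitlines()[0])
```

If most of the cumulative time is inside pure-Python frames rather than the driver's fetch calls, parallelizing the IO won't buy you much.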
Whether you use threads or gevent, running the results into a single Session using instances() is not thread/greenlet safe once you've finished getting the database result, so you would need to mutex around that part.

Overall, if you were to do profiling, and your queries are scanning across dozens of shards, you are definitely spending lots of time compiling queries. In recent months, users have been getting by far the best performance boosts by reducing their CPU load using the baked query system (http://docs.sqlalchemy.org/en/latest/orm/extensions/baked.html). It depends very much on exactly what you are doing and whether you can indeed copy the same query to all your shards without textual change, but I'm pretty sure if you look in your code profiling you will see a lot of the time spent in query compilation, which the baked system can bring down to nearly zero.

> Please excuse my application design which did not allow me to get
> the shard id by looking at the query target.
>
> My proposal:
>
> from gevent.pool import Group
>
> class GeventShardedQuery(ShardedQuery):
>     def _execute_and_instances(self, context):
>         def iter_for_shard(shard_id):
>             context.attributes['shard_id'] = shard_id
>             result = self._connection_from_session(
>                 mapper=self._mapper_zero(),
>                 shard_id=shard_id).execute(
>                     context.statement,
>                     self._params)
>             return self.instances(result, context)
>
>         if self._shard_id is not None:
>             return iter_for_shard(self._shard_id)
>         else:
>             g = Group()
>             values = [g.spawn(iter_for_shard, shard_id)
>                       for shard_id in self.query_chooser(self)]
>             g.join()
>             partial = []
>             for v in values:
>                 partial.extend(v.value)
>
>             # if some kind of in memory 'sorting'
>             # were done, this is where it would happen
>             return iter(partial)
>
> Observation:
> In my benchmark, it will significantly (20-30% gain) speed up some queries.
> I also see some cross-shard queries with no gain (that one is with
> eager-loading).
>
> Here comes my questions:
>
> 1.
Is it a good idea to actually parallelize the queries? Will it mess
> with some of the internals (e.g. sqlalchemy query context, db driver)?

I'm not sure the MySQL drivers are OK with using multiple cursors concurrently like that (e.g. you are querying concurrently from a single DB connection). I'm pretty sure pymysql isn't; mysqlclient might be OK with it, I haven't tested in a while.

The next part, instances(), is not safe at all to run concurrently. It might be "working" in the greenlet if the results are totally fetched up front and there is no more IO happening; however, eager loads will run more IO. The state of the Session is highly modified by instances() and I can't guarantee that concurrent threads/greenlets running it will not create problems. You'll want to put a mutex around that part that is scoped to the Session itself.

> Although I'm quite sure mysqlclient is ok with this.
>
> 2.
> context.attributes['shard_id'] = shard_id
> assigning context.attributes['shard_id'] like this no longer works because
> it will be overwritten by another greenlet because of the parallelization. I
> suppose it is not a big deal to me because I don't need it anyway. But is
> there an easy way to workaround this?

All of the work involving the QueryContext needs to be mutexed. A possible workaround would be to use multiple QueryContext objects, read into multiple Session objects, then merge the results back into a single Session. But I think you'd find this adds a lot more overhead than is desirable.

> 3. Is there any other problems with my code? Now I will create a
> ShardedSession with the query_cls set as GeventShardedQuery.

I am usually very nervous about the typical code we see with gevent, [gevent.spawn(thing) for index in possibly_enormous_list_of_things].
For example, if that list is hundreds of elements long (you said you have hundreds of shards), do you want that many queries all in play at once (and if you had to use individual DB connections, would you want to spin up hundreds of DB connections too), or would it be better to run the queries in chunks? Anytime I do concurrency work, I always do things in chunks, which requires a little more programming overhead but makes for a runtime result that is a lot more predictable. I just wrote on SO about this yesterday in response to one of the many, many "gevent is faster than threads, here's proof!" posts that are inevitably always incorrect: https://stackoverflow.com/a/51932442/34549 - in that post, the simple approach of batching into a pool of threads is illustrated.

As far as your code goes, I would really need to evaluate a runnable proof of concept, and also have some idea of how many shards you are actually querying at once and how you are setting up for those shards (e.g. do you have many hundreds of create_engine() calls?).

> Thanks in advance!

--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.
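P.S. To make the "batch into a pool of threads, mutex around the merge" shape above concrete, here is a rough stdlib-only sketch; query_shard and fetch_and_merge are hypothetical stand-ins for your per-shard execute() and instances() steps, not SQLAlchemy APIs:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# hypothetical shard ids; you said you have hundreds of shards
shard_ids = ["shard_%d" % n for n in range(200)]

def query_shard(shard_id):
    # stand-in for connection.execute() against one shard; the raw
    # fetch is the part that is safe to run concurrently
    return [(shard_id, n) for n in range(3)]

merged = []
merge_lock = threading.Lock()  # scoped to the one Session results go into

def fetch_and_merge(shard_id):
    rows = query_shard(shard_id)
    # stand-in for instances(): it mutates Session state, so only one
    # thread may run it at a time
    with merge_lock:
        merged.extend(rows)

# cap concurrency at 10 shard queries in flight at once, rather than
# spawning one greenlet per shard; the pool batches through the list
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(fetch_and_merge, shard_ids))

print(len(merged))  # 200 shards x 3 rows each
```

The max_workers cap is the "chunking": it bounds how many DB connections and cursors are in play at any moment, regardless of how many shards the query fans out to.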