On Wed, Aug 22, 2018 at 6:03 AM, Carson Ip <carsonip...@gmail.com> wrote:

> This is my first post here.
>
> Software / Library versions: (tho unrelated)
> sqlalchemy version: 1.0.19
> db: MySQL 5.6
> db driver: mysqlclient 1.3.7
>
> Background:
> I have 6 databases with a total of hundreds of shards. I realize when I add
> more database hosts (and shards), my ShardedSession which queries all the
> shards becomes slower. I looked into ShardedQuery and understand that things
> run in serial. I am trying to add some gevent magic to parallelize the
> queries.
First step: did you do Python profiling that shows that waiting for the databases to return results is where you actually have a problem? Here's a recipe for code profiling: http://docs.sqlalchemy.org/en/latest/faq/performance.html#code-profiling

I see that you claim a 20-30% speedup in some cases; however, you will likely get the same or better speedup parallelizing with threads, and you will not have the danger of CPU-bound greenlets running, which in the example you've given you are definitely at risk of having. Specifically with MySQL: if one of your connections is making a fresh MySQL connection to the database and another greenlet takes over running that often-CPU-expensive instances() call, the MySQL server will reject your connection for taking too long to respond in the authentication phase and you'll get an error (an issue which at my job took me weeks to figure out). However, that issue only arises if you are actually using non-blocking IO. Since you are using the native mysqlclient driver, I believe gevent's monkeypatch runs it in a thread pool anyway; the real non-blocking version is when you use pymysql instead. So it's unlikely you are actually using any non-blocking IO at all; you're just using threads with a lot of gevent complexity around them.

As far as non-blocking IO goes (which is the thing you are hoping to take advantage of when using gevent), I have been writing for years now that non-blocking IO does not help with concurrent relational database access in the general, CRUD-style case compared to using threads. Python is way more CPU-bound than you'd like, and your non-blocking sockets will simply be waiting for your CPU to get around to attending to them. See my post at http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/ with benchmarks, and in particular look at the "runsnakerun" images that show just how much work Python's interpreter does around a relatively small bit of actually waiting for the database.
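The FAQ recipe above boils down to roughly the following; note that `run_queries` here is just a hypothetical stand-in, you'd swap in your actual cross-shard query so you can see whether the time goes to query compilation or to the driver's socket reads:

```python
import cProfile
import io
import pstats

def run_queries():
    # stand-in for the real work: profile your actual ShardedSession
    # query here to see where the time actually goes
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
result = run_queries()
profiler.disable()

# dump the top entries sorted by cumulative time; time spent in
# SQL compilation (vs. waiting on the driver) is what the baked
# query system or parallelism would respectively address
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(15)
report = stream.getvalue()
print(report.splitlines()[0])
```

If most of the cumulative time is inside pure-Python frames rather than the driver's fetch calls, parallelizing the IO won't buy you much.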
Whether you use threads or gevent, running the results into a single Session using instances() is not thread/greenlet safe once you've finished getting the database result, so you would need to mutex around that part.

Overall, if you were to do profiling, and your queries are scanning across dozens of shards, you are definitely spending lots of time compiling queries. In recent months, users have been getting by far the best performance boosts by reducing their CPU load using the baked query system (http://docs.sqlalchemy.org/en/latest/orm/extensions/baked.html). It depends very much on exactly what you are doing and whether you can indeed copy the same query to all your shards without textual change, but I'm pretty sure if you look in your code profiling you will see a lot of the time spent in query compilation, which the baked system can bring down to nearly zero.

> Please excuse my application design which did not allow me to get
> the shard id by looking at the query target.
>
> My proposal:
>
> from gevent.pool import Group
>
> class GeventShardedQuery(ShardedQuery):
>     def _execute_and_instances(self, context):
>         def iter_for_shard(shard_id):
>             context.attributes['shard_id'] = shard_id
>             result = self._connection_from_session(
>                 mapper=self._mapper_zero(),
>                 shard_id=shard_id).execute(
>                     context.statement,
>                     self._params)
>             return self.instances(result, context)
>
>         if self._shard_id is not None:
>             return iter_for_shard(self._shard_id)
>         else:
>             g = Group()
>             values = [g.spawn(iter_for_shard, shard_id)
>                       for shard_id in self.query_chooser(self)]
>             g.join()
>             partial = []
>             for v in values:
>                 partial.extend(v.value)
>
>             # if some kind of in memory 'sorting'
>             # were done, this is where it would happen
>             return iter(partial)
>
> Observation:
> In my benchmark, it will significantly (20-30% gain) speed up some queries.
> I also see some cross-shard queries with no gain (that one is with
> eager-loading).
>
> Here comes my questions:
>
> 1.
Is it a good idea to actually parallelize the queries? Will it mess
> with some of the internals (e.g. sqlalchemy query context, db driver)?

I'm not sure the MySQL drivers are OK with using multiple cursors concurrently like that (e.g. you are querying concurrently from a single DB connection). I'm pretty sure pymysql isn't; mysqlclient might be OK with it, I haven't tested in a while.

The next part, instances(), is not safe at all to run concurrently. It might be "working" in the greenlet if the results are totally fetched up front and there is no more IO happening; however, eager loads will run more IO. The state of the Session is highly modified by instances() and I can't guarantee that concurrent threads/greenlets running it will not create problems. You'll want to put a mutex around that part that is scoped to the Session itself.

> Although I'm quite sure mysqlclient is ok with this.
>
> 2.
> context.attributes['shard_id'] = shard_id
> assigning context.attributes['shard_id'] like this no longer works because
> it will be overwritten by another greenlet because of the parallelization. I
> suppose it is not a big deal to me because I don't need it anyway. But is
> there an easy way to workaround this?

All of the work involving the QueryContext needs to be mutexed. A possible workaround would be to use multiple QueryContext objects, read into multiple Session objects, then merge the results back into a single Session. But I think you'd find this adds a lot more overhead than is desirable.

> 3. Is there any other problems with my code? Now I will create a
> ShardedSession with the query_cls set as GeventShardedQuery.

I am usually very nervous about the typical code we see with gevent, [gevent.spawn(thing) for index in possibly_enormous_list_of_things].
For example, if that list is hundreds of elements long (you said you have hundreds of shards), do you want that many queries all in play at once (and if you had to use individual DB connections, would you want to spin up hundreds of DB connections too), or would it be better to run the queries in chunks? Anytime I do concurrency work, I always do things in chunks, which requires a little more programming overhead but makes for a runtime result that is a lot more predictable. I just wrote on SO about this yesterday in response to one of the many, many "gevent is faster than threads, here's proof!" posts that are inevitably always incorrect: https://stackoverflow.com/a/51932442/34549 - in that post, the simple approach of batching into a pool of threads is illustrated.

As far as your code goes, I would really need to evaluate a runnable proof of concept, and also have some idea of how many shards you are actually querying at once and how you are setting up for those shards (e.g. do you have many hundreds of create_engine() calls?).

> Thanks in advance!

--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.
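P.S. To make the "batch into a pool of threads, mutex around the merge" shape above concrete, here is a rough stdlib-only sketch; query_shard and fetch_and_merge are hypothetical stand-ins for your per-shard execute() and instances() steps, not SQLAlchemy APIs:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# hypothetical shard ids; you said you have hundreds of shards
shard_ids = ["shard_%d" % n for n in range(200)]

def query_shard(shard_id):
    # stand-in for connection.execute() against one shard; the raw
    # fetch is the part that is safe to run concurrently
    return [(shard_id, n) for n in range(3)]

merged = []
merge_lock = threading.Lock()  # scoped to the one Session results go into

def fetch_and_merge(shard_id):
    rows = query_shard(shard_id)
    # stand-in for instances(): it mutates Session state, so only one
    # thread may run it at a time
    with merge_lock:
        merged.extend(rows)

# cap concurrency at 10 shard queries in flight at once, rather than
# spawning one greenlet per shard; the pool batches through the list
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(fetch_and_merge, shard_ids))

print(len(merged))  # 200 shards x 3 rows each
```

The max_workers cap is the "chunking": it bounds how many DB connections and cursors are in play at any moment, regardless of how many shards the query fans out to.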