Re: Scaling/Parallel patterns for a View executing complex database transaction

Stephen J. Butler Thu, 14 May 2015 21:47:43 -0700

If it's just key-value storage, adding a memcached layer sounds like a
good thing to investigate. Do the tuples frequently change?


On Thu, May 14, 2015 at 11:30 PM, Me Sulphur <[email protected]> wrote:
> Hi Russ,
>
> Thanks! While there are some pointers that we can pick up from your answer,
> let me add some more details.
>
> * It is a large key-value pair table (approaching ~ 1 bn rows) with an MD5
> key and JSON  for value. The look ups depend on the business logic but are
> necessary. Nevertheless, there are no more than 10-12 queries executed by
> combining key look ups using WHERE IN queries and pg is surprisingly good
> with IN queries.
>
> * Pre-computation is theoretically possible but the permutations can be very
> limiting. Over the top of my head, I think it will be M^n where M and n are
> very large (and growing) integers.
>
> * You are right, celery+polling can be done but the API is already deployed
> and in use in environments where there is strong resistance to rework a code
> base which is stable and changing them (consumers of the API) is beyond
> control.
>
>
> Thanks!
>
>
> On Friday, 15 May 2015 06:06:57 UTC+5:30, Russell Keith-Magee wrote:
>>
>>
>> On Thu, May 14, 2015 at 6:03 PM, Me Sulphur <[email protected]> wrote:
>>>
>>> Stack: Django 1.7 + Postgres 9.3 + Linux (No caching)
>>>
>>> Our application has a view which is called/executed very frequently. The
>>> view receives, parses and responds a JSON.
>>>
>>> In between request and response, there are about 3-5 inserts and around
>>> 1200-5000 look ups depending upon some if..else business logic. At around
>>> 2-4 seconds the view is very slow.
>>>
>>> However, a lot of the look ups (which are the bottlenecks) can be
>>> parallelized. But I do not know how can I do the same within a
>>> request-response cycle.
>>>
>>> If it was a web UI, I could use celery+polling, since it a
>>> machine-machine API call, the parallelisation has to be possible within a
>>> View's life cycle.
>>>
>>> If parallelisation is not possible, what alternatives do I have for
>>> scaling and reducing response time.
>>>
>> The short answer is "it depends".
>>
>> There isn't a single answer - everything will depend on the specifics of
>> your problem space. All I can offer is some vague suggestions of places you
>> might be able to look for some extra speed.
>>
>>  * Do you *really* need to do 1200-5000 lookups? It's faster to do 1 query
>> returning 10 rows than 10 queries returning 1 row each. Can you optimise the
>> queries on the database to minimise the number of queries needed? Depending
>> on circumstances, it may even be faster to do 1 query returning 15 rows, and
>> then post-process in the view to throw away the 5 rows you don't need.
>>
>>  * Better still - can you optimize the database structure so that
>> 1200-5000 calls aren't needed? Modern relational databases are *really* good
>> at query optimisation - if you give it the right query, and the right
>> database.
>>
>>  * Can you work your algorithm another way? In a previous job, I worked on
>> a tool that would look at a database of thousands of news articles received
>> in a given day, and then, for each of thousands of users, work out which
>> articles were "interesting" so they could be sent out in a daily alert
>> email. The obvious algorithm for this is "for user in users:
>> find_articles(user)" - but, for a variety of reasons, it turned out that
>> doing "for article in articles: find_users(article)" was almost 2 orders of
>> magnitude of faster. The less obvious algorithm allowed much greater
>> caching, and massively cut down the number of queries that were required.
>> The tradeoff was a lot more memory (memory vs speed is almost always the
>> tradeoff), and it wasn't only faster if you computed the results for *all*
>> users at the same time - but this was an daily offline process, so these
>> were limitations we were willing to accept.
>>
>>  * To that end - is there anything that can be precomputed? Can you cache
>> pieces of the response? Is there anything you can put into a memory store,
>> rather than the database. Databases are great, but if you have a small
>> amount of frequently re-used, easily keyed data, it may be better to put
>> that data into a location where it can be obtained quickly, rather than
>> hitting the database.
>>
>>  * If you *must* parallelize, and your algorithm is conducive to it,
>> threads are probably your best option - work out what part of your algorithm
>> can be parallelized, and put each part in a thread, and merge the results
>> once all the threads complete. If you're on Python 3, look into the
>> concurrent.futures module (or the "futures" module if you're on Python 2) to
>> help make this easier to manage. However, threads aren't magic fairy dust -
>> my "limited knowledge of your situation" guess is that parallelization won't
>> help you. If you've got a frequently executed view doing thousands of
>> database calls, I'm going to guess the database is already a bit of a
>> bottleneck; adding 10 threads per request is going to increase the database
>> load and make performance *worse*, not better, and if it's a high traffic
>> view, at some point, you're going to hit the limit of the number of threads
>> your server can handle.
>>
>>  * Lastly, I'd challenge your assertion that this can't be done using a
>> celery + poll approach because it's a machine API. The fact that it's a
>> machine consuming the API doesn't matter; it's just a matter of what the
>> machine consumes. The public API for a "long running data processing
>> service" should be a 2 call API: (a) submit a processing request, and (b)
>> check the status of a specific processing request. If a human is consuming
>> this API, it can be put into a nice single-page app with an AJAX call
>> checking the status of the request. If it's a machine, you can make exactly
>> the same calls; you just treat it as a sequential sequence of calls with a
>> "while not finished: poll(); sleep(5)" loop. Of course, changing the API
>> style won't make your algorithm run any faster - it will just take the load
>> off your web server and allow you to offload processing to other servers.
>>
>> I hope some of this helps.
>>
>> Yours,
>> Russ Magee %-)
>
> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/django-users.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-users/3f23de7e-1207-4dc6-ac7f-2c7a7edd6935%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/django-users.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-users/CAD4ANxX7wgKaYyFkCxBvGWtBxPcpeSE8Li1TmEJvFiPubOvjbA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: Scaling/Parallel patterns for a View executing complex database transaction

Reply via email to