If it's just key-value storage, adding a memcached layer sounds like a good thing to investigate. Do the tuples frequently change?
On Thu, May 14, 2015 at 11:30 PM, Me Sulphur <[email protected]> wrote: > Hi Russ, > > Thanks! While there are some pointers that we can pick up from your answer, > let me add some more details. > > * It is a large key-value pair table (approaching ~ 1 bn rows) with an MD5 > key and JSON for value. The look ups depend on the business logic but are > necessary. Nevertheless, there are no more than 10-12 queries executed by > combining key look ups using WHERE IN queries and pg is surprisingly good > with IN queries. > > * Pre-computation is theoretically possible but the permutations can be very > limiting. Over the top of my head, I think it will be M^n where M and n are > very large (and growing) integers. > > * You are right, celery+polling can be done but the API is already deployed > and in use in environments where there is strong resistance to rework a code > base which is stable and changing them (consumers of the API) is beyond > control. > > > Thanks! > > > On Friday, 15 May 2015 06:06:57 UTC+5:30, Russell Keith-Magee wrote: >> >> >> On Thu, May 14, 2015 at 6:03 PM, Me Sulphur <[email protected]> wrote: >>> >>> Stack: Django 1.7 + Postgres 9.3 + Linux (No caching) >>> >>> Our application has a view which is called/executed very frequently. The >>> view receives, parses and responds a JSON. >>> >>> In between request and response, there are about 3-5 inserts and around >>> 1200-5000 look ups depending upon some if..else business logic. At around >>> 2-4 seconds the view is very slow. >>> >>> However, a lot of the look ups (which are the bottlenecks) can be >>> parallelized. But I do not know how can I do the same within a >>> request-response cycle. >>> >>> If it was a web UI, I could use celery+polling, since it a >>> machine-machine API call, the parallelisation has to be possible within a >>> View's life cycle. >>> >>> If parallelisation is not possible, what alternatives do I have for >>> scaling and reducing response time. >>> >> The short answer is "it depends". >> >> There isn't a single answer - everything will depend on the specifics of >> your problem space. All I can offer is some vague suggestions of places you >> might be able to look for some extra speed. >> >> * Do you *really* need to do 1200-5000 lookups? It's faster to do 1 query >> returning 10 rows than 10 queries returning 1 row each. Can you optimise the >> queries on the database to minimise the number of queries needed? Depending >> on circumstances, it may even be faster to do 1 query returning 15 rows, and >> then post-process in the view to throw away the 5 rows you don't need. >> >> * Better still - can you optimize the database structure so that >> 1200-5000 calls aren't needed? Modern relational databases are *really* good >> at query optimisation - if you give it the right query, and the right >> database. >> >> * Can you work your algorithm another way? In a previous job, I worked on >> a tool that would look at a database of thousands of news articles received >> in a given day, and then, for each of thousands of users, work out which >> articles were "interesting" so they could be sent out in a daily alert >> email. The obvious algorithm for this is "for user in users: >> find_articles(user)" - but, for a variety of reasons, it turned out that >> doing "for article in articles: find_users(article)" was almost 2 orders of >> magnitude of faster. The less obvious algorithm allowed much greater >> caching, and massively cut down the number of queries that were required. >> The tradeoff was a lot more memory (memory vs speed is almost always the >> tradeoff), and it wasn't only faster if you computed the results for *all* >> users at the same time - but this was an daily offline process, so these >> were limitations we were willing to accept. >> >> * To that end - is there anything that can be precomputed? Can you cache >> pieces of the response? Is there anything you can put into a memory store, >> rather than the database. Databases are great, but if you have a small >> amount of frequently re-used, easily keyed data, it may be better to put >> that data into a location where it can be obtained quickly, rather than >> hitting the database. >> >> * If you *must* parallelize, and your algorithm is conducive to it, >> threads are probably your best option - work out what part of your algorithm >> can be parallelized, and put each part in a thread, and merge the results >> once all the threads complete. If you're on Python 3, look into the >> concurrent.futures module (or the "futures" module if you're on Python 2) to >> help make this easier to manage. However, threads aren't magic fairy dust - >> my "limited knowledge of your situation" guess is that parallelization won't >> help you. If you've got a frequently executed view doing thousands of >> database calls, I'm going to guess the database is already a bit of a >> bottleneck; adding 10 threads per request is going to increase the database >> load and make performance *worse*, not better, and if it's a high traffic >> view, at some point, you're going to hit the limit of the number of threads >> your server can handle. >> >> * Lastly, I'd challenge your assertion that this can't be done using a >> celery + poll approach because it's a machine API. The fact that it's a >> machine consuming the API doesn't matter; it's just a matter of what the >> machine consumes. The public API for a "long running data processing >> service" should be a 2 call API: (a) submit a processing request, and (b) >> check the status of a specific processing request. If a human is consuming >> this API, it can be put into a nice single-page app with an AJAX call >> checking the status of the request. If it's a machine, you can make exactly >> the same calls; you just treat it as a sequential sequence of calls with a >> "while not finished: poll(); sleep(5)" loop. Of course, changing the API >> style won't make your algorithm run any faster - it will just take the load >> off your web server and allow you to offload processing to other servers. >> >> I hope some of this helps. >> >> Yours, >> Russ Magee %-) > > -- > You received this message because you are subscribed to the Google Groups > "Django users" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. > Visit this group at http://groups.google.com/group/django-users. > To view this discussion on the web visit > https://groups.google.com/d/msgid/django-users/3f23de7e-1207-4dc6-ac7f-2c7a7edd6935%40googlegroups.com. > > For more options, visit https://groups.google.com/d/optout. -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CAD4ANxX7wgKaYyFkCxBvGWtBxPcpeSE8Li1TmEJvFiPubOvjbA%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.

