On Mon, Aug 31, 2015 at 12:09 PM, twoflower <standa.ku...@gmail.com> wrote:
> I have the following three tables:
>
> DOCUMENT
>   id (indexed)
>   documenttype
>   date_last_updated: timestamp(6) (indexed)
>
> EXTERNAL_TRANSLATION_UNIT
>   id (indexed)
>   fk_id_document (indexed)
>
> EXTERNAL_TRANSLATION
>   id (indexed)
>   fk_id_translation_unit (indexed)
>
> Table sizes:
>   DOCUMENT: 381 000
>   EXTERNAL_TRANSLATION_UNIT: 76 000 000
>   EXTERNAL_TRANSLATION: 76 000 000
>
> Now the following query takes about 36 minutes to finish:
>
> SELECT u.id AS id_external_translation_unit,
>        r.id AS id_external_translation,
>        u.fk_id_language AS fk_id_source_language,
>        r.fk_id_language AS fk_id_target_language,
>        doc.fk_id_job
>   FROM "EXTERNAL_TRANSLATION_UNIT" u
>   JOIN "DOCUMENT" doc ON u.fk_id_document = doc.id
>   JOIN "EXTERNAL_TRANSLATION" r ON u.id = r.fk_id_translation_unit
>  WHERE doc.date_last_updated >= date(now() - '171:00:00'::interval)
>  ORDER BY r.id LIMIT 1000
>
> This is the query plan:
>
> <http://postgresql.nabble.com/file/n5864045/qp1.png>
>
> If I remove the WHERE condition, it returns immediately.

So does "SELECT 1;" - but since that doesn't give the same answer, it is not very relevant.

> Am I doing something obviously wrong?

Not obviously...

> Thank you for any ideas.

Consider updating the translation tables at the same time the document table is updated. That way you can apply the WHERE and ORDER BY clauses against the same table.

I presume you've run ANALYZE on the data.

I would probably try something like:

WITH docs AS (SELECT ... WHERE date > ...)
SELECT ...
  FROM (translations JOIN translation_unit) t
 WHERE EXISTS (SELECT 1 FROM docs WHERE t.doc_id = docs.doc_id)
 ORDER BY t.id
 LIMIT 1000

You are trying to avoid the NESTED LOOP, and the above has a decent chance of materializing docs and then building either a bitmap or hash map for both docs and translations, thus performing a single sequential scan over both instead of performing 70+ million index lookups.
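To make the sketch concrete, here is one way it could be filled in using the table and column names from the original query. This is untested against the real schema, and note that the EXISTS form drops doc.fk_id_job from the select list (it would need an extra join against the CTE to recover it):

```sql
WITH docs AS (
    SELECT id
      FROM "DOCUMENT"
     WHERE date_last_updated >= date(now() - '171:00:00'::interval)
)
SELECT u.id AS id_external_translation_unit,
       r.id AS id_external_translation,
       u.fk_id_language AS fk_id_source_language,
       r.fk_id_language AS fk_id_target_language
  FROM "EXTERNAL_TRANSLATION_UNIT" u
  JOIN "EXTERNAL_TRANSLATION" r ON r.fk_id_translation_unit = u.id
 WHERE EXISTS (SELECT 1 FROM docs WHERE docs.id = u.fk_id_document)
 ORDER BY r.id
 LIMIT 1000;
```

Comparing EXPLAIN output for this form against the original should show whether the planner chooses a hash semi-join over docs instead of the nested loop.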
Take this with a grain of salt, as my fluency in this area is limited - my usual approach is trial and error, and without the data that is difficult. I'm not sure whether the planner could be smarter here, because you are asking it a question it is not particularly well suited to estimating - namely, cross-table correlations. Rethinking the model is likely to give you a better outcome long-term, though it does seem like there should be room for improvement within the stated query and model.

As Tomas said, you will likely benefit from increased working memory in order to make materializing and hashing/bitmapping favorable compared to a nested loop.

David J.
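For the working-memory point, the relevant setting is work_mem, which governs how much memory a sort or hash operation may use and can be raised per session before running the query (the value below is illustrative, not a recommendation):

```sql
-- Session-level only; does not affect other connections.
SET work_mem = '256MB';

-- Re-run EXPLAIN (or EXPLAIN ANALYZE) on the query afterwards to see
-- whether the plan switches from a nested loop to hash joins.
```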