Hi List,
We have a join performance issue and are not sure in which direction we should
look to solve it.
We currently have a single-node setup.
We have two collections that we join in queries, joined on a "primary key"
string field, contentId_s.
Each dataset for a single contentId_s consists of multiple timecode-based
documents in both indexes, which makes this a many-to-many join.
collection1 - contains generic metadata and timecode-based content (think
timecode-based comments):
Documents: 382,872
Unique contentId_s: 16,715
~160 MB index size
single shard
collection2 - contains timecode-based GPS data (GPS position, field of
view, ...; these timecodes are not related to the timecodes in collection1, so
flattening the two into one structure would blow up the number of documents to
incredible numbers):
Documents: 695,887,875
Unique contentId_s: 10,199
~300 GB index size
single shard
Hardware is an HP DL360 with 32 GB of RAM (we also tried a machine with 64 GB
without much improvement) and a 1 TB SSD for the index.
In our use case there is a lot of indexing/deletion traffic on both indexes and
only few queries fired against the server.
We are constantly indexing new content and deleting old documents. This was
already getting problematic with HDDs, so we switched to SSDs;
indexing speed is fine for now (we might also need to scale this up in the
future to allow more throughput).
But search speed suffers when we need to join with the big collection2 (the
query can take up to 30 seconds to succeed). We had some success experimenting
with score join queries when the collection2 side matches only a few unique
IDs, but we can't predict that this is always the case, and if a lot of
documents are hit in collection2,
performance is 10x worse than with the original normal join.
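For reference, the score join variant we experimented with looked roughly like
this (adding a score parameter such as score=none switches Solr to the
score-based join implementation; coll2field/someval are placeholders):

query: {!join from=contentId_s to=contentId_s fromIndex=collection2
score=none v='coll2field:someval'}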
Sample queries look like this (simplified; more complex queries are not much
slower):
Sample1:
query: coll1field:someval OR {!join from=contentId_s to=contentId_s
fromIndex=collection2 v='coll2field:someval'}
filter: {!collapse field=contentId_s min=timecode_f}
Sample 2:
query: coll1field:someval
filter: {!join from=contentId_s to=contentId_s fromIndex=collection2
v='coll2field:otherval'}
filter: {!collapse field=contentId_s min=timecode_f}
I experimented with first running the query against collection2 alone, only to
get the number of matching contentId_s values (collapsing on contentId_s), so
that we could choose the right join strategy based on the result count. But
when there are many hits in collection2, this pre-query takes almost as long
as the join itself, so slow queries would get even slower.
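To sketch what I mean (field names and the threshold are placeholders; the
json.facet unique() aggregation is one way to count distinct contentId_s
values on the collection2 side without collapsing):

pre-query to collection2:
q=coll2field:someval&rows=0&json.facet={ids:"unique(contentId_s)"}
then: if ids is below some threshold, use the score join against collection1;
otherwise fall back to the normal join.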
Caches also don't seem to help much, since almost every query fired is
different and the index is mostly changing between requests anyway.
We are open to anything: adding nodes/hardware/shards, changing the index
structure...
Currently we don't know how to get around the big join.
Any advice on which direction we should look in?