innerJoin(intersect(innerJoin(collection1, collection2), innerJoin(collection3, collection4)), collection5)
Let's focus on:

innerJoin(collection3, collection4)

The first thing to look at is how fast the export from collection4 is. You can test this with the NullStream using the following construct:

null(search(collection4))

The null stream will eat all the tuples and report back timing information. This isolates the performance of the export from collection4.

Once you have a baseline for how fast you can export from a single node, you can test with parallel export from a single node:

parallel(null(search(collection4)))

Then you can add replicas for collection4 and increase workers.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jun 1, 2017 at 11:51 PM, Susmit Shukla <shukla.sus...@gmail.com> wrote:

> Hi,
>
> Which version of Solr are you on?
> Increasing memory may not be useful, as the streaming API does not keep data in
> memory (except maybe for hash joins).
> Increasing replicas (not sharding) and pushing the join computation onto a
> worker Solr cluster with #workers > 1 would definitely make things faster.
> Are you limiting your results at some cutoff? If yes, then SOLR-10698
> <https://issues.apache.org/jira/browse/SOLR-10698> can be a useful fix. Also,
> the binary response format for streaming would be faster (available in 6.5,
> probably).
>
> On Thu, Jun 1, 2017 at 3:04 PM, thiaga rajan <
> ecethiagu2...@yahoo.co.in.invalid> wrote:
>
> > We are working on a proposal and feel that the streaming API along with the
> > export handler will best fit our use cases. We already have a structure in
> > Solr in which we use graph queries to produce a hierarchical structure.
> > Now, from that structure, we need to join a couple more collections.
> > We have 5 different collections:
> > Collection 1 - 800k records
> > Collection 2 - 200k records
> > Collection 3 - 7k records
> > Collection 4 - 6 million records
> > Collection 5 - 150k records
> > We are using the strategy below:
> > innerJoin( intersect( innerJoin(collection 1, collection 2),
> > innerJoin(Collection 3, Collection 4)), collection 5).
> > We are seeing that performance is too slow once we include collection 4.
> > With just collections 1, 2 and 5 the results come back in 2 secs.
> > The moment I include collection 4 in the query I can see a performance
> > impact. I believe exporting large results from collection 4 is causing
> > the issue. Currently I am using a single-shard collection with no
> > replica. I am thinking of increasing the memory as the first option to
> > improve performance, as processing doc values needs more memory. If that
> > does not work, I can try parallel stream / sharding. Kindly advise
> > whether there is anything else I am missing.
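
For anyone who wants to script the isolation test described in the reply, here is a rough Python sketch that builds the null(search(...)) and parallel(null(search(...))) expressions and sends them to a collection's /stream handler. Everything beyond the expressions named in the thread is an assumption for illustration: the field name "id", the q/fl/sort/qt parameters, the worker count, the sort passed to parallel(), and the base URL. Check these against the reference guide for your Solr version before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def export_search(collection, fl="id", sort="id asc", partition_keys=None):
    """Build a search() expression that reads through the /export handler.

    The field list and sort are placeholders (assumptions); /export needs
    docValues fields in both.
    """
    parts = [collection, 'q="*:*"', f'fl="{fl}"', f'sort="{sort}"', 'qt="/export"']
    if partition_keys:
        # Needed so that each parallel worker receives its own slice of tuples.
        parts.append(f'partitionKeys="{partition_keys}"')
    return "search(" + ", ".join(parts) + ")"


def null_export(collection, **kwargs):
    """Wrap the export in null() so tuples are discarded and only timing is reported."""
    return f"null({export_search(collection, **kwargs)})"


def parallel_null_export(worker_collection, collection, workers, key="id"):
    """Wrap the null export in parallel().

    The sort value given to parallel() is an assumption, not taken from the thread.
    """
    inner = null_export(collection, fl=key, sort=f"{key} asc", partition_keys=key)
    return f'parallel({worker_collection}, {inner}, workers="{workers}", sort="{key} asc")'


def run(base_url, collection, expr):
    """Send the expression to the /stream handler and return the raw response body."""
    url = f"{base_url}/{collection}/stream?" + urlencode({"expr": expr})
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Print the expressions only; call run() against a live cluster to time them,
    # e.g. run("http://localhost:8983/solr", "collection4", null_export("collection4")).
    print(null_export("collection4"))
    print(parallel_null_export("collection4", "collection4", workers=2))
```

Running the single-node expression first gives the baseline; the parallel form can then be re-run with a higher worker count after adding replicas for collection4.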