Re: Cross index join query performance
Ah, got it now - thanks for the explanation.
Re: Cross index join query performance
The thing here is to understand how a join works.

Effectively, it does the inner query first, which results in a list of terms. It then does a multi-term query with those values.

    q=size:large {!join fromIndex=other from=someid to=someotherid}type:shirt

Your inner query is, on core 'other', q=type:shirt&fl=someid. Imagine it returned the values A, B and C. Then your outer query becomes:

    size:large someotherid:(A B C)

Your inner query returns 25k values, so you're having to do a multi-term query for 25k terms. That is *bound* to be slow.

The pseudo-joins in Solr 4.x are intended for a small to medium number of values returned by the inner query; beyond that, performance degrades as you are seeing.

Is there a way you can reduce the number of values returned by the inner query?

As Joel mentions, those other joins are attempts to find other ways to work with this limitation.

Upayavira
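The two steps described above can be sketched as a toy model in Python. This is only an illustration of the idea, not Solr's code: the two in-memory "cores", the helper names, and the choice to treat both outer clauses as required are all assumptions made for the example.

```python
# Toy model of a Solr 4.x pseudo-join. All data and helper names are hypothetical.

# 'other' core: doc id -> fields
other_core = {
    1: {"type": "shirt", "someid": "A"},
    2: {"type": "shirt", "someid": "B"},
    3: {"type": "hat", "someid": "C"},
}

# main core: doc id -> fields
main_core = {
    10: {"size": "large", "someotherid": "A"},
    11: {"size": "small", "someotherid": "B"},
    12: {"size": "large", "someotherid": "C"},
}


def inner_query(core, field, value, from_field):
    """Step 1: run the inner query and collect the distinct join-key terms."""
    return {doc[from_field] for doc in core.values() if doc[field] == value}


def build_postings(core, field):
    """Build a term -> set of doc ids map, standing in for an inverted index."""
    postings = {}
    for doc_id, doc in core.items():
        postings.setdefault(doc[field], set()).add(doc_id)
    return postings


def multi_term_query(postings, terms):
    """Step 2: one postings lookup per term, so the work grows with len(terms)."""
    hits = set()
    for term in terms:
        hits |= postings.get(term, set())
    return hits


# Inner query on 'other': type:shirt, returning the someid values.
terms = inner_query(other_core, "type", "shirt", "someid")

# Outer side: someotherid:(<terms>) on the main core.
joined = multi_term_query(build_postings(main_core, "someotherid"), terms)

# Assumption for this sketch: size:large and the join clause are both required.
large = {doc_id for doc_id, doc in main_core.items() if doc["size"] == "large"}
result = sorted(joined & large)
print(terms, result)
```

With 25k terms coming back from the inner query, the loop in `multi_term_query` runs 25k times, which is the cost being described.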
Re: Cross index join query performance
Hi Joel,

I tried this patch and it is quite a bit faster. Using the same query on a larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin' QTime was 100 msec! This was true for both large and small result sets.

A few notes: the patch didn't compile with 4.3 because of the SolrCore.getLatestSchema call (which I worked around), and the package name should be:

    class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>

Unfortunately, I just learned that our uniqueKey may have to be an alphanumeric string instead of an int, so I'm not out of the woods yet.

Good stuff - thanks.

Peter
Re: Cross index join query performance
It looks like you are using int join keys so you may want to check out SOLR-4787, specifically the hjoin and bjoin.

These perform well when you have a large number of results from the fromIndex. If you have a small number of results in the fromIndex the standard join will be faster.

--
Joel Bernstein
Professional Services LucidWorks
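To make the trade-off a little more concrete, here is a rough Python sketch contrasting a term-driven join with a hash-set join over int keys. It is not the SOLR-4787 code, and the thread does not describe the patch's internals; the hash-set strategy is only inferred from the HashSetJoinQParserPlugin class name, and all function names and data below are made up for illustration.

```python
def term_driven_join(from_keys, to_postings):
    """One index lookup per from-side key: cheap for few keys, costly for many."""
    hits = set()
    for key in from_keys:
        hits.update(to_postings.get(key, ()))
    return hits


def hash_set_join(from_keys, to_doc_keys):
    """Load the from-side int keys into a set, then scan the to-side docs once."""
    key_set = set(from_keys)
    return {doc_id for doc_id, key in enumerate(to_doc_keys) if key in key_set}


# Hypothetical to-side data: position = doc id, value = int join key.
to_doc_keys = [7, 3, 9, 3, 5]

# The same data as a key -> doc ids map, standing in for an inverted index.
to_postings = {}
for doc_id, key in enumerate(to_doc_keys):
    to_postings.setdefault(key, []).append(doc_id)

# Hypothetical join keys returned by the from-side query.
from_keys = [3, 5, 8]

term_hits = term_driven_join(from_keys, to_postings)
hash_hits = hash_set_join(from_keys, to_doc_keys)
print(term_hits == hash_hits)
```

In this sketch the first function's work scales with the number of from-side keys and the second's with the number of to-side documents, which matches the advice above that the standard join wins when the fromIndex returns few results.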
Re: Cross index join query performance
I forgot to mention - this is Solr 4.3

Peter
Cross index join query performance
I'm doing a cross-core join query, and the join query is 30X slower than each of the 2 individual queries. Here are the queries:

Main query: http://localhost:8983/solr/mainindex/select?q=title:java
QTime: 5 msec
hit count: 1000

Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3]
QTime: 4 msec
hit count: 25K

Join query: http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3]
QTime: 160 msec
hit count: 205

Here are the index specs:

mainindex size: 117K docs, 1 segment
mainindex schema:
    ... required="true" multiValued="false" />
    ... stored="true" multiValued="false" />
    ...>docid

subindex size: 117K docs, 1 segment
subindex schema:
    ... required="true" multiValued="false" />
    ... required="false" multiValued="false" />
    ...>docid

With debugQuery=true I see:

    "debug":{
      "join":{
        "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{
          "time":155,
          "fromSetSize":24742,
          "toSetSize":24742,
          "fromTermCount":117810,
          "fromTermTotalDf":117810,
          "fromTermDirectCount":117810,
          "fromTermHits":24742,
          "fromTermHitsTotalDf":24742,
          "toTermHits":24742,
          "toTermHitsTotalDf":24742,
          "toTermDirectCount":24627,
          "smallSetsDeferred":115,
          "toSetDocsAdded":24742}},

Via profiler and debugger, I see 150 msec spent in the outer 'while(term!=null)' loop in JoinQueryWeight.getDocSet(). This seems like a lot of time to join the bitsets. Does this seem right?

Peter
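For anyone reproducing this from a script, a join request like the one above could be assembled as follows. This is a minimal sketch: the `join_url` helper is hypothetical, and the local-params string follows the form shown in the debug output ({!join from=... to=... fromIndex=...}) rather than the URL as typed in the message. Only the Python standard library is used.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def join_url(base, q, from_index, from_field, to_field, inner_q):
    """Build a /select URL with a cross-core join filter (hypothetical helper)."""
    fq = "{!join from=%s to=%s fromIndex=%s}%s" % (
        from_field,
        to_field,
        from_index,
        inner_q,
    )
    # urlencode percent-encodes the braces, brackets and spaces in fq.
    return base + "/select?" + urlencode({"q": q, "fq": fq, "debugQuery": "true"})


url = join_url(
    "http://localhost:8983/solr/mainindex",
    "title:java",
    "subindex",
    "docid",
    "docid",
    "fld1:[0.1 TO 0.3]",
)

# Decode the query string again to check that the parameters round-trip.
params = parse_qs(urlsplit(url).query)
print(url)
print(params["fq"][0])
```

Encoding the filter this way avoids the lost-space problem that is easy to hit when pasting a local-params query into a URL by hand.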