Re: Cross index join query performance

2013-09-30 Thread Peter Keegan
Ah, got it now - thanks for the explanation.





Re: Cross index join query performance

2013-09-28 Thread Upayavira
The thing here is to understand how a join works.

Effectively, it does the inner query first, which results in a list of
terms. It then effectively does a multi-term query with those values.

q=size:large {!join fromIndex=other from=someid to=someotherid}type:shirt

Imagine the inner query returned values A,B,C. Your inner query is, on
core 'other', q=type:shirt&fl=someid.

Then your outer query becomes size:large someotherid:(A B C)
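
The two steps above can be sketched in a few lines of Python. This is a toy model over an in-memory list, not Solr code: the documents in other_core and the helper names inner_query_terms and rewrite_outer_query are made up for illustration.

```python
# Toy model of the rewrite described above; not Solr code.
# The documents and helper names are hypothetical.
other_core = [
    {"someid": "A", "type": "shirt"},
    {"someid": "B", "type": "shirt"},
    {"someid": "C", "type": "shirt"},
    {"someid": "D", "type": "hat"},
]


def inner_query_terms(docs, field, value, from_field):
    """Step 1: run the inner query and collect the distinct join-key values."""
    return sorted({d[from_field] for d in docs if d.get(field) == value})


def rewrite_outer_query(outer, to_field, terms):
    """Step 2: append the collected values as a multi-term clause."""
    return f"{outer} {to_field}:({' '.join(terms)})"


terms = inner_query_terms(other_core, "type", "shirt", "someid")
rewritten = rewrite_outer_query("size:large", "someotherid", terms)
print(rewritten)
```

The length of the generated clause grows with the number of values the inner query returns, which is why the size of the inner result set matters so much.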

Your inner query returns 25k values. You're having to do a multi-term
query for 25k terms. That is *bound* to be slow.

The pseudo-joins in Solr 4.x are intended for a small to medium number
of values returned by the inner query; beyond that, performance degrades,
as you are seeing.

Is there a way you can reduce the number of values returned by the inner
query?

As Joel mentions, those other joins are attempts to find other ways to
work with this limitation.

Upayavira

 


Re: Cross index join query performance

2013-09-27 Thread Peter Keegan
Hi Joel,

I tried this patch and it is quite a bit faster. Using the same query on a
larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin'
QTime was 100 msec! This was true for both large and small result sets.

A few notes: the patch didn't compile with 4.3 because of the
SolrCore.getLatestSchema call (which I worked around), and the package name
should be:
<queryParser name="hjoin" class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>

Unfortunately, I just learned that our uniqueKey may have to be an
alphanumeric string instead of an int, so I'm not out of the woods yet.

Good stuff - thanks.

Peter





Re: Cross index join query performance

2013-09-26 Thread Joel Bernstein
It looks like you are using int join keys so you may want to check out
SOLR-4787, specifically the hjoin and bjoin.

These perform well when you have a large number of results from the
fromIndex. If you have a small number of results in the fromIndex the
standard join will be faster.
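
One way to picture that trade-off is the Python sketch below. It only illustrates the general hash-set join idea and is not taken from the SOLR-4787 patch, whose implementation is not shown in this thread. The function names and sample data are hypothetical.

```python
# Illustration only: a generic hash-set join next to a per-term lookup join.
# This is NOT the SOLR-4787 code; names and data are hypothetical.

def term_lookup_join(from_keys, to_index):
    """Per-term approach: one term-dictionary lookup per key from the 'from' side."""
    matched = set()
    for key in from_keys:
        matched.update(to_index.get(key, ()))
    return matched


def hash_set_join(from_keys, to_docs):
    """Hash-set approach: build a key set once, then scan the 'to' documents."""
    key_set = set(from_keys)
    return {doc_id for doc_id, key in to_docs if key in key_set}


# (internal doc id, int join key) pairs for the 'to' side
to_docs = [(0, 10), (1, 11), (2, 12), (3, 10)]

# Inverted view of the same data: join key -> doc ids
to_index = {}
for doc_id, key in to_docs:
    to_index.setdefault(key, []).append(doc_id)

from_keys = [10, 12, 99]
lookup_result = term_lookup_join(from_keys, to_index)
scan_result = hash_set_join(from_keys, to_docs)
print(sorted(scan_result))
```

In this model the per-term version does work proportional to the number of keys, while the scan does work proportional to the number of 'to' documents regardless of how many keys there are. That is at least consistent with the advice above, but it is a sketch, not a measurement.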






-- 
Joel Bernstein
Professional Services LucidWorks


Cross index join query performance

2013-09-25 Thread Peter Keegan
I'm doing a cross-core join query and the join query is 30X slower than
each of the 2 individual queries. Here are the queries:

Main query: http://localhost:8983/solr/mainindex/select?q=title:java
QTime: 5 msec
hit count: 1000

Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3]
QTime: 4 msec
hit count: 25K

Join query:
http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3]
QTime: 160 msec
hit count: 205
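
Because the join filter contains spaces, braces and brackets, it is easier to reproduce the request by letting a library encode the parameters. A minimal Python sketch follows. It assumes the local params as Solr echoes them in the debug output of this message (from=docid to=docid fromIndex=subindex) rather than the hand-typed URL above, and adding debugQuery=true is also an assumption.

```python
import urllib.parse

# Assumption: local params copied from the debug output in this message,
# not from the hand-typed URL.
base = "http://localhost:8983/solr/mainindex/select"
params = {
    "q": "title:java",
    "fq": "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]",
    "debugQuery": "true",
}

# urlencode percent-encodes the braces and brackets and turns spaces into '+'.
url = f"{base}?{urllib.parse.urlencode(params)}"
print(url)
```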

Here are the index specs:

mainindex size: 117K docs, 1 segment
mainindex schema:
   <field name="docid" type="int" indexed="true" stored="true"
          required="true" multiValued="false" />
   <field name="title" type="text_en_splitting" indexed="true"
          stored="true" multiValued="false" />
   <uniqueKey>docid</uniqueKey>

subindex size: 117K docs, 1 segment
subindex schema:
   <field name="docid" type="int" indexed="true" stored="true"
          required="true" multiValued="false" />
   <field name="fld1" type="float" indexed="true" stored="true"
          required="false" multiValued="false" />
   <uniqueKey>docid</uniqueKey>

With debugQuery=true I see:
  "debug":{
    "join":{
      "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{
        "time":155,
        "fromSetSize":24742,
        "toSetSize":24742,
        "fromTermCount":117810,
        "fromTermTotalDf":117810,
        "fromTermDirectCount":117810,
        "fromTermHits":24742,
        "fromTermHitsTotalDf":24742,
        "toTermHits":24742,
        "toTermHitsTotalDf":24742,
        "toTermDirectCount":24627,
        "smallSetsDeferred":115,
        "toSetDocsAdded":24742}},

Via profiler and debugger, I see 150 msec spent in the outer
'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems like a
lot of time to join the bitsets. Does this seem right?
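
For scale, the loop time can be divided by the term counts from the debug output. The Python sketch below only does that arithmetic. Whether the loop visits every term in the 'from' field or only the matching ones is an assumption either way, so both readings are computed.

```python
# Figures copied from the debug output and the profiler note above.
loop_ms = 150             # time seen in the while(term!=null) loop
from_term_count = 117810  # fromTermCount
from_term_hits = 24742    # fromTermHits

# Assumption A: the loop visits every term in the 'from' field once.
per_term_us = loop_ms / from_term_count * 1000

# Assumption B: only the matching terms carry the cost.
per_hit_us = loop_ms / from_term_hits * 1000

print(f"{per_term_us:.2f} us per term visited")
print(f"{per_hit_us:.2f} us per matching term")
```

Under either reading the cost per term is a few microseconds, so the total is driven by the number of terms rather than by any single slow step.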

Peter