Re: Cross index join query performance

Upayavira Sat, 28 Sep 2013 00:44:31 -0700

The thing here is to understand how a join works.

Effectively, it runs the inner query first, which produces a list of
terms. It then runs a multi-term query with those values.


q=size:large {!join fromIndex=other from=someid
to=someotherid}type:shirt

Your inner query is, on core 'other', q=type:shirt&fl=someid. Imagine
it returned the values A, B and C.

Then your outer query becomes size:large someotherid:(A B C)

Your inner query returns 25k values. You're having to do a multi-term
query for 25k terms. That is *bound* to be slow.
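
A minimal sketch of that two-step expansion in plain Python, with an
in-memory list standing in for the 'other' core. The documents and the
helper name expand_join are invented for illustration; this is not Solr's
code, only the shape of the work it has to do.

```python
# Toy stand-in for the 'other' core; all documents here are invented.
other_core = [
    {"someid": "A", "type": "shirt"},
    {"someid": "B", "type": "shirt"},
    {"someid": "C", "type": "shirt"},
    {"someid": "D", "type": "hat"},
]

def expand_join(from_docs, from_field, match_field, match_value,
                to_field, outer_clause):
    """Hypothetical helper: run the inner query, then rewrite the outer one."""
    # Step 1: inner query (e.g. type:shirt) on the 'from' core, keeping
    # only the distinct join-key values, in first-seen order.
    terms = []
    seen = set()
    for doc in from_docs:
        if doc.get(match_field) == match_value:
            key = doc[from_field]
            if key not in seen:
                seen.add(key)
                terms.append(key)
    # Step 2: the outer query gains one clause per collected term.
    return "%s %s:(%s)" % (outer_clause, to_field, " ".join(terms)), len(terms)

query, term_count = expand_join(
    other_core, "someid", "type", "shirt", "someotherid", "size:large")
print(query)       # size:large someotherid:(A B C)
print(term_count)  # 3
```

With 25k matching documents on the inner side, the term list in step 2
has up to 25k entries, which is where the time goes.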

The pseudo-joins in Solr 4.x are intended for a small to medium number
of values returned by the inner query; beyond that, performance degrades,
as you are seeing.

Is there a way you can reduce the number of values returned by the inner
query?

As Joel mentions, those other joins are attempts to find other ways to
work with this limitation.
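
For contrast, a rough sketch of the hash-set idea that the 'hjoin' name
suggests: collect the inner keys into a set once, then make a single pass
over the outer side testing membership. This is not taken from the
SOLR-4787 patch, so treat it as an illustration of the general technique
only; the function name and sample data are invented.

```python
def hash_set_join(from_docs, from_key, to_docs, to_key):
    """Hypothetical helper: keep to-side docs whose key appears on the from side."""
    # Build the key set once; membership tests are O(1) on average.
    keys = {doc[from_key] for doc in from_docs}
    # Single pass over the to side, with no per-term index lookup.
    return [doc for doc in to_docs if doc[to_key] in keys]

# Invented sample data with int join keys.
sub_hits = [{"docid": 2}, {"docid": 3}, {"docid": 5}]
main_hits = [
    {"docid": 1, "title": "java"},
    {"docid": 3, "title": "java"},
    {"docid": 5, "title": "java"},
]

joined = hash_set_join(sub_hits, "docid", main_hits, "docid")
print([d["docid"] for d in joined])  # [3, 5]
```

The cost here grows with the number of documents scanned rather than with
the number of terms looked up, which fits Joel's remark that these joins
do well on large inner result sets while the standard join wins on small
ones.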

Upayavira

On Fri, Sep 27, 2013, at 09:44 PM, Peter Keegan wrote:
> Hi Joel,
> 
> I tried this patch and it is quite a bit faster. Using the same query on a
> larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin'
> QTime was 100 msec! This was true for large and small result sets.
> 
> A few notes: the patch didn't compile with 4.3 because of the
> SolrCore.getLatestSchema call (which I worked around), and the package name
> should be:
> <queryParser name="hjoin"
> class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>
> 
> Unfortunately, I just learned that our uniqueKey may have to be an
> alphanumeric string instead of an int, so I'm not out of the woods yet.
> 
> Good stuff - thanks.
> 
> Peter
> 
> 
> On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein <joels...@gmail.com> wrote:
> 
> > It looks like you are using int join keys so you may want to check out
> > SOLR-4787, specifically the hjoin and bjoin.
> >
> > These perform well when you have a large number of results from the
> > fromIndex. If you have a small number of results in the fromIndex the
> > standard join will be faster.
> >
> >
> > On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >
> > > I forgot to mention - this is Solr 4.3
> > >
> > > Peter
> > >
> > >
> > >
> > > On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> > >
> > > > I'm doing a cross-core join query and the join query is 30X slower than
> > > > each of the 2 individual queries. Here are the queries:
> > > >
> > > > Main query: http://localhost:8983/solr/mainindex/select?q=title:java
> > > > QTime: 5 msec
> > > > hit count: 1000
> > > >
> > > > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3]
> > > > QTime: 4 msec
> > > > hit count: 25K
> > > >
> > > > Join query:
> > > > http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3]
> > > > QTime: 160 msec
> > > > hit count: 205
> > > >
> > > > Here are the index specs:
> > > >
> > > > mainindex size: 117K docs, 1 segment
> > > > mainindex schema:
> > > >    <field name="docid" type="int" indexed="true" stored="true"
> > > > required="true" multiValued="false" />
> > > >    <field name="title" type="text_en_splitting" indexed="true"
> > > > stored="true" multiValued="false" />
> > > >    <uniqueKey>docid</uniqueKey>
> > > >
> > > > subindex size: 117K docs, 1 segment
> > > > subindex schema:
> > > >    <field name="docid" type="int" indexed="true" stored="true"
> > > > required="true" multiValued="false" />
> > > >    <field name="fld1" type="float" indexed="true" stored="true"
> > > > required="false" multiValued="false" />
> > > >    <uniqueKey>docid</uniqueKey>
> > > >
> > > > With debugQuery=true I see:
> > > >   "debug":{
> > > >     "join":{
> > > >       "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{
> > > >         "time":155,
> > > >         "fromSetSize":24742,
> > > >         "toSetSize":24742,
> > > >         "fromTermCount":117810,
> > > >         "fromTermTotalDf":117810,
> > > >         "fromTermDirectCount":117810,
> > > >         "fromTermHits":24742,
> > > >         "fromTermHitsTotalDf":24742,
> > > >         "toTermHits":24742,
> > > >         "toTermHitsTotalDf":24742,
> > > >         "toTermDirectCount":24627,
> > > >         "smallSetsDeferred":115,
> > > >         "toSetDocsAdded":24742}},
> > > >
> > > > Via profiler and debugger, I see 150 msec spent in the outer
> > > > 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems like a
> > > > lot of time to join the bitsets. Does this seem right?
> > > >
> > > > Peter
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Joel Bernstein
> > Professional Services LucidWorks
> >
