Re: Performance of cross join vs block join

Roman Chyla Fri, 12 Jul 2013 09:32:03 -0700

Hi Mikhail,
I have commented on your blog, but it seems I have done something wrong, as
the comment is not there. Would it be possible to share the test setup (script)?


I have found that the crucial factor with joins is the number of 'joins'
[hits returned], and it seems that the experiments I have seen so far were
geared towards small collections. Even if Erick's index was 26M, the number
of hits was probably small. You can see a very different story if you face
some [other] real data. Here is a citation network, on which I was comparing
Lucene's joins [i.e. not the block joins, because these cannot be used for
citation data - we cannot reasonably index them into one segment]:

https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png

Note that the y axis is on a sqrt scale, so the running time for the Lucene
join is growing, and growing very fast! It takes Lucene 30s to do the search
that selects 1M hits.

The comparison is against our own implementation of a similar search - but
the main point I am making is that the join benchmarks should be showing
the number of hits selected by the join operation. Otherwise, a very
important detail is hidden.
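
As a minimal sketch of this point (not the actual test setup; the toy
citation data and the names cited_by, time_join and JoinTiming are
hypothetical), a benchmark can record the number of hits selected by the
join next to each timing:

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List, NamedTuple


class JoinTiming(NamedTuple):
    label: str
    hits: int       # number of documents selected by the join
    seconds: float  # wall-clock time spent running the join


def cited_by(citations: Dict[int, List[int]], selected: Iterable[int]) -> Iterator[int]:
    """Toy stand-in for a join: yield each paper cited by a selected paper, once."""
    seen = set()
    for paper in selected:
        for target in citations.get(paper, ()):
            if target not in seen:
                seen.add(target)
                yield target


def time_join(label: str, run: Callable[[], Iterable[int]]) -> JoinTiming:
    """Run the join once and report its hit count together with the elapsed time."""
    start = time.perf_counter()
    hits = sum(1 for _ in run())
    return JoinTiming(label, hits, time.perf_counter() - start)


# Hypothetical citation network: paper id -> ids of the papers it cites.
CITATIONS = {1: [2, 3], 2: [3], 3: [], 4: [1, 2, 3]}

small = time_join("one seed paper", lambda: cited_by(CITATIONS, [1]))
large = time_join("three seed papers", lambda: cited_by(CITATIONS, [1, 2, 4]))

for result in (small, large):
    print(f"{result.label}: hits={result.hits} time={result.seconds:.6f}s")
```

Plotting seconds against hits, rather than against index size alone, is what
makes the growth visible.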

Best,

  roman


On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu <mihaela...@yahoo.com
> >wrote:
>
> > Hi Mikhail,
> >
> > I used the term block join incorrectly. When I said block join I was
> > referring to a join performed on a single core, versus a cross join, which
> > is performed on multiple cores.
> > But I saw your benchmark (from cache) and it seems that block join has
> > better performance. Is this functionality available in Solr 4.3.1?
>
> Nope, SOLR-3076 has been waiting for ages.
>
>
> > I did not find such examples on Solr's wiki page.
> > Does this functionality require a special schema, or a special indexing?
>
> Special indexing - yes.
>
>
> > How would I need to index the data from my tables? In my case all the
> > indices have a common schema anyway, since I am using dynamic fields, so I
> > can easily add all documents from all tables to one Solr core and add a
> > discriminator field to each document.
> >
> Correct, but the notion of a 'discriminator field' is a little bit different
> for block join.
>
>
> >
> > Could you point me to some more documentation?
> >
>
> I can recommend only these:
>
> http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
> http://www.youtube.com/watch?v=-OiIlIijWH0
>
>
> > Thanks in advance,
> > Mihaela
> >
> >
> > ________________________________
> >  From: Mikhail Khludnev <mkhlud...@griddynamics.com>
> > To: solr-user <solr-user@lucene.apache.org>; mihaela olteanu <
> > mihaela...@yahoo.com>
> > Sent: Thursday, July 11, 2013 2:25 PM
> > Subject: Re: Performance of cross join vs block join
> >
> >
> > Mihaela,
> >
> > To me it seems reasonable that a single-core join takes the same time as a
> > cross-core one. I just can't see what gain could be obtained in the former
> > case.
> > I am hardly able to comment on the join code. I looked into it, and it's
> > not trivial, at least. Block join doesn't need to obtain parentId term
> > values/numbers and look up parents by them. Both of these actions are
> > expensive. Also, block join works as an iterator, whereas join needs to
> > allocate memory for the parents bitset and populate it out of order, which
> > impacts scalability.
> > Also, in None scoring mode BJQ doesn't need to walk through all children,
> > only the first hit. Another nice feature is 'both-side leapfrog': if a
> > highly restrictive filter/query intersects with BJQ, it allows skipping
> > many parents and children as well. That's not possible in Join, which has
> > a fairly 'full-scan' nature.
> > The main performance factor for Join is the number of child docs.
> > I'm not sure I got all your questions; please specify them in more detail
> > if something is still unclear.
> > Have you seen my benchmark
> > http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?
> >
> >
> >
> > On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu <mihaela...@yahoo.com
> > >wrote:
> >
> > > Hello,
> > >
> > > Does anyone know of any performance measurements for cross joins
> > > compared to joins inside a single index?
> > >
> > > Is a join inside a single index that stores all documents of various
> > > types (from the parent table or from children tables) with a
> > > discriminator field faster than a cross join (where each document type
> > > resides in its own index)?
> > >
> > > I have performed some tests, but it seems to me that having a join in a
> > > single (bigger) index does not add much of a speed improvement compared
> > > to cross joins.
> > >
> > > Why would a block join be faster than a cross join if this is the case?
> > > What are the variables that count when trying to improve the query
> > > execution time?
> > >
> > > Thanks!
> > > Mihaela
> >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> > <mkhlud...@griddynamics.com>
>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
>  <http://www.griddynamics.com>
> <mkhlud...@griddynamics.com>
>
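
A rough illustration of the two strategies Mikhail describes in the quoted
message above. This is a toy in-memory model under stated assumptions, not
Lucene or Solr code: DOCS, query_time_join and block_join are hypothetical
names, and the only Lucene convention borrowed is that in a block the
children are indexed first and their parent last.

```python
from typing import Iterator, List

# Toy index in document order: each block is its children followed by the parent.
DOCS = [
    {"id": "c1", "parent_id": "p1", "color": "red"},
    {"id": "c2", "parent_id": "p1", "color": "blue"},
    {"id": "p1", "is_parent": True},
    {"id": "c3", "parent_id": "p2", "color": "blue"},
    {"id": "p2", "is_parent": True},
    {"id": "c4", "parent_id": "p3", "color": "red"},
    {"id": "c5", "parent_id": "p3", "color": "red"},
    {"id": "p3", "is_parent": True},
]


def matching_children(color: str) -> List[int]:
    """Doc numbers of children matching the child query, in index order."""
    return [i for i, doc in enumerate(DOCS) if doc.get("color") == color]


def query_time_join(child_hits: List[int]) -> List[int]:
    """Join by key: read each child's parent_id, look the parent up, set a bit."""
    parent_by_id = {doc["id"]: i for i, doc in enumerate(DOCS) if doc.get("is_parent")}
    bits = [False] * len(DOCS)  # memory proportional to the index size
    for child in child_hits:    # every matching child is visited
        bits[parent_by_id[DOCS[child]["parent_id"]]] = True
    return [i for i, bit in enumerate(bits) if bit]


def block_join(child_hits: List[int]) -> Iterator[int]:
    """Join by position: the parent is the next parent doc after the child."""
    last_parent = -1
    for child in child_hits:    # assumed sorted ascending
        if child < last_parent:
            continue            # block already produced its parent; skip siblings
        parent = child
        while not DOCS[parent].get("is_parent"):
            parent += 1
        last_parent = parent
        yield parent            # parents come out in index order, no bitset needed


red = matching_children("red")
print(query_time_join(red), list(block_join(red)))
```

Both functions return the same parents; the difference is that the key-based
join needs a value lookup per child and a bitset over the whole index, while
the block version only walks forward through adjacent documents.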
