Re: revisit naming for grouping/join?

mark harwood Fri, 01 Jul 2011 06:28:43 -0700

>> I think what would be best is a smallish but feature complete demo,

For the nested stuff I had a reasonable demo on LUCENE-2454 that was based
around resumes. That use case has the one-to-many characteristics that lend
themselves to nesting, e.g. a person has many different qualifications and
records of employment.
This scenario was illustrated here:
http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

I also had the "book search" type scenario where a book has many sections and,
for the purposes of efficient highlighting/summarisation, these sections were
treated as child docs which could be read quickly (rather than highlighting a
whole book).

I'm not sure what the "parent" was in your doctor and cities example, Mike. If
a doctor is in only one city then there is no point making city a child doc as
the one city info can happily be combined with the doctor info into a single
document with no conflict (doctors have different properties to cities).
If the city is the parent with many child doctor docs, that makes more sense but
feels like a less likely use case, e.g. "find me a city with doctor x and a
different doctor y".
Searching for a person with excellent java and preferably good lucene skills
feels like a more real-world example.
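
To make the one-to-many point concrete, here is a minimal plain-Python sketch
(not Lucene API; the data, field names and function names are all invented for
illustration). It assumes that flattening a resume into one document turns
"skill" and "level" into independent multi-valued fields, which is what lets a
query for "excellent java" match the wrong person.

```python
# Hypothetical resume data: one parent (person) with many child skill records.
resumes = {
    "alice": [{"skill": "java", "level": "excellent"},
              {"skill": "lucene", "level": "good"}],
    "bob":   [{"skill": "java", "level": "poor"},
              {"skill": "lucene", "level": "excellent"}],
}


def flat_match(children, skill, level):
    """Flattened document: skill and level are independent multi-valued fields."""
    skills = {c["skill"] for c in children}
    levels = {c["level"] for c in children}
    return skill in skills and level in levels


def nested_match(children, skill, level):
    """Nested documents: both clauses must match within the same child record."""
    return any(c["skill"] == skill and c["level"] == level for c in children)


flat_hits = sorted(p for p, kids in resumes.items()
                   if flat_match(kids, "java", "excellent"))
nested_hits = sorted(p for p, kids in resumes.items()
                     if nested_match(kids, "java", "excellent"))
```

In this toy data the flattened form also returns bob (his "excellent" belongs
to lucene, not java), while the nested form returns only alice.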

It feels like documenting some of the trade-offs behind index design choices is
useful too, e.g. nesting is a poor fit for very volatile content with
constantly changing children, while search-time join costs more in RAM and
needs 2-pass processing.

Cheers
Mark

----- Original Message ----
From: Michael McCandless <luc...@mikemccandless.com>
To: dev@lucene.apache.org
Sent: Fri, 1 July, 2011 13:51:04
Subject: Re: revisit naming for grouping/join?

I think joining and grouping are two different functions, and we
should keep different modules for them...

On Thu, Jun 30, 2011 at 10:30 PM, Robert Muir <rcm...@gmail.com> wrote:
> Hi,
>
> after just a very quick glance at some of the newer
> grouping/join features, I found myself a little confused about what is
> exactly what, and I think users might too.

They are confusing!

> I discussed some of this with hossman, and it only seemed to make me
> even more totally confused about:
> * difference between field collapsing and grouping

I like the name grouping better here: I think field collapsing
undersells (it's only one specific way to use grouping).  EG, grouping
w/o collapsing is useful (eg, Best Buy grouping hits by product
category and showing the top 5 in each).

> * difference between nested documents and the index-time join

Similarly I think nested docs undersells index-time join: you can
join (either during indexing or during searching) in many different
ways, and nested docs is just one use case.

EG, maybe your docs are doctors but during indexing you join to a city
table with facts about that city (each doctor's office is in a
specific city) and then you want to run queries like "city's avg
annual temp > 60 and doctor has good bedside manner" or something.

> * difference between index-time-join/nested documents and single-pass
> index-time grouping. Is the former only a more general case of the
> latter?

Grouping is purely a presentation concern -- you are not altering
which docs hit; you are simply changing how you pick which hits to
display ("top N by group").  So we only have collectors here.

The "generic" (requires 2 passes) collectors can group on anything at
search time; the "doc block" collector requires that you indexed all
docs in each group as a block.
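
As a rough illustration of the generic two-pass idea, here is a plain-Python
sketch. It is not the grouping module's collectors; the hit tuples and the
function names first_pass/second_pass are hypothetical. It assumes groups are
ranked by their best-scoring hit. Note that no hit is filtered out by the
query side; grouping only decides which hits get displayed.

```python
from collections import defaultdict

# Hypothetical hits as (doc_id, group_value, score) tuples.
hits = [
    (1, "tv", 0.9), (2, "tv", 0.8), (3, "laptop", 0.95),
    (4, "tv", 0.4), (5, "phone", 0.3), (6, "laptop", 0.5),
]


def first_pass(hits, top_groups):
    """Pass 1: rank groups by their best hit score and keep the top ones."""
    best = {}
    for _, group, score in hits:
        best[group] = max(score, best.get(group, float("-inf")))
    return sorted(best, key=best.get, reverse=True)[:top_groups]


def second_pass(hits, groups, docs_per_group):
    """Pass 2: within each kept group, keep the highest-scoring docs."""
    buckets = defaultdict(list)
    for doc_id, group, score in hits:
        if group in groups:
            buckets[group].append((score, doc_id))
    return {
        g: [doc_id for _, doc_id in sorted(buckets[g], reverse=True)[:docs_per_group]]
        for g in groups
    }


groups = first_pass(hits, top_groups=2)
grouped = second_pass(hits, groups, docs_per_group=2)
```

The second pass needs the winning groups from the first, which is why the
generic approach has to visit the hits twice.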

Join is both about restricting matches and also presentation of hits,
because your query needs to match fields from different [logical]
tables (so, the module has a Query and a Collector).  When you get the
results back, you may or may not be interested in retaining the table
structure in your result set (ie, you may not have selected fields
from the child table).

Similarly, "generic" joining (in Solr/ElasticSearch today but I'd like
to factor into the join module) can do any join at search time, while
the "doc block" collector requires that you did the necessary join(s)
during indexing.
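
A minimal plain-Python sketch of the doc block idea follows. It is a toy model,
not the join module's Query or Collector: the index list, parent_bits and the
function names are hypothetical. It assumes each block is indexed as its child
docs followed by the parent as the last doc in the block, so a child match can
be mapped to its parent by scanning forward.

```python
# Hypothetical index laid out as doc blocks: children first, parent last.
index = [
    {"type": "skill", "name": "java", "level": "excellent"},  # doc 0
    {"type": "skill", "name": "lucene", "level": "good"},     # doc 1
    {"type": "person", "name": "alice"},                      # doc 2 (parent)
    {"type": "skill", "name": "java", "level": "poor"},       # doc 3
    {"type": "person", "name": "bob"},                        # doc 4 (parent)
]

# One flag per doc marking the parents; stands in for a bit set.
parent_bits = [doc["type"] == "person" for doc in index]


def parent_of(child_doc_id):
    """Scan forward to the parent that closes this child's block.

    Assumes every child is followed by a parent somewhere later in the index.
    """
    doc_id = child_doc_id
    while not parent_bits[doc_id]:
        doc_id += 1
    return doc_id


def join_to_parents(child_matches):
    """Map child matches (in doc order) to parents, one entry per parent."""
    parents = []
    for child in child_matches:
        parent = parent_of(child)
        if not parents or parents[-1] != parent:
            parents.append(parent)
    return parents


child_matches = [i for i, doc in enumerate(index)
                 if doc["type"] == "skill" and doc["name"] == "java"]
parent_hits = [index[p]["name"] for p in join_to_parents(child_matches)]
```

The join here costs a single forward pass with no lookup table, but only
because the block layout was fixed when the docs were indexed.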

> * difference between the above joinish capabilities and solr's join
> impl... other than the single-pass/index-time limitation (which is
> really an implementation detail), I'm talking about use cases.

Solr's/ElasticSearch's join is more general because you can join
anything at search time (even across 2 different indexes), vs doc
block join where you must pick which joins you will ever want to use
and then build the index accordingly.

You can also mix the two.  Maybe you do certain joins while indexing,
but then at search time you do other joins "generically".  That's
fine.  (Same is true for grouping).

> I think it's especially interesting since the join module depends on
> the grouping module.

The join module does currently depend on the grouping module, but for
a silly reason: just for the TopGroups, to represent the returned
hits.  We could move TopGroups/GroupDocs into common (thus justifying
its generic name!)?  Then both join and grouping modules depend on
common.

Really TopGroups is just a TopDocs that allows some recursion (ie,
each hit may in turn be another TopDocs).  But TopGroups is limited
now to only depth 2 recursion... we need to fix this for nested
grouping.  Really we just need a recursive TopDocs here....
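
To show what a recursive result structure could look like, here is a small
plain-Python sketch. The Hit class and its fields are hypothetical and are not
the existing TopDocs/TopGroups classes; it only illustrates that once each hit
may carry its own list of sub-hits, the nesting depth stops being fixed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hit:
    """A hit that may itself contain a ranked list of sub-hits."""
    label: str
    score: float
    sub_hits: List["Hit"] = field(default_factory=list)


def depth(hit):
    """Number of levels below and including this hit."""
    return 1 + max((depth(h) for h in hit.sub_hits), default=0)


def leaf_labels(hit):
    """Labels of the leaf hits (the actual documents), in order."""
    if not hit.sub_hits:
        return [hit.label]
    labels = []
    for sub in hit.sub_hits:
        labels.extend(leaf_labels(sub))
    return labels


# Hypothetical nested grouping: category -> brand -> document.
result = Hit("tv", 0.9, [
    Hit("brand-a", 0.9, [Hit("doc-1", 0.9), Hit("doc-2", 0.8)]),
    Hit("brand-b", 0.4, [Hit("doc-4", 0.4)]),
])
```

Here depth(result) is 3, one level deeper than a structure limited to groups
of docs could represent.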

> So I am curious if we should:
> * add docs (maybe with simple examples) in the package.html or
> otherwise that differentiate what these guys are, or at least agree on
> some consistent terminology and define it somewhere? I feel like
> people have explained to me the differences in all these things
> before, but then it's easy to forget.

Well, each module's package.html has a start here, but I agree we
should do more.

I think what would be best is a smallish but feature complete demo, ie
pull together some easy-to-understand sample content and then build a
small demo app around it.  We could then show how to use grouping for
field collapsing (and for other use cases), joining for nested docs
(and for other use cases), etc.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
