>> I think what would be best is a smallish but feature complete demo,
For the nested stuff I had a reasonable demo on LUCENE-2454 that was based around resumes - that use case has the one-to-many characteristics that lend themselves to nesting, e.g. a person has many different qualifications and records of employment. This scenario was illustrated here:

http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

I also had the "book search" type scenario where a book has many sections and, for the purposes of efficient highlighting/summarisation, these sections were treated as child docs which could be read quickly (rather than highlighting a whole book).

I'm not sure what the "parent" was in your doctor and cities example, Mike. If a doctor is in only one city then there is no point making city a child doc, as the one city's info can happily be combined with the doctor info into a single document with no conflict (doctors have different properties to cities). If the city is the parent with many child doctor docs that makes more sense but feels like a less likely use case, e.g. "find me a city with doctor x and a different doctor y". Searching for a person with excellent java and preferably good lucene skills feels like a more real-world example.

It feels like documenting some of the trade-offs behind index design choices is useful too, e.g. nesting is not too great for very volatile content with constantly changing children, while search-time join is more costly in RAM and requires 2-pass processing.

Cheers
Mark

----- Original Message ----
From: Michael McCandless <luc...@mikemccandless.com>
To: dev@lucene.apache.org
Sent: Fri, 1 July, 2011 13:51:04
Subject: Re: revisit naming for grouping/join?

I think joining and grouping are two different functions, and we should keep different modules for them...

On Thu, Jun 30, 2011 at 10:30 PM, Robert Muir <rcm...@gmail.com> wrote:
> Hi,
>
> when looking at just a very quick glance at some of the newer
> grouping/join features, I found myself a little confused about what is
> exactly what, and I think users might too.

They are confusing!

> I discussed some of this with hossman, and it only seemed to make me
> even more totally confused about:
> * difference between field collapsing and grouping

I like the name grouping better here: I think field collapsing undersells (it's only one specific way to use grouping). EG, grouping w/o collapsing is useful (eg, Best Buy grouping hits by product category and showing the top 5 in each).

> * difference between nested documents and the index-time join

Similarly I think nested docs undersells index-time join: you can join (either during indexing or during searching) in many different ways, and nested docs is just one use case. EG, maybe your docs are doctors but during indexing you join to a city table with facts about that city (each doctor's office is in a specific city) and then you want to run queries like "city's avg annual temp > 60 and doctor has good bedside manner" or something.

> * difference between index-time-join/nested documents and single-pass
> index-time grouping. Is the former only a more general case of the
> latter?

Grouping is purely a presentation concern -- you are not altering which docs hit; you are simply changing how you pick which hits to display ("top N by group"). So we only have collectors here. The "generic" (requires 2 passes) collectors can group on anything at search time; the "doc block" collector requires that you indexed all docs in each group as a block.

Join is both about restricting matches and also presentation of hits, because your query needs to match fields from different [logical] tables (so, the module has a Query and a Collector).
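
To make the distinction above concrete, here is a minimal plain-Java sketch. It does not use the Lucene grouping or join modules; the class GroupingVsJoinSketch, the Hit and Block types, and both method names are hypothetical and exist only for illustration. The first method shows grouping as a two-pass, presentation-only step over hits that already matched; the second shows a join restricting which parents match at all, using the resume/skills scenario as the parent/child example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Illustration only: none of these types are Lucene classes. */
final class GroupingVsJoinSketch {

    /** Hypothetical scored hit: doc id, the group it belongs to, and its score. */
    static final class Hit {
        final int doc;
        final String group;
        final float score;
        Hit(int doc, String group, float score) {
            this.doc = doc;
            this.group = group;
            this.score = score;
        }
    }

    /** Hypothetical doc block: one parent (a person) plus its child docs (skills). */
    static final class Block {
        final String parent;
        final List<String> children;
        Block(String parent, List<String> children) {
            this.parent = parent;
            this.children = children;
        }
    }

    /** Grouping: the set of matching hits is unchanged; we only choose what to display. */
    static Map<String, List<Hit>> topNByGroup(List<Hit> hits, int topGroups, int docsPerGroup) {
        // Pass 1: find the best score per group, then keep the best topGroups groups.
        Map<String, Float> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            best.merge(h.group, h.score, Float::max);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> Float.compare(best.get(b), best.get(a)));
        if (groups.size() > topGroups) {
            groups = groups.subList(0, topGroups);
        }
        // Pass 2: collect the hits of the chosen groups and keep the top docsPerGroup of each.
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String g : groups) {
            result.put(g, new ArrayList<>());
        }
        for (Hit h : hits) {
            List<Hit> bucket = result.get(h.group);
            if (bucket != null) {
                bucket.add(h);
            }
        }
        for (List<Hit> bucket : result.values()) {
            bucket.sort((a, b) -> Float.compare(b.score, a.score));
            if (bucket.size() > docsPerGroup) {
                bucket.subList(docsPerGroup, bucket.size()).clear();
            }
        }
        return result;
    }

    /** Join: a parent is a hit only if at least one of its children satisfies the child criterion. */
    static List<String> parentsWithMatchingChild(List<Block> blocks, Predicate<String> childMatches) {
        List<String> parents = new ArrayList<>();
        for (Block b : blocks) {
            if (b.children.stream().anyMatch(childMatches)) {
                parents.add(b.parent);
            }
        }
        return parents;
    }
}
```

Calling topNByGroup(hits, 2, 2) keeps the two best groups and at most two hits in each, without discarding any hit as a non-match; parentsWithMatchingChild, by contrast, changes which parents are returned at all.
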
When you get the results back, you may or may not be interested in retaining the table structure in your result set (ie, you may not have selected fields from the child table).

Similarly, "generic" joining (in Solr/ElasticSearch today but I'd like to factor into the join module) can do any join at search time, while the "doc block" collector requires that you did the necessary join(s) during indexing.

> * difference between the above joinish capabilities and solr's join
> impl... other than the single-pass/index-time limitation (which is
> really an implementation detail), I'm talking about use cases.

Solr's/ElasticSearch's join is more general because you can join anything at search time (even, across 2 different indexes), vs doc block join where you must pick which joins you will ever want to use and then build the index accordingly.

You can also mix the two. Maybe you do certain joins while indexing, but then at search time you do other joins "generically". That's fine. (Same is true for grouping).

> I think its especially interesting since the join module depends on
> the grouping module.

The join module does currently depend on the grouping module, but for a silly reason: just for the TopGroups, to represent the returned hits. We could move TopGroups/GroupDocs into common (thus justifying its generic name!)? Then both join and grouping modules depend on common.

Really TopGroups is just a TopDocs that allows some recursion (ie, each hit may in turn be another TopDocs). But TopGroups is limited now to only depth 2 recursion... we need to fix this for nested grouping. Really we just need a recursive TopDocs here....

> So I am curious if we should:
> * add docs (maybe with simple examples) in the package.html or
> otherwise that differentiate what these guys are, or at least agree on
> some consistent terminology and define it somewhere? I feel like
> people have explained to me the differences in all these things
> before, but then its easy to forget.

Well, each module's package.html has a start here, but I agree we should do more.

I think what would be best is a smallish but feature complete demo, ie pull together some easy-to-understand sample content and then build a small demo app around it. We could then show how to use grouping for field collapsing (and for other use cases), joining for nested docs (and for other use cases), etc.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org