RE: Special Parent / Child relationship - advice / observations welcome on how to approach this

Jonathan Rochkind Tue, 23 Nov 2010 19:05:02 -0800

I gather that your solr documents are the "Title Information" units. Have you 
considered making your Solr document collection be the "book information" units 
instead?   Each "book information" document will have (yes, de-normalized) the 
same "title" information as all the other book documents belonging to the same 
'title information'.  You can even give each 'book information' document some 
kind of 'title information' id that can be used to fetch all 'book information' 
documents belonging to the same 'title information'.  (If we just call these 
'bibs' and 'holdings' this might be less confusing for us library people).
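
A minimal sketch of that de-normalization, in Python rather than anything 
Solr-specific. The field names (id, title_id, title, location, format) and the 
input shape are assumptions made for illustration, not a real schema, and the 
sample data is the example from the quoted message below.

```python
from typing import Dict, List


def denormalize(title: Dict) -> List[Dict]:
    """Build one flat document per holding, copying the title fields into each."""
    docs = []
    for i, holding in enumerate(title["holdings"]):
        docs.append({
            "id": f"{title['id']}-{i}",   # hypothetical unique key per holding
            "title_id": title["id"],      # shared id to regroup holdings by title
            "title": title["title"],      # de-normalized title-level field
            "location": holding["location"],
            "format": holding["format"],
        })
    return docs


def matching_title_ids(docs: List[Dict], location: str, format_: str) -> List[str]:
    """Titles that have at least one holding carrying BOTH values."""
    return sorted({d["title_id"] for d in docs
                   if d["location"] == location and d["format"] == format_})


title = {
    "id": "t1",
    "title": "Harry Potter and the Last Crusade",
    "holdings": [
        {"location": "Main", "format": "Book"},
        {"location": "Main", "format": "DVD"},
        {"location": "Branch", "format": "Book"},
    ],
}
holding_docs = denormalize(title)
print(matching_title_ids(holding_docs, "Branch", "DVD"))
print(matching_title_ids(holding_docs, "Main", "DVD"))
```

Because each document is a single holding, a filter on both location and 
format can only match when one holding carries both values. The cost is that 
the result list is holdings, so the client has to collapse them back to titles 
through title_id.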


No doubt modelling things this way will bring its own challenges, but it will 
solve the particular problems you mention, I believe. Solr is not an rdbms, and 
de-normalizing to the right level, so your solr documents represent the proper 
units of granularity for the kinds of queries you want to do, is usually the 
trick to getting solr to do what you want. 

The challenge, of course, is when the kinds of queries you want really require 
multiple different levels of granularity.  I haven't found any great general 
purpose solutions to this problem; it's sort of the gap between what solr is 
good at and what an rdbms is good at. 
________________________________________
From: Bob Sandiford [bob.sandif...@sirsidynix.com]
Sent: Tuesday, November 23, 2010 7:26 PM
To: solr-user@lucene.apache.org
Subject: Special Parent / Child relationship - advice / observations welcome on 
how to approach this

Hi,

Long post - sorry...

I have a relatively special case of a Parent / Child relationship that I'm 
trying to model.

I'm currently using Solr 1.4.1 and Lucene 2.9.3

For example, my Parent documents represent "Title Information" (e.g. 
bibliographic information), and each Parent document can contain 0 or more 
Children, each child representing a physical copy of the Book information (e.g. 
think of a library with multiple branches: the child documents each represent a 
book (or other format) available at a given branch).

What I *think* is special about this set up is that the only information I need 
Solr / Lucene to make use of is all Facet based, AND, I don't need facet counts 
for these child facets.

So, for example, "Harry Potter and the Last Crusade"  (J.K. Rowling's upcoming 
blockbuster novel :)) has the following information.

Location    Format
  Main      Book
  Main      DVD
  Branch    Book

etc etc.  This is a bit simplified - there are actually 5 fields involved for 
each child document, and each of the five fields can be used individually 
(easy!) or in combination (much harder!) to refine a result set.

With the relatively straightforward approach of having Location and Format as 
ordinary everyday facet fields, I certainly get results (although I ignore the 
facet counts).  So, I search for the book, and, without any facets applied, get 
back facets like this:

Location
   Main
   Branch

Format
  Book
  DVD

And, I can narrow by those, and things work - though logically there's a hole 
that I'm trying to fill.

For example, suppose the user chooses to narrow by Location:Branch and 
Format:DVD.  I still get a hit back - but I don't want one, because there isn't 
a child record that has both of those values.  (The user is looking for DVDs 
at the Branch library, but the only DVD is at Main).
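
The hole can be reproduced outside Solr with a few lines of Python. This is 
only a sketch of the logic, assuming the parent document ends up with two 
independent multi-valued fields; the function names are made up for the 
illustration.

```python
# The three child records from the example above: (Location, Format).
children = [("Main", "Book"), ("Main", "DVD"), ("Branch", "Book")]


def flatten_to_parent(children):
    """Collapse child records into multi-valued parent fields, losing the pairing."""
    return {
        "Location": sorted({loc for loc, _ in children}),
        "Format": sorted({fmt for _, fmt in children}),
    }


def parent_matches(parent, location, fmt):
    """Mimic narrowing by Location and Format against the flattened parent."""
    return location in parent["Location"] and fmt in parent["Format"]


def child_matches(children, location, fmt):
    """The behaviour actually wanted: one child must carry both values."""
    return (location, fmt) in children


parent = flatten_to_parent(children)
print(parent_matches(parent, "Branch", "DVD"))   # flattened parent check
print(child_matches(children, "Branch", "DVD"))  # per-child check
```

The first check prints True (the unwanted hit) and the second prints False, 
because flattening keeps which values exist but not which values belong 
together on one child.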


I'm completely controlling both the indexing and searching side code - i.e. I 
can formulate any type of document content I want to be indexed, and I can 
parse the results before presenting them to the end users.


One approach I've been thinking of is a brute force method of accomplishing 
this, using Facets and using the facet.prefix parameter in the query.

So - I could generate 'facets' like this:

Location_facet
  Main
  Branch

Format_facet
  Book
  DVD

Location-Format_facet
  Main-Book
  Main-DVD
  Branch-Book

Format-Location_facet
  Book-Main
  Book-Branch
  DVD-Main

When narrowing by a single facet (e.g. Location:Branch), it would be a usual 
facet search.  Something like:

   &fq=Location_facet:"Branch"

and I would request back facets like this

  
  &facet=true&facet.mincount=1&facet.field=Location-Format_facet&facet.prefix=Branch-

and then parse out the values returned in the Location-Format_facet to retrieve 
what follows the "Branch-" prefix, and those would be the facet values for the 
'Format' facet presented to the users (so only 'Book' remains as a value).
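
Roughly, the index-time and client-side pieces of that could look like the 
Python sketch below. It is a paper-napkin illustration, not tested against 
Solr: the helper names are hypothetical, the "-" separator is taken from the 
example values above, and the prefix is sent without surrounding quotes on the 
assumption that facet.prefix is matched as a literal string.

```python
from urllib.parse import urlencode

SEP = "-"  # separator used in the paired facet values above


def paired_values(children):
    """Index time: values for the Location-Format_facet field of one parent."""
    return sorted({f"{loc}{SEP}{fmt}" for loc, fmt in children})


def narrow_params(location):
    """Query time: parameters for narrowing by one Location value."""
    return urlencode([
        ("fq", f'Location_facet:"{location}"'),
        ("facet", "true"),
        ("facet.mincount", "1"),
        ("facet.field", "Location-Format_facet"),
        ("facet.prefix", f"{location}{SEP}"),
    ])


def formats_for(location, facet_values):
    """Client side: strip the prefix to recover the remaining Format values."""
    prefix = f"{location}{SEP}"
    return [v[len(prefix):] for v in facet_values if v.startswith(prefix)]


children = [("Main", "Book"), ("Main", "DVD"), ("Branch", "Book")]
values = paired_values(children)
print(narrow_params("Branch"))
print(formats_for("Branch", values))
```

One thing the sketch glosses over: if a Location or Format value can itself 
contain the separator, the parsing becomes ambiguous, so a separator that never 
occurs in the data would be needed.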

So - with only 2 fields, it's pretty straightforward.

(It could be somewhat simplified from the above down to two facet fields 
instead of four - just keeping the paired facets, and not using the singleton 
facets, retrieving just one of those paired fields when no limiting is taking 
place, and parsing out the pairs for the Location and Format facets, and then 
when limiting on one element use facet.prefix, when limiting on both, again 
just choose one of the facets and look for the concatenated value...)


However, it gets more complex as I ramp up to 5 fields.  Generally it requires 
n! individual facet fields, where 'n' is the number of underlying fields, i.e. 
with two fields, there are two facet type fields needed in the Solr/Lucene 
index to support this.  With three fields, I could do this with 6 facets.  
With 5 fields, there would be 5! = 120 required facets.  That's 
getting a bit much... :)   Hmmm...   A little scribbling (ok, a fair bit of 
scribbling), and I can actually reduce that to 12 facet fields to cover the 5 
fields.   So, maybe that's not all that bad...  Interesting...
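
The n! figure can be checked by enumerating every ordering of the underlying 
fields, one concatenated facet field per ordering. In the Python sketch below 
only Location and Format come from the example; the other field names are 
placeholders, and the sketch does not attempt to reproduce the reduction to 12 
fields.

```python
from itertools import permutations
from math import factorial


def ordered_facet_fields(fields):
    """One concatenated facet field name per ordering of the underlying fields."""
    return ["-".join(p) + "_facet" for p in permutations(fields)]


two = ["Location", "Format"]
three = two + ["Field3"]                      # placeholder name
five = three + ["Field4", "Field5"]           # placeholder names

print(ordered_facet_fields(two))
for fields in (two, three, five):
    names = ordered_facet_fields(fields)
    # The number of orderings is n! for n underlying fields.
    print(len(fields), len(names), factorial(len(fields)))
```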

(I haven't actually coded up anything yet - this is all a paper-napkin level 
exercise...)



The other thing I've done is perused various archived threads and some upcoming 
functionality regarding parent / child or hierarchical document strategies.  
But, I haven't found anything that would help me out much - at least not 
directly.

I saw the Jira  LUCENE-2454<https://issues.apache.org/jira/browse/LUCENE-2454>  
(Nested Document Query Support), which looks from the slides overview to be 
structurally just what I would want - but indicates that there isn't the Query 
Parser support in place yet...  (I.e. how to do a Solr query being able to 
relate child level queries either within the base query, or in the fq clause...)


So - my question (finally :)) is - does this logical problem seem resolvable, 
with an approach other than the brute force outlined above?  I'm willing to 
dive into the Solr / Lucene code if that's what it will take - I'd just like an 
indication of what people think would be a good / possible approach before I 
get into that level...  e.g. some way of providing to the Indexer a tuple of 
each found combination of the 5 values, and then doing something (what?) with 
searching for the facet queries.

Thanks!


Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com<http://www.sirsidynix.com/>
