Nice example, Greg. Yes, I can see how this can become bewildering if you mix documents of different types that share a field. Keeping in mind that types are not represented in the index at all, and are just mental constructs we impose, it's not surprising that queries don't always respect the boundaries we have in mind.

-Mike

On 12/17/2014 8:24 PM, Gregory Dearing wrote:
Michael,

I think I understand your point.  I should mention that I'm just a user of
BlockJoin... a year ago, I was doing the same tests you are now and was
just trying to share my observations. :)

I think either rule would be reasonable, but I'd probably prefer that it throw
an exception... this helps provide some structure in a mechanic that already
gets a little crazy.

Also... your approach sounds fine, but I'd still like to suggest that best
practice is to ensure subqueries can only match one 'type' of document.
This becomes important if you have anything more complex than a flat
hierarchy.

As an example, suppose you had the following relationships...

1.) A 'Book' has 'Chapters', which have 'Paragraphs', which are searched
via the 'text' field.
2.) A 'Book' may have an 'Appendix', which is searched via the 'text' field.

A query like "((text:apple JoinTo type:chapter) AND (text:tree JoinTo
type:chapter)) JoinTo type:book" would seem like a reasonable way to find
books where both terms occurred in the same chapter.

But what happens is that the term queries will hit Appendices, each of
which will be joined to the first Chapter in the next Book.  The search
might return Books whose first chapter has the word 'tree', because the
previous book's Appendix had the word 'apple'. :)
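
To make the failure above concrete, here is a minimal sketch in plain Java (no Lucene classes). It assumes a hypothetical seven-document index in which each block is laid out children first, and it models a join as "closest parent docId larger than the match". The class and method names are made up for illustration, and a real join implementation does more than this.

```java
import java.util.BitSet;

// Hypothetical simulation of a two-level block join; not Lucene code.
final class AppendixLeakSketch {

    // Maps every matching docId to the closest parent docId that is larger.
    static BitSet joinToParent(BitSet matches, BitSet parents) {
        BitSet joined = new BitSet();
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            int parent = parents.nextSetBit(doc + 1);
            if (parent >= 0) {
                joined.set(parent);
            }
        }
        return joined;
    }

    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    // ((apple JoinTo chapter) AND (tree JoinTo chapter)) JoinTo book
    static BitSet booksMatching(BitSet apple, BitSet tree, BitSet chapters, BitSet books) {
        BitSet sameChapter = joinToParent(apple, chapters);
        sameChapter.and(joinToParent(tree, chapters));
        return joinToParent(sameChapter, books);
    }

    public static void main(String[] args) {
        // Assumed layout:
        //   0 paragraph (Book A, chapter 1)   1 chapter   2 appendix ("apple")   3 Book A
        //   4 paragraph ("tree")              5 chapter   6 Book B
        BitSet paragraphs = bits(0, 4);
        BitSet chapters = bits(1, 5);
        BitSet books = bits(3, 6);
        BitSet apple = bits(2); // hits Book A's appendix only
        BitSet tree = bits(4);  // hits a paragraph in Book B

        // The appendix is joined to the next chapter (doc 5), so Book B is returned.
        System.out.println(booksMatching(apple, tree, chapters, books)); // {6}

        // Restricting the subquery to paragraphs removes the false match.
        BitSet appleInParagraphs = (BitSet) apple.clone();
        appleInParagraphs.and(paragraphs);
        System.out.println(booksMatching(appleInParagraphs, tree, chapters, books)); // {}
    }
}
```

Note that no "child equals parent" check would fire here, because the appendix is not a member of the chapter parent set; it is simply attributed to the wrong chapter.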

It's equally possible to accidentally create a 'ToUncleJoin' or
'ToCousinJoin'.

Just my two cents,
Greg


On Tue, Dec 16, 2014 at 8:42 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:
Looking at the code, there is an explicit check: if (childId == parentId),
throw an exception.
It seems to me that the logic *could* instead be: if (childId ==
parentId), accumulate the parentId as if it were a child *and*
terminate the block.
In your phraseology, we could change "A child's parent is the closest
parent docId that is larger than the child's docId." to "A child's parent
is the closest parent docId that is greater than or equal to the child's
docId."
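
As a rough illustration of the "greater than or equal to" rule proposed above, the following plain-Java sketch groups matching docIds under their parents using an inclusive lookup. This is a hypothetical rule, not what the existing code does, and the class name, method name, and docIds are invented for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the proposed inclusive rule; not Lucene code.
final class SelfParentRuleSketch {

    // Groups each match under the closest parent docId that is >= the match,
    // so a matching parent is accumulated as if it were its own child.
    static Map<Integer, List<Integer>> groupInclusive(BitSet parents, int[] matches) {
        Map<Integer, List<Integer>> byParent = new TreeMap<>();
        for (int doc : matches) {
            int parent = parents.nextSetBit(doc); // inclusive lookup
            if (parent >= 0) {
                byParent.computeIfAbsent(parent, k -> new ArrayList<>()).add(doc);
            }
        }
        return byParent;
    }

    public static void main(String[] args) {
        // Assumed layout: docs 0-2 are children of parent 3; docs 4-5 are children of parent 6.
        BitSet parents = new BitSet();
        parents.set(3);
        parents.set(6);

        // Doc 3 is itself a parent and is grouped under itself rather than under doc 6.
        System.out.println(groupInclusive(parents, new int[] {1, 3, 5})); // {3=[1, 3], 6=[5]}
    }
}
```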
I agree that the solution to my problem (without changing Lucene) is to
index the parent doc fields in a new child doc (we use a docType field to
distinguish -- changing the names of all the fields at this point would be
kind of painful).  But I was just curious whether there was any reason in
principle that a doc could not be its own parent.
-Mike



On 12/16/2014 8:20 PM, Gregory Dearing wrote:
Michael,

Note that the index doesn't contain any special information about
block-join relationships... it uses a convention that child docs are
indexed before parent docs (ie. the root doc in each hierarchy has the
largest docId in its block).

This means that it can 'join' to parents just by comparing child
docIds (from the subquery set) to the set of parent docIds.  A child's
parent is the closest parent docId that is larger than the child's
docId.

That explanation is all just to say... if your subquery matched a
parent and was then joined to the parent set without an exception being
thrown, the resulting answer would be in the NEXT BOOK.  (The closest
docId in the parent set that is larger than a parent's docId will be
from another document block.)
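
The lookup described above can be sketched in a few lines of plain Java. This is only a model of the convention, assuming a hypothetical layout of two blocks and using java.util.BitSet to stand in for the parent set; the names are made up and the real implementation differs in detail.

```java
import java.util.BitSet;

// Hypothetical model of the "closest larger parent docId" convention; not Lucene code.
final class ParentLookupSketch {

    // Returns the smallest parent docId strictly greater than docId, or -1 if there is none.
    static int parentOf(BitSet parents, int docId) {
        return parents.nextSetBit(docId + 1);
    }

    public static void main(String[] args) {
        // Assumed layout: docs 0-2 are chapters of book 3; docs 4-5 are chapters of book 6.
        BitSet parents = new BitSet();
        parents.set(3);
        parents.set(6);

        System.out.println(parentOf(parents, 1)); // 3: a chapter joins to its own book
        System.out.println(parentOf(parents, 3)); // 6: a matching parent joins to the NEXT book
        System.out.println(parentOf(parents, 6)); // -1: no later parent exists
    }
}
```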

I would suggest using different field names for each level of a block
hierarchy, just so you can be sure what level your original query
actually hits.  You could accomplish the same by adding a 'docType'
field.

In your case, you might consider pushing your 'Book' level fields into
a special child doc.  For example, your Book document could have no
searchable fields; its children could include both 'Chapter' child
docs and also a 'BookMetadata' child doc.

-Greg




On Tue, Dec 16, 2014 at 10:42 AM, Michael Sokolov
<msoko...@safaribooksonline.com> wrote:
OK - I see, looking at the code, that an exception is thrown if a parent
doc matches the subquery -- so that explains what will happen, but I guess
my further question is -- is that necessary? Could we just not throw an
exception there?

-Mike


On 12/16/2014 10:38 AM, Michael Sokolov wrote:
I see in the docs of ToParentBlockJoinQuery that:

   * The child documents must be orthogonal to the parent
   * documents: the wrapped child query must never
   * return a parent document.

First, it would be helpful if the docs explained what would happen if
that assumption were violated.

Second, I want to do that!

My parent documents have the same fields as their child documents
(title, text, etc.): in some cases the best match for a query is the entire
book (i.e. a query for "Java Programming"); in other cases it is a specific
chapter (a query for "Java regular expressions").

Currently I am using Solr grouping queries to roll up parent and child,
but I am hoping to get a performance boost by using the parent/child
indexing, which is a natural fit for us since we always index a book at a
time.
If need be, I will simply index a child document that represents the
parent (i.e. duplicate the parent document but with a different type so
as to exclude it from the join subquery), but is this really necessary? If
so, can you explain why?


Thanks

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
