Re: Question on highlighting of nested SpanQuery instances

Mark Miller Fri, 26 Feb 2010 07:42:28 -0800

Yeah, by all means open a JIRA issue. If you can get the old tests to pass as well as your new test, that would be fantastic.


On 02/26/2010 10:32 AM, Goddard, Michael J. wrote:

Mark,

After making some changes to a few classes,

M src/java/org/apache/lucene/search/spans/TermSpans.java
M contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java
M contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java
M contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTerm.java

the JUnit test below passes. I'm seeing some issues with other tests which I'll have to take care of, and I'm not yet sure how I'll deal with Spans instances (as opposed to NearSpansOrdered, NearSpansUnordered, and TermSpans), since it's an abstract class and I can't call getSubSpans() on that. I was thinking I ought to open a Jira issue for this, attach the current patch, and just keep working. Does this sound like something other users might find useful?
 Mike


-----Original Message-----
From: java-dev-return-46947-michael.j.goddard=saic....@lucene.apache.org on behalf of Mark Miller
Sent: Mon 2/22/2010 3:41 PM
To: [email protected]
Subject: Re: Question on highlighting of nested SpanQuery instances
I played with it some time back, but I don't have any code left from that exercise.
It's fairly tricky.

Take your example:

> SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
> new SpanTermQuery(new Term(fieldName, "lucene")),
> new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
>
> Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
> new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);

First you see the top-level SpanNearQuery - you want to recurse in and just work with the "lucene within 5 of doug, ordered" part. But you can't actually work with that alone. That whole span also has to be within 4 of hadoop, ordered ... so how do you constrain the sub-highlighting? Let's say you do it somehow.
Now you recurse in and want to highlight hadoop - but again, not every hadoop - only the hadoops that are within 4, ordered, of the first Span.
So that's really the issue - you want to break up the Span and highlight recursively - but you can't really break them up and maintain all of the positional restrictions required.
So another possible option that gets a little messier might be: when extracting the allowable positions for a term (which it does by checking the start and end of the span), you might also run each inner span that contains that term, and then intersect the positions you find that way with the positions found with the overall span and use that list as the allowable positions. That could get kind of complicated though, especially taking into account the logic of the SpanOr and SpanNot queries.
- Mark
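
A minimal sketch of the intersection idea described above, in plain Java rather than against the Lucene API. It assumes spans are given as [start, end) position pairs with an exclusive end, and that term positions are already known; the class and method names are hypothetical and the numbers come from whitespace-tokenizing the test sentence in this thread.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Lucene.
class PositionIntersectionDemo {

    /** Positions of one term that fall inside any [start, end) span in the list. */
    static Set<Integer> positionsInside(int[] termPositions, int[][] spans) {
        Set<Integer> out = new TreeSet<Integer>();
        for (int pos : termPositions) {
            for (int[] span : spans) {
                if (pos >= span[0] && pos < span[1]) {
                    out.add(pos);
                    break;
                }
            }
        }
        return out;
    }

    /** Allowable positions: inside an overall span AND inside an inner span that contains the term. */
    static Set<Integer> allowablePositions(int[] termPositions, int[][] outerSpans, int[][] innerSpans) {
        Set<Integer> allowed = positionsInside(termPositions, outerSpans);
        allowed.retainAll(positionsInside(termPositions, innerSpans));
        return allowed;
    }

    public static void main(String[] args) {
        int[] lucenePositions = {1, 8};   // "lucene" occurs at token positions 1 and 8
        int[][] outerSpans = {{1, 11}};   // overall span: first "lucene" through "hadoop"
        int[][] innerSpans = {{1, 6}};    // inner span: "lucene" ... "doug"
        System.out.println(allowablePositions(lucenePositions, outerSpans, innerSpans));
    }
}
```

With these inputs the second "lucene" at position 8 is dropped because it lies in the overall span but in no inner span. The sketch only covers a term that belongs to an inner span; it does not attempt the SpanOr and SpanNot cases mentioned above.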

On 02/22/2010 03:15 PM, Goddard, Michael J. wrote:

 Mark,
 Thanks a lot for the insight. I'm working with this today, diving into the WeightedSpanTermExtractor class and fiddling with it. If you ever did have any code which attempted to recurse into these structures, I'd be happy to get my hands on it.
 Thanks again.

 Mike



 -----Original Message-----
 From: Mark Miller [mailto:[email protected]]
 Sent: Mon 2/22/2010 9:15 AM
 To: [email protected]
 Cc: Goddard, Michael J.
 Subject: Re: Question on highlighting of nested SpanQuery instances
 Hey Michael - this is currently just a limitation of the Span highlighter. It does a bit of fudging when determining what a good position is - if a term from the text is found within the span of a SpanQuery it is in (no matter how deeply nested), the highlighter makes a guess that the term should be highlighted - this is because we don't have the actual positions of each term - just the positions of the start and end of the span. In almost all cases this works as you would expect - but when nesting spans like this, you can get spurious results within the overall span.
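
To make the guess described above concrete, here is a minimal plain-Java sketch; it is not the highlighter's actual code. It assumes whitespace tokenization with one position per token and an exclusive span end, and the class and method names are hypothetical. Using the test sentence from this thread and an overall span running from the first "lucene" to "hadoop", every query term inside that window is accepted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the span-window check; not Lucene code.
class SpanWindowHighlightDemo {

    /**
     * Returns the token positions that get highlighted when every query
     * term falling inside [spanStart, spanEnd) is accepted.
     */
    static List<Integer> highlightedPositions(String text, Set<String> queryTerms,
                                              int spanStart, int spanEnd) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\s+");
        List<Integer> hits = new ArrayList<Integer>();
        for (int pos = 0; pos < tokens.length; pos++) {
            boolean insideSpan = pos >= spanStart && pos < spanEnd;
            if (insideSpan && queryTerms.contains(tokens[pos])) {
                hits.add(pos);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The Lucene was made by Doug Cutting and Lucene great Hadoop was";
        Set<String> terms = new HashSet<String>(Arrays.asList("lucene", "doug", "hadoop"));
        // Overall span: first "lucene" (position 1) through "hadoop" (position 10), end exclusive.
        System.out.println(highlightedPositions(text, terms, 1, 11));
    }
}
```

Running main prints [1, 5, 8, 10]; position 8 is the second "Lucene", which sits inside the overall span even though it takes no part in the inner lucene/doug match.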
 So your idea that we should recurse into the Span is on the right track - but it just gets fairly complicated quick. Consider SpanNear(SpanNear(mark, miller, 3), SpanTerm(lucene), 4) - if we recurse in and grab the first SpanNear (mark, miller, 3), we can correctly highlight that - but then we will handle lucene by itself - so all lucene terms will be hit rather than the one within 4 of the first span. So you have to deal with SpanOr, SpanNear, SpanNot recursively, but then also handle when they are linked, either with each other or with a SpanTerm - and uh - it gets hard real fast. Hence the fuzziness that goes on now.
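
A rough sketch of why handling the SpanTerm clause by itself over-highlights, written in plain Java rather than against Lucene. The sample sentence, the ordered reading of the outer SpanNear, the exclusive span end, and all names here are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
class NaiveRecursionDemo {

    /** All positions of a term: what handling the SpanTerm clause by itself yields. */
    static List<Integer> allPositions(String[] tokens, String term) {
        List<Integer> out = new ArrayList<Integer>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                out.add(i);
            }
        }
        return out;
    }

    /** Positions at or after innerEnd (exclusive end) with at most slop tokens in between. */
    static List<Integer> withinSlopAfter(List<Integer> positions, int innerEnd, int slop) {
        List<Integer> out = new ArrayList<Integer>();
        for (int pos : positions) {
            if (pos >= innerEnd && pos - innerEnd <= slop) {
                out.add(pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens =
            "lucene helped when mark miller joined lucene and then years later lucene grew".split(" ");
        // Inner SpanNear(mark, miller, 3) matches positions 3..4, so its exclusive end is 5.
        List<Integer> naive = allPositions(tokens, "lucene");
        List<Integer> constrained = withinSlopAfter(naive, 5, 4);
        System.out.println("clause alone: " + naive);
        System.out.println("within 4 of inner span: " + constrained);
    }
}
```

In this sample the clause on its own yields positions 0, 6 and 11, while only position 6 satisfies the outer constraint. The sketch handles a single inner match; with several inner matches, or with SpanOr and SpanNot in the mix, the bookkeeping grows quickly.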
 There may be something we can do to improve things in the future, but it's kind of an accepted limitation at the moment - prob something we should add some doc about.

 - Mark

 Goddard, Michael J. wrote:
>
> Hello,
>
> I initially posted a version of this question to java-user, but think
> it's more of a java-dev question. I haven't yet been able to resolve
> why I'm seeing spurious highlighting in nested SpanQuery instances.
> To illustrate this, I added the code below to the HighlighterTest
> class in lucene_2_9_1:
>
> /*
> * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
> */
> public void testHighlightingNestedSpans2() throws Exception {
>
> String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
> //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
>
> String fieldName = "SOME_FIELD_NAME";
>
> SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
> new SpanTermQuery(new Term(fieldName, "lucene")),
> new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
>
> Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
> new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
>
> String expected = "The Lucene was made by Doug Cutting and Lucene great Hadoop was";
> //String expected = "The Lucene was made by Doug Cutting and the great Hadoop was";
>
> String observed = highlightField(query, fieldName, theText);
> System.out.println("Expected: \"" + expected + "\n" + "Observed: \""
> + observed);
>
> assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
> }
>
> Is this an issue that's arisen before? I've been reading through the
> source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor,
> Spans, and NearSpansOrdered, but haven't found the solution yet.
> Initially, I thought that the extractWeightedSpanTerms method in
> WeightedSpanTermExtractor should be called on each clause of a
> SpanNearQuery or SpanOrQuery, but that didn't get me too far.
>
> Any suggestions are welcome.
>
> Thanks.
>
> Mike
>


 --
 - Mark

http://www.lucidimagination.com