[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734997#comment-14734997 ]
Tim Allison edited comment on LUCENE-5205 at 9/8/15 3:34 PM:
-------------------------------------------------------------

This looks like a genuine issue in the Highlighter. I was hoping that it was LUCENE-5503, so that it would get some attention, but I don't think it is.

This is the minimal code to show the problem:
{code}
@Test
public void testEmbeddedSpanNearHighlighterIssue() throws Exception {
    String field = "f";
    Analyzer analyzer = new StandardAnalyzer();
    String text = "b c d";
    // SpanQueryParser p = new SpanQueryParser(field, analyzer);
    // Query q = p.parse("\"(b [c z]) d\"~2");

    // [c z]: "c" immediately followed by "z"; this cannot match the text
    SpanQuery cz = new SpanNearQuery(
        new SpanQuery[]{
            new SpanTermQuery(new Term(field, "c")),
            new SpanTermQuery(new Term(field, "z"))
        }, 0, true
    );
    // (b [c z]): either "b" or the [c z] phrase
    SpanQuery bcz = new SpanOrQuery(
        new SpanTermQuery(new Term(field, "b")),
        cz);
    // "(b [c z]) d"~2: the "or" clause within two positions of "d", in any order
    SpanQuery q = new SpanNearQuery(
        new SpanQuery[]{
            bcz,
            new SpanTermQuery(new Term(field, "d"))
        }, 2, false
    );

    QueryScorer scorer = new QueryScorer(q, field);
    scorer.setExpandMultiTermQuery(true);
    Fragmenter fragmenter = new SimpleFragmenter(1000);
    Highlighter highlighter = new Highlighter(
        new SimpleHTMLFormatter(),
        new SimpleHTMLEncoder(), scorer);
    highlighter.setTextFragmenter(fragmenter);
    String[] snippets = highlighter.getBestFragments(analyzer, field, text, 3);
    assertEquals(1, snippets.length);
    // "c" only occurs in the [c z] clause, which never matches, so it should not be highlighted
    assertFalse(snippets[0].contains("<B>c</B>"));
}
{code}

This problem does not happen if "c" comes before "b" or after "d" in the text: "c b d" or "b d c".

was (Author: talli...@mitre.org):
This looks like a genuine issue in the Highlighter. I was hoping that it was LUCENE-5503 so that would get some attention, but I don't think it is.
This is the minimal code to show the problem:
{code}
@Test
public void testEmbeddedSpanNearHighlighterIssue() throws Exception {
    String field = "f";
    Analyzer analyzer = new StandardAnalyzer();
    String text = "b c d";
    // SpanQueryParser p = new SpanQueryParser("f", analyzer);
    // Query q = p.parse("\"(b [c z]) d\"~2");
    SpanQuery cz = new SpanNearQuery(
        new SpanQuery[]{
            new SpanTermQuery(new Term(field, "c")),
            new SpanTermQuery(new Term(field, "z"))
        }, 0, true
    );
    SpanQuery bcz = new SpanOrQuery(
        new SpanTermQuery(new Term(field, "b")),
        cz);
    SpanQuery q = new SpanNearQuery(
        new SpanQuery[]{
            bcz,
            new SpanTermQuery(new Term(field, "d"))
        }, 2, false
    );
    QueryScorer scorer = new QueryScorer(q, "f");
    scorer.setExpandMultiTermQuery(true);
    Fragmenter fragmenter = new SimpleFragmenter(1000);
    Highlighter highlighter = new Highlighter(
        new SimpleHTMLFormatter(),
        new SimpleHTMLEncoder(), scorer);
    highlighter.setTextFragmenter(fragmenter);
    String[] snippets = highlighter.getBestFragments(analyzer, "f", text, 3);
    assertEquals(1, snippets.length);
    assertFalse(snippets[0].contains("<B>c</B>"));
}
{code}

This problem does not happen if "c" comes before "b" or after "d" in the text: "c b d" or "b d c".

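To make the expected result concrete, here is a rough, Lucene-free sketch of the span semantics the test relies on. It is a toy model, not the Highlighter's or SpanNearQuery's implementation: the class SpanSketch and all of its methods are hypothetical names, the "near" check handles only two clauses, and it simply skips overlapping spans. Under those assumptions, only "b" and "d" take part in a match for "b c d", which is what the assertFalse above expects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of span matching over token positions; not Lucene code. */
class SpanSketch {

    /** A match covering [start, end) plus the positions of the terms that produced it. */
    static final class Span {
        final int start;
        final int end;
        final TreeSet<Integer> hits = new TreeSet<>();

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** One span per occurrence of the term in the token array. */
    static List<Span> term(String[] tokens, String term) {
        List<Span> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                Span s = new Span(i, i + 1);
                s.hits.add(i);
                out.add(s);
            }
        }
        return out;
    }

    /** "Or": the union of the spans of both clauses. */
    static List<Span> or(List<Span> a, List<Span> b) {
        List<Span> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    /** Two-clause "near": the gap between the two spans must not exceed slop. */
    static List<Span> near(List<Span> a, List<Span> b, int slop, boolean inOrder) {
        List<Span> out = new ArrayList<>();
        for (Span x : a) {
            for (Span y : b) {
                Span first = x.start <= y.start ? x : y;
                Span second = first == x ? y : x;
                if (inOrder && first != x) continue;   // clause order violated
                int gap = second.start - first.end;
                if (gap < 0 || gap > slop) continue;   // overlapping or too far apart
                Span m = new Span(first.start, Math.max(first.end, second.end));
                m.hits.addAll(x.hits);
                m.hits.addAll(y.hits);
                out.add(m);
            }
        }
        return out;
    }

    /** Terms that take part in at least one top-level match, in text order. */
    static List<String> highlighted(String[] tokens, List<Span> matches) {
        TreeSet<Integer> positions = new TreeSet<>();
        for (Span m : matches) positions.addAll(m.hits);
        List<String> out = new ArrayList<>();
        for (int p : positions) out.add(tokens[p]);
        return out;
    }

    /** Evaluates the toy equivalent of "(b [c z]) d"~2 against whitespace-split text. */
    static List<String> run(String text) {
        String[] t = text.split(" ");
        List<Span> cz = near(term(t, "c"), term(t, "z"), 0, true);
        List<Span> bcz = or(term(t, "b"), cz);
        List<Span> q = near(bcz, term(t, "d"), 2, false);
        return highlighted(t, q);
    }

    public static void main(String[] args) {
        System.out.println(run("b c d"));
    }
}
```

Because the [c z] clause yields no spans, "c" never reaches the top-level match in this model, regardless of where it sits relative to "b" and "d".
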
> SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5205
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/queryparser
>            Reporter: Tim Allison
>              Labels: patch
>         Attachments: LUCENE-5205-cleanup-tests.patch, LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, LUCENE-5205_dateTestReInitPkgPrvt.patch, LUCENE-5205_improve_stop_word_handling.patch, LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, SpanQueryParser_v1.patch.gz, patch.txt
>
>
> This parser extends QueryParserBase and includes functionality from:
> * Classic QueryParser: most of its syntax
> * SurroundQueryParser: recursive parsing for "near" and "not" clauses.
> * ComplexPhraseQueryParser: can handle "near" queries that include multiterms (wildcard, fuzzy, regex, prefix)
> * AnalyzingQueryParser: has an option to analyze multiterms.
> At a high level, there's a first pass BooleanQuery/field parser and then a span query parser handles all terminal nodes and phrases.
> Same as classic syntax:
> * term: test
> * fuzzy: roam~0.8, roam~2
> * wildcard: te?t, test*, t*st
> * regex: /\[mb\]oat/
> * phrase: "jakarta apache"
> * phrase with slop: "jakarta apache"~3
> * default "or" clause: jakarta apache
> * grouping "or" clause: (jakarta apache)
> * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
> * multiple fields: title:lucene author:hatcher
>
> Main additions in SpanQueryParser syntax vs. classic syntax:
> * Can require "in order" for phrases with slop with the \~> operator: "jakarta apache"\~>3
> * Can specify "not near": "fever bieber"!\~3,10 :: find "fever" but not if "bieber" appears within 3 words before or 10 words after it.
> * Fully recursive phrasal queries with \[ and \]; as in: \[\[jakarta apache\]~3 lucene\]\~>4 :: find "jakarta" within 3 words of "apache", and that hit has to be within four words before "lucene"
> * Can also use \[\] for single level phrasal queries instead of " as in: \[jakarta apache\]
> * Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)"\~3 :: find "apache" and then either "lucene" or "solr" within three words.
> * Can use multiterms in phrasal queries: "jakarta\~1 ap*che"\~2
> * Did I mention full recursion: \[\[jakarta\~1 ap*che\]\~2 (solr~ /l\[ou\]\+\[cs\]\[en\]\+/)]\~10 :: Find something like "jakarta" within two words of "ap*che" and that hit has to be within ten words of something like "solr" or that "lucene" regex.
> * Can require at least x number of hits at boolean level: apache AND (lucene solr tika)~2
> * Can use negative only query: -jakarta :: Find all docs that don't contain "jakarta"
> * Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of potential performance issues!).
> Trivial additions:
> * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance = 1, prefix = 2)
> * Can specify Optimal String Alignment (OSA) vs Levenshtein for distance <= 2: jakarta~1 (OSA) vs jakarta~>1 (Levenshtein)
> This parser can be very useful for concordance tasks (see also LUCENE-5317 and LUCENE-5318) and for analytical search.
> Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.
> Most of the documentation is in the javadoc for SpanQueryParser.
> Any and all feedback is welcome. Thank you.
> Until this is added to the Lucene project, I've added a standalone lucene-addons repo (with jars compiled for the latest stable build of Lucene) on [github|https://github.com/tballison/lucene-addons].
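
As a rough illustration of the "not near" rule described above ("fever bieber"!~3,10), the following plain-Java sketch filters occurrences of one term by the presence of another within a window. It is not the parser's implementation: NotNearSketch and notNear are hypothetical names, tokens are just whitespace-split words, and the window boundaries (inclusive, counted in token positions) are an assumption that may differ from how the parser or SpanNotQuery counts them.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of the "not near" rule; not the parser's implementation. */
class NotNearSketch {

    /**
     * Returns the positions of "include" that have no "exclude" token
     * within "pre" positions before or "post" positions after.
     */
    static List<Integer> notNear(String[] tokens, String include, String exclude,
                                 int pre, int post) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(include)) continue;
            // Window of positions to inspect around this occurrence (assumed inclusive)
            int from = Math.max(0, i - pre);
            int to = Math.min(tokens.length - 1, i + post);
            boolean blocked = false;
            for (int j = from; j <= to; j++) {
                if (j != i && tokens[j].equals(exclude)) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) keep.add(i);
        }
        return keep;
    }

    public static void main(String[] args) {
        String[] tokens = "bieber a fever b c d e fever".split(" ");
        // The first "fever" has "bieber" two positions before it; the second does not.
        System.out.println(notNear(tokens, "fever", "bieber", 3, 10));
    }
}
```
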