[jira] [Comment Edited] (LUCENE-5205) SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

Tim Allison (JIRA) Tue, 08 Sep 2015 08:36:16 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734997#comment-14734997
 ]


Tim Allison edited comment on LUCENE-5205 at 9/8/15 3:34 PM:
-------------------------------------------------------------

This looks like a genuine issue in the Highlighter.  I was hoping that it was 
LUCENE-5503 so that would get some attention, but I don't think it is.

This is the minimal code to show the problem:
{code}
  @Test
  public void testEmbeddedSpanNearHighlighterIssue() throws Exception {
    String field = "f";
    Analyzer analyzer = new StandardAnalyzer();
    String text = "b c d";

//    The same query in SpanQueryParser syntax:
//    SpanQueryParser p = new SpanQueryParser(field, analyzer);
//    Query q = p.parse("\"(b [c z]) d\"~2");
    SpanQuery cz = new SpanNearQuery(
        new SpanQuery[]{
            new SpanTermQuery(new Term(field, "c")),
            new SpanTermQuery(new Term(field, "z"))
        }, 0, true
    );
    SpanQuery bcz = new SpanOrQuery(
        new SpanTermQuery(new Term(field, "b")),
            cz);
    SpanQuery q = new SpanNearQuery(
        new SpanQuery[]{
            bcz,
            new SpanTermQuery(new Term(field, "d"))
        }, 2, false
    );
    QueryScorer scorer = new QueryScorer(q, field);
    scorer.setExpandMultiTermQuery(true);


    Fragmenter fragmenter = new SimpleFragmenter(1000);

    Highlighter highlighter = new Highlighter(
        new SimpleHTMLFormatter(),
        new SimpleHTMLEncoder(),
        scorer);
    highlighter.setTextFragmenter(fragmenter);
    String[] snippets = highlighter.getBestFragments(analyzer,
        field, text,
        3);
    assertEquals(1, snippets.length);
    // "z" is not in the text, so [c z] cannot match and "c" should not
    // be highlighted.
    assertFalse(snippets[0].contains("<B>c</B>"));
  }
{code}

This problem does not happen if "c" comes before "b" or after "d" in the text: 
"c b d" or "b d c".

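To make the expected result concrete, here is a minimal sketch in plain Java (no Lucene dependency) of which token positions ought to be highlighted for the query above. SpanModel, term, or, near, highlighted and issueQuery are hypothetical names made up for this illustration, and the matching rules are a simplified model of span semantics, not Lucene's actual span or Highlighter implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Simplified, hypothetical model of span matching; NOT Lucene's implementation.
final class SpanModel {
  /** A query node yields matches; each match is the set of token positions it uses. */
  interface Node { List<TreeSet<Integer>> matches(String[] tokens); }

  static Node term(String t) {
    return tokens -> {
      List<TreeSet<Integer>> out = new ArrayList<>();
      for (int i = 0; i < tokens.length; i++) {
        if (tokens[i].equals(t)) {
          TreeSet<Integer> m = new TreeSet<>();
          m.add(i);
          out.add(m);
        }
      }
      return out;
    };
  }

  static Node or(Node... clauses) {
    return tokens -> {
      List<TreeSet<Integer>> out = new ArrayList<>();
      for (Node c : clauses) out.addAll(c.matches(tokens));
      return out;
    };
  }

  static Node near(int slop, boolean inOrder, Node... clauses) {
    return tokens -> {
      // Build every combination that takes one match from each clause.
      List<List<TreeSet<Integer>>> combos = new ArrayList<>();
      combos.add(new ArrayList<>());
      for (Node c : clauses) {
        List<List<TreeSet<Integer>>> next = new ArrayList<>();
        for (List<TreeSet<Integer>> combo : combos) {
          for (TreeSet<Integer> m : c.matches(tokens)) {
            List<TreeSet<Integer>> ext = new ArrayList<>(combo);
            ext.add(m);
            next.add(ext);
          }
        }
        combos = next;
      }
      List<TreeSet<Integer>> out = new ArrayList<>();
      for (List<TreeSet<Integer>> combo : combos) {
        TreeSet<Integer> all = new TreeSet<>();
        int covered = 0;   // total width of the sub-matches
        int prevEnd = -1;  // last position of the previous sub-match
        boolean ok = true;
        for (TreeSet<Integer> m : combo) {
          if (inOrder && m.first() <= prevEnd) ok = false;  // out of order
          prevEnd = m.last();
          covered += m.last() - m.first() + 1;
          for (int p : m) if (!all.add(p)) ok = false;      // overlapping sub-matches
        }
        if (all.isEmpty()) continue;
        // Assumption: slop is the number of positions in the overall span
        // that are not covered by any sub-match.
        int gaps = all.last() - all.first() + 1 - covered;
        if (ok && gaps <= slop) out.add(all);
      }
      return out;
    };
  }

  /** Union of all positions that take part in at least one top-level match. */
  static TreeSet<Integer> highlighted(Node q, String[] tokens) {
    TreeSet<Integer> out = new TreeSet<>();
    for (TreeSet<Integer> m : q.matches(tokens)) out.addAll(m);
    return out;
  }

  /** Mirrors the structure of the query in the test: "(b [c z]) d"~2. */
  static Node issueQuery() {
    Node cz = near(0, true, term("c"), term("z"));
    Node bcz = or(term("b"), cz);
    return near(2, false, bcz, term("d"));
  }

  public static void main(String[] args) {
    // Prints [0, 2]: positions of "b" and "d"; "c" (position 1) is not included.
    System.out.println(highlighted(issueQuery(), "b c d".split(" ")));
  }
}
```

Under this model only "b" and "d" take part in a match for "b c d", because the [c z] branch has no match without a "z". This is the behaviour that the assertFalse in the test expects from the Highlighter.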

was (Author: talli...@mitre.org):
This looks like a genuine issue in the Highlighter.  I was hoping that it was 
LUCENE-5503 so that would get some attention, but I don't think it is.

This is the minimal code to show the problem:
{code}
  @Test
  public void testEmbeddedSpanNearHighlighterIssue() throws Exception {
    String field = "f";
    Analyzer analyzer = new StandardAnalyzer();
    String text = "b c d";

//    SpanQueryParser p = new SpanQueryParser("f", analyzer);
//    Query q = p.parse("\"(b [c z]) d\"~2");
    SpanQuery cz = new SpanNearQuery(
        new SpanQuery[]{
            new SpanTermQuery(new Term(field, "c")),
            new SpanTermQuery(new Term(field, "z"))
        }, 0, true
    );
    SpanQuery bcz = new SpanOrQuery(
        new SpanTermQuery(new Term(field, "b")),
            cz);
    SpanQuery q = new SpanNearQuery(
        new SpanQuery[]{
            bcz,
            new SpanTermQuery(new Term(field, "d"))
        }, 2, false
    );
    QueryScorer scorer = new QueryScorer(q, "f");
    scorer.setExpandMultiTermQuery(true);


    Fragmenter fragmenter = new SimpleFragmenter(1000);

    Highlighter highlighter = new Highlighter(
        new SimpleHTMLFormatter(),
        new SimpleHTMLEncoder(),
        scorer);
    highlighter.setTextFragmenter(fragmenter);
    String[] snippets = highlighter.getBestFragments(analyzer,
        "f", text,
        3);
    assertEquals(1, snippets.length);
    assertFalse(snippets[0].contains("<B>c</B>"));
  }
{code}

This problem does not happen if "c" comes before "b" or after "d" in the text: 
"c b d" or "b d c".

> SpanQueryParser with recursion, analysis and syntax very similar to classic 
> QueryParser
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5205
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/queryparser
>            Reporter: Tim Allison
>              Labels: patch
>         Attachments: LUCENE-5205-cleanup-tests.patch, 
> LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, 
> LUCENE-5205_dateTestReInitPkgPrvt.patch, 
> LUCENE-5205_improve_stop_word_handling.patch, 
> LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, 
> SpanQueryParser_v1.patch.gz, patch.txt
>
>
> This parser extends QueryParserBase and includes functionality from:
> * Classic QueryParser: most of its syntax
> * SurroundQueryParser: recursive parsing for "near" and "not" clauses.
> * ComplexPhraseQueryParser: can handle "near" queries that include multiterms 
> (wildcard, fuzzy, regex, prefix)
> * AnalyzingQueryParser: has an option to analyze multiterms.
> At a high level, there's a first pass BooleanQuery/field parser and then a 
> span query parser handles all terminal nodes and phrases.
> Same as classic syntax:
> * term: test 
> * fuzzy: roam~0.8, roam~2
> * wildcard: te?t, test*, t*st
> * regex: /\[mb\]oat/
> * phrase: "jakarta apache"
> * phrase with slop: "jakarta apache"~3
> * default "or" clause: jakarta apache
> * grouping "or" clause: (jakarta apache)
> * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
> * multiple fields: title:lucene author:hatcher
>  
> Main additions in SpanQueryParser syntax vs. classic syntax:
> * Can require "in order" for phrases with slop with the \~> operator: 
> "jakarta apache"\~>3
> * Can specify "not near": "fever bieber"!\~3,10 ::
>     find "fever" but not if "bieber" appears within 3 words before or 10 
> words after it.
> * Fully recursive phrasal queries with \[ and \]; as in: \[\[jakarta 
> apache\]~3 lucene\]\~>4 :: 
>     find "jakarta" within 3 words of "apache", and that hit has to be within 
> four words before "lucene"
> * Can also use \[\] for single level phrasal queries instead of " as in: 
> \[jakarta apache\]
> * Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)"\~3 
> :: find "apache" and then either "lucene" or "solr" within three words.
> * Can use multiterms in phrasal queries: "jakarta\~1 ap*che"\~2
> * Did I mention full recursion: \[\[jakarta\~1 ap*che\]\~2 (solr~ 
> /l\[ou\]\+\[cs\]\[en\]\+/)]\~10 :: Find something like "jakarta" within two 
> words of "ap*che" and that hit has to be within ten words of something like 
> "solr" or that "lucene" regex.
> * Can require at least x number of hits at boolean level: apache AND (lucene 
> solr tika)~2
> * Can use negative only query: -jakarta :: Find all docs that don't contain 
> "jakarta"
> * Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of 
> potential performance issues!).
> Trivial additions:
> * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance =1, 
> prefix =2)
> * Can specify Optimal String Alignment (OSA) vs Levenshtein for distance 
> <=2: jakarta~1 (OSA) vs jakarta~>1 (Levenshtein)
> This parser can be very useful for concordance tasks (see also LUCENE-5317 
> and LUCENE-5318) and for analytical search.  
> Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.
> Most of the documentation is in the javadoc for SpanQueryParser.
> Any and all feedback is welcome.  Thank you.
> Until this is added to the Lucene project, I've added a standalone 
> lucene-addons repo (with jars compiled for the latest stable build of Lucene) 
>  on [github|https://github.com/tballison/lucene-addons].
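
The "not near" operator in the description above ("fever bieber"!~3,10) can be illustrated with a small plain-Java sketch. NotNearModel and notNear are hypothetical names for this illustration only; treating both windows as inclusive (exactly 3 positions before through exactly 10 positions after) is an assumption, and this is not the parser's or Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of "include, but not if exclude is within pre words
// before or post words after it". Not the SpanQueryParser implementation.
final class NotNearModel {
  /** Returns the positions of `include` that have no `exclude` in the window. */
  static List<Integer> notNear(String[] tokens, String include, String exclude,
                               int pre, int post) {
    List<Integer> hits = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
      if (!tokens[i].equals(include)) continue;
      // Assumption: the window is inclusive on both sides, clipped to the text.
      int from = Math.max(0, i - pre);
      int to = Math.min(tokens.length - 1, i + post);
      boolean blocked = false;
      for (int j = from; j <= to && !blocked; j++) {
        blocked = j != i && tokens[j].equals(exclude);
      }
      if (!blocked) hits.add(i);
    }
    return hits;
  }

  public static void main(String[] args) {
    // "bieber" is 3 positions before "fever", so the hit is rejected: prints []
    System.out.println(notNear("bieber x x fever".split(" "), "fever", "bieber", 3, 10));
    // "bieber" is 4 positions before "fever", outside the window: prints [4]
    System.out.println(notNear("bieber x x x fever".split(" "), "fever", "bieber", 3, 10));
  }
}
```

The two numbers are independent, so a query can be strict about what follows a term while being lenient about what precedes it.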



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
