[jira] [Commented] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

Tim Allison (JIRA) Thu, 20 Feb 2014 10:53:01 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907304#comment-13907304
 ]


Tim Allison commented on LUCENE-5205:
-------------------------------------

Code duplication.  The biggest offenders are in test (I think, let me know if 
you disagree):
     1) TestSpanQPBasedonQPTestBase...I can try to refactor this to extend 
QPTestBase, but that will require some reworking of QPTestBase as, and I didn't 
want to touch that (hence the duplication).  It would also help to add a 
getQuery() to SpanMultitermQueryWrapper to test for equality...again, I didn't 
want to touch anything outside of the parser at the cost of duplication. 
     2) TestMultiAnalyzer...not sure how not to duplicate.  This relies on 
testing equality of string representations of queries.
     3) TestComplexPhraseQuery.  Should be straightforward to extend the 
original, but will need to make checkMatches public so that I can override it.  
I'll also have to move the tests with slightly different syntax into a 
different test, but that's easy and would help declutter.

There's other code duplication with AnalyzingQueryParser...should we break that 
functionality out into a helper class?

Any other major duplication areas?

Y, I don't like the reinit at all.  The reason that's there was so that I could 
extend QueryParserBase, but I'm not sure that that decision buys much anymore.  
As I remember, it buys date parsing in range queries (which I'm now not sure I 
actually want) and addBoolean; there may be more, but I'm not sure there is.

It would clean up a fair bit of code if I implement 
CommonQueryParserConfiguration instead of extending QueryParserBase.  I'd still 
have to leave in some things that don't make sense for the SpanQueryParser, 
though: lowerCaseExpandedTerms, enablePositionIncrements.  Another option would 
be to abandon CQPC, but I wanted this parser to at least implement that 
interface.  Let me know what makes sense.  

As for the public base classes, y, those can go private for now.  I made them 
public in case anyone wanted to extend them, but, as you point out, then I 
really ought to add javadocs and treat them as if they were public (which they 
are!).

As for date/locale issues, I'll take a look.    

> [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to 
> classic QueryParser
> -----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5205
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/queryparser
>            Reporter: Tim Allison
>              Labels: patch
>             Fix For: 4.7
>
>         Attachments: LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, 
> LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, 
> SpanQueryParser_v1.patch.gz, patch.txt
>
>
> This parser extends QueryParserBase and includes functionality from:
> * Classic QueryParser: most of its syntax
> * SurroundQueryParser: recursive parsing for "near" and "not" clauses.
> * ComplexPhraseQueryParser: can handle "near" queries that include multiterms 
> (wildcard, fuzzy, regex, prefix),
> * AnalyzingQueryParser: has an option to analyze multiterms.
> At a high level, there's a first pass BooleanQuery/field parser and then a 
> span query parser handles all terminal nodes and phrases.
> Same as classic syntax:
> * term: test 
> * fuzzy: roam~0.8, roam~2
> * wildcard: te?t, test*, t*st
> * regex: /\[mb\]oat/
> * phrase: "jakarta apache"
> * phrase with slop: "jakarta apache"~3
> * default "or" clause: jakarta apache
> * grouping "or" clause: (jakarta apache)
> * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
> * multiple fields: title:lucene author:hatcher
>  
> Main additions in SpanQueryParser syntax vs. classic syntax:
> * Can require "in order" for phrases with slop with the \~> operator: 
> "jakarta apache"\~>3
> * Can specify "not near": "fever bieber"!\~3,10 ::
>     find "fever" but not if "bieber" appears within 3 words before or 10 
> words after it.
> * Fully recursive phrasal queries with \[ and \]; as in: \[\[jakarta 
> apache\]~3 lucene\]\~>4 :: 
>     find "jakarta" within 3 words of "apache", and that hit has to be within 
> four words before "lucene"
> * Can also use \[\] for single level phrasal queries instead of " as in: 
> \[jakarta apache\]
> * Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)"\~3 
> :: find "apache" and then either "lucene" or "solr" within three words.
> * Can use multiterms in phrasal queries: "jakarta\~1 ap*che"\~2
> * Did I mention full recursion: \[\[jakarta\~1 ap*che\]\~2 (solr~ 
> /l\[ou\]\+\[cs\]\[en\]\+/)]\~10 :: Find something like "jakarta" within two 
> words of "ap*che" and that hit has to be within ten words of something like 
> "solr" or that "lucene" regex.
> * Can require at least x number of hits at boolean level: "apache AND (lucene 
> solr tika)~2
> * Can use negative only query: -jakarta :: Find all docs that don't contain 
> "jakarta"
> * Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of 
> potential performance issues!).
> Trivial additions:
> * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance =1, 
> prefix =2)
> * Can specifiy Optimal String Alignment (OSA) vs Levenshtein for distance 
> <=2: (jakarta~1 (OSA) vs jakarta~>1(Levenshtein)
> This parser can be very useful for concordance tasks (see also LUCENE-5317 
> and LUCENE-5318) and for analytical search.  
> Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.
> Most of the documentation is in the javadoc for SpanQueryParser.
> Any and all feedback is welcome.  Thank you.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

Reply via email to