[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029677#comment-14029677 ]
Tim Allison commented on LUCENE-5205:
-------------------------------------

Sounds good. I agree that a separate issue would probably be best, and I think extending SpanQueryParser or one of its base classes might be the best route to go. We can make whatever changes are necessary in the base classes to make extension easier/actually work. Things will get a bit more complicated if we have to change the lexer, which I think we will, but we can burn that bridge when we get to it.

I looked over the javadocs last night, and wow, that is a great capability.

For new syntax, you're right: / / are taken for regexes. How about pipes to represent the hierarchical components: |a/b/c|

I probably won't be on git until early next week...not as much extracurricular time available as I'd like, but I look forward to getting set up and collaborating on this.

I'd want to get a +1 from [~rcmuir] on LUCENE-5470 and LUCENE-5504 before making those changes to trunk/4.x, but I think that they would be quite useful.

On a related issue, I'm wondering if I should add FieldMaskingSpanQueries to the SpanOnlyParser. The original reason that I required a single field was that I was using it for concordancing of a single field (LUCENE-5317)...re-analyzing to get character offsets for span hits. I don't think that adding the FieldMaskingSpanQueries will wreck that. Will need to take a look though.
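To make the pipe idea concrete, here is a minimal sketch of how a token such as |a/b/c| might be recognized and split into its hierarchical components before it reaches the span parser. Everything in it is an assumption for illustration: the class name PipePathToken and the method parse are hypothetical and not part of the LUCENE-5205 patch, escaping of | or / inside a component is ignored, and the real change would probably have to live in the lexer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the LUCENE-5205 patch. */
final class PipePathToken {

  /**
   * Returns the path components of a pipe-delimited token such as |a/b/c|,
   * or null if the token is not wrapped in pipes (for example a /regex/).
   */
  static List<String> parse(String token) {
    if (token == null || token.length() < 3
        || token.charAt(0) != '|'
        || token.charAt(token.length() - 1) != '|') {
      return null;
    }
    String inner = token.substring(1, token.length() - 1);
    List<String> parts = new ArrayList<>();
    // A limit of -1 keeps trailing empty strings so that |a/b/| is rejected below.
    for (String part : inner.split("/", -1)) {
      if (part.isEmpty()) {
        throw new IllegalArgumentException("empty path component in " + token);
      }
      parts.add(part);
    }
    return parts;
  }

  public static void main(String[] args) {
    System.out.println(parse("|a/b/c|"));   // pipe-delimited: split into components
    System.out.println(parse("/[mb]oat/")); // slash-delimited regex: left alone
  }
}
```

Because the pipes only mark the outer boundary, the slashes inside them no longer collide with the / / regex delimiters.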
> [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to
> classic QueryParser
> -----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5205
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/queryparser
>            Reporter: Tim Allison
>              Labels: patch
>             Fix For: 4.9
>
>         Attachments: LUCENE-5205-cleanup-tests.patch,
> LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz,
> LUCENE-5205_dateTestReInitPkgPrvt.patch,
> LUCENE-5205_improve_stop_word_handling.patch,
> LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch,
> SpanQueryParser_v1.patch.gz, patch.txt
>
>
> This parser extends QueryParserBase and includes functionality from:
> * Classic QueryParser: most of its syntax
> * SurroundQueryParser: recursive parsing for "near" and "not" clauses.
> * ComplexPhraseQueryParser: can handle "near" queries that include multiterms
> (wildcard, fuzzy, regex, prefix).
> * AnalyzingQueryParser: has an option to analyze multiterms.
>
> At a high level, there's a first pass BooleanQuery/field parser and then a
> span query parser handles all terminal nodes and phrases.
>
> Same as classic syntax:
> * term: test
> * fuzzy: roam~0.8, roam~2
> * wildcard: te?t, test*, t*st
> * regex: /[mb]oat/
> * phrase: "jakarta apache"
> * phrase with slop: "jakarta apache"~3
> * default "or" clause: jakarta apache
> * grouping "or" clause: (jakarta apache)
> * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
> * multiple fields: title:lucene author:hatcher
>
> Main additions in SpanQueryParser syntax vs. classic syntax:
> * Can require "in order" for phrases with slop with the ~> operator:
> "jakarta apache"~>3
> * Can specify "not near": "fever bieber"!~3,10 ::
> find "fever" but not if "bieber" appears within 3 words before or 10
> words after it.
> * Fully recursive phrasal queries with [ and ]; as in:
> [[jakarta apache]~3 lucene]~>4 ::
> find "jakarta" within 3 words of "apache", and that hit has to be within
> four words before "lucene"
> * Can also use [] for single level phrasal queries instead of " as in:
> [jakarta apache]
> * Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)"~3
> :: find "apache" and then either "lucene" or "solr" within three words.
> * Can use multiterms in phrasal queries: "jakarta~1 ap*che"~2
> * Did I mention full recursion: [[jakarta~1 ap*che]~2 (solr~
> /l[ou]+[cs][en]+/)]~10 :: Find something like "jakarta" within two
> words of "ap*che" and that hit has to be within ten words of something like
> "solr" or that "lucene" regex.
> * Can require at least x number of hits at boolean level: apache AND (lucene
> solr tika)~2
> * Can use negative only query: -jakarta :: Find all docs that don't contain
> "jakarta"
> * Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of
> potential performance issues!).
>
> Trivial additions:
> * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance = 1,
> prefix = 2)
> * Can specify Optimal String Alignment (OSA) vs Levenshtein for distance
> <= 2: jakarta~1 (OSA) vs jakarta~>1 (Levenshtein)
>
> This parser can be very useful for concordance tasks (see also LUCENE-5317
> and LUCENE-5318) and for analytical search.
> Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.
> Most of the documentation is in the javadoc for SpanQueryParser.
> Any and all feedback is welcome. Thank you.
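For readers unfamiliar with the "not near" operator in the quoted description, the following plain-Java sketch spells out the intended semantics of "fever bieber"!~3,10 over a list of token positions. It is only an illustration of the wording above, not how the parser or Lucene's span queries compute matches: the class NotNearSketch and its method are hypothetical, tokens are compared as exact strings, and treating "within N words" as inclusive of position N is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of "not near" semantics; no Lucene classes involved. */
final class NotNearSketch {

  /**
   * Returns the positions of {@code include} that have no {@code exclude}
   * token within {@code pre} positions before or {@code post} positions after.
   */
  static List<Integer> notNear(String[] tokens, String include, String exclude,
                               int pre, int post) {
    List<Integer> hits = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
      if (!tokens[i].equals(include)) {
        continue;
      }
      // Clamp the exclusion window to the bounds of the token array.
      int from = Math.max(0, i - pre);
      int to = Math.min(tokens.length - 1, i + post);
      boolean blocked = false;
      for (int j = from; j <= to; j++) {
        if (j != i && tokens[j].equals(exclude)) {
          blocked = true;
          break;
        }
      }
      if (!blocked) {
        hits.add(i);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    String[] doc = "fever pitch is a book bieber".split(" ");
    // "bieber" sits 5 positions after "fever": inside a post window of 10, outside one of 3.
    System.out.println(notNear(doc, "fever", "bieber", 3, 10));
    System.out.println(notNear(doc, "fever", "bieber", 3, 3));
  }
}
```

The asymmetric pre and post distances are what distinguish this from a plain boolean NOT: a document that contains "bieber" can still match as long as the term falls outside the window around "fever".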