[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

Simon Willnauer (JIRA) Fri, 13 Nov 2009 03:37:18 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777465#action_12777465
 ]


Simon Willnauer commented on LUCENE-2039:
-----------------------------------------

Luis,
{quote}
    syntax:
    extension:fieldname:"syntax"

    examples:
    regexp:title:"/blah[a-z]+[0-9]+/" <- regexp extension, title index field
    complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field

    regexp_phrase::"/blah[a-z]+[0-9]+/" <- regexp extension, default field
    complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field

    title:"blah" <- regular field query
{quote}

This is pretty much what I suggested above. We can extend the query parser
without breaking backwards compatibility just by adding some code that is
aware of the field name scheme. Even this could be extensible. Field names are
terms and therefore cannot contain unescaped special chars like : { ] ...
I would not even hard-code the separator into the query parser but have the
field name processed by something pluggable. So if somebody wants to have a
regex extension they could use re\:field: or re\:: or re_field:....
Escaping a field is easy, just like you would do it with a term.
More interesting is that we do not change any syntax and add no special
character, yet we can ship a default implementation for extensions. This could
be a whole API which takes care of creating and escaping the field name,
building the query once it is passed to the extension, etc.
The first step resolves the extension; the second step calls the extension and
builds the query. If no extension is registered, the query parser works like in
previous versions, so it is all up to the user.
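
To make the two steps concrete, here is a rough sketch of what I have in mind.
It is not the patch: the class name FieldExtensionResolver, the register and
parse methods, and the use of plain strings in place of Lucene's Query are all
assumptions made only so the example stays self-contained. The input is the
field term after unescaping, so re\:title: arrives as "re:title".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch: strings stand in for Lucene Query objects.
class FieldExtensionResolver {
    private final char delimiter;      // pluggable separator, not part of the grammar
    private final String defaultField; // used when the field part is empty, e.g. "re:"
    // extension key -> builder that receives (field, raw text)
    private final Map<String, BiFunction<String, String, String>> extensions =
            new HashMap<>();

    FieldExtensionResolver(char delimiter, String defaultField) {
        this.delimiter = delimiter;
        this.defaultField = defaultField;
    }

    void register(String key, BiFunction<String, String, String> extension) {
        extensions.put(key, extension);
    }

    // fieldTerm is the already unescaped field name seen by the parser.
    String parse(String fieldTerm, String text) {
        // Step 1: resolve the extension from the field name.
        int idx = fieldTerm.indexOf(delimiter);
        if (idx >= 0) {
            String key = fieldTerm.substring(0, idx);
            BiFunction<String, String, String> ext = extensions.get(key);
            if (ext != null) {
                // Step 2: call the extension with the real field and the raw text.
                String field = fieldTerm.substring(idx + 1);
                return ext.apply(field.isEmpty() ? defaultField : field, text);
            }
        }
        // No extension registered: behave like the unmodified parser.
        return "TermQuery(" + fieldTerm + ":" + text + ")";
    }
}
```

Registering a builder under "re" routes re\:title: and re\:: to it, while
title: and any unregistered prefix fall through to the ordinary field query.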

@Adriano:
{quote}
The only part I disagree is when you pass the fieldname to the extension 
parser, I wouldn't implement that on the contrib parser, because it assumes the 
syntax always has field names. Anyway, for the core QP, I see the reason why 
you pass the fieldname
{quote}

You need the field to create your query in the extension. The field will
always be set to either the default field or the field explicitly defined in
the query, so there is no reason not to pass it.
I agree with you that we should wrap the information in a class so that we do
not need to change the method signature if something has to change in the
future. Instead we just add a new member to the wrapper.
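
A minimal sketch of such a wrapper, assuming hypothetical names (ExtensionInput,
QueryExtension, RegexExtensionSketch) and again using a string where the real
code would return a Lucene Query:

```java
// Hypothetical value object bundling everything an extension receives.
// New information becomes a new member here; the extension method
// signature below never has to change.
final class ExtensionInput {
    private final String field;   // default field or the field given in the query
    private final String rawText; // the uninterpreted query text

    ExtensionInput(String field, String rawText) {
        this.field = field;
        this.rawText = rawText;
    }

    String getField() {
        return field;
    }

    String getRawText() {
        return rawText;
    }
}

// Hypothetical extension contract: a single argument, the wrapper.
interface QueryExtension {
    String parse(ExtensionInput input);
}

// Example extension; the returned string stands in for a regex query object.
final class RegexExtensionSketch implements QueryExtension {
    @Override
    public String parse(ExtensionInput input) {
        return "RegexQuery(" + input.getField() + ":" + input.getRawText() + ")";
    }
}
```

The field is always present in the wrapper, so an extension that does not care
about it can simply ignore getField().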


> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
>                 Key: LUCENE-2039
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2039
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>            Reporter: Simon Willnauer
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2039.patch
>
>
> Since the early days the standard query parser was limited to the queries 
> living in core; adding other queries or extending the parser in any way 
> always forced people to change the grammar file and regenerate. Even if you 
> change the grammar you have to be extremely careful how you modify the parser 
> so that other parts of the standard parser are not affected by customisation 
> changes. Eventually you had to live with all the limitations the current 
> parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
> the chance to look at the tokens. 
> I was thinking about how to overcome the limitation and add regex support to 
> the query parser without introducing any dependency to core. I added a new 
> special character that basically prevents the parser from interpreting any of 
> the characters enclosed in the new special characters. I chose the forward 
> slash '/' as the delimiter so that everything in between two forward slashes 
> is basically escaped and ignored by the parser. All chars embedded within 
> forward slashes are treated as one token even if it contains other special 
> chars like * []?{} or whitespace. This token is subsequently passed to a 
> pluggable "parser extension" which builds a query from the embedded string. I 
> do not interpret the embedded string in any way but leave all the subsequent 
> work to the parser extension. Such an extension could be another 
> full-featured query parser itself or simply a ctor call for a regex query. The 
> interface remains quite simple but makes the parser extensible in an easy way 
> compared to modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char 
> into the syntax, but I guess that would not be that much of a deal as it is 
> reflected in the escape method. It would truly be nice to have more 
> than one extension and have this even more flexible, so treat this patch as a 
> kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK 
> version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
>   ... 
> }
> {code}
> which I would like better as it would be more consistent with the idea of the 
> query parser being a very strict and defined parser.
> I will upload a patch in a second which implements the extension-based 
> approach. I guess I will add a second patch with regex in core soon too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


