Thanks Gregor, I'll give it a try...

Iain

*****************************************
*  Micro Focus Developer Forum 2004     *
*  3 days that will make a difference   *
*  www.microfocus.com/devforum          *
*****************************************

-----Original Message-----
From: Gregor Heinrich [mailto:[EMAIL PROTECTED]
Sent: 15 December 2003 18:32
To: 'Lucene Users List'
Subject: RE: Disabling modifiers?


If you don't want to fiddle with the JavaCC source of QueryParser.jj, you
could work with a regular expression that works in front of the actual query
parser. I just did something similar because I input Lucene's query strings
into a latent semantic analysis algorithm and remove words with + and ?
wildcards, boosting modifiers as well as NOT and - clauses and groupings.
Such as:

    /**
     *  exclude words that have these modifiers
     */
    public final String excludeWildcards = "\\w+\\+|\\w+\\?";
    /**
     *  remove these operators
     */
    public final String removeOperators = "AND|OR|UND|ODER|&&|\\|\\|";
    /**
     *  remove these modifiers
     */
    public final String removeModifiers = "~[0-9\\.]*|~|\\^[0-9\\.]*|\\*";
    /**
     *  exclude phrases that have these modifiers
     */
    public final String excludeNot = "(NOT |\\-) *\\w+|(NOT|\\-)
*\\([^\\)]+\\)|(NOT |\\-) *\\\"[^\\\"]+\\\"";

    /**
     * remove any groupings
     */
    public final String removeGrouping = "[\\\"\\(\\)]";

You then create Pattern objects from the strings using Pattern.compile() and
can use and re-use the compiled patterns.

excludeWildcardsPattern = Pattern.compile(excludeWildcards);

lsaQ = excludeWildcardsPattern.matcher(q).replaceAll("");

This works fine for me. However, this 20 minutes approach does not recognise
nested parentheses with NOT or -, i.e.,
the term <tt>"NOT ((a OR b) AND (c OR d))"</tt> will result in the removal
of <tt>"NOT ((a OR b"</tt> and <tt>"c d"</tt> will still be in the output
query.

Best regards,

Gregor

-----Original Message-----
From: Iain Young [mailto:[EMAIL PROTECTED]
Sent: Monday, December 15, 2003 6:13 PM
To: Lucene mailing list (E-mail)
Subject: Disabling modifiers?


A quick question. Is there any way to disable the - and + modifiers in the
QueryParser? I'm trying to use Lucene to provide indexing of COBOL source
code, and allow me to highlight matches when the code is displayed. In COBOL
you can have variable names such as DISP-NAME and WS-DATE-1 for example.
Unfortunately the query parser interprets the - signs as modifiers and so
the query does not do what is required.

I've had a bit of success by putting quotes around the offending names, (as
suggested on this list), but the results are still less than satisfactory,
(it removes the "NOT" from the query, but still treats DISP and NAME as two
separate words rather than one word and so the results are not quite
correct).

Any ideas, or am I going to have to try and write my own query parser?

Thanks,
Iain


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


________________________________________________________________________
This e-mail has been scanned for viruses by MCI's Internet Managed Scanning
Services - powered by MessageLabs. For further information visit
http://www.mci.com
________________________________________________________________________

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to