Hi All,
I'm posting this again on the dev list, hoping to catch the attention of more
Lucene developers.
TIA,
victor
---------- Forwarded Message ----------
Subject: Re: '-' character not interpreted correctly in field names
Date: Thu, 10 Jul 2003 10:11:13 +1000
From: Victor Hadianto <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Yep, tried that. Actually, there is more to the creation of the field than
just this line:
fieldToken=<TERM> <COLON> { field = fieldToken.image; }
I've also created a <FIELDNAME> token that is exactly the same as <TERM>,
which looks like this:
| <FIELDNAME: <_TERM_START_CHAR> (<_TERM_CHAR>)* >
and changed fieldToken to:
fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }
And it doesn't work. A simple query such as to:tom* is parsed as a blank query.
I will continue looking at this problem and will post my solution if I find
one. In the meantime, I really do appreciate any help and suggestions.
cheers,
victor
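
One thing that may be worth checking here: as far as I know, JavaCC's token
manager takes the longest match and, when two token definitions match the same
length, picks the one declared first in the grammar. If that applies, a
<FIELDNAME> defined identically to <TERM> but listed after it would never be
produced, so "<FIELDNAME> <COLON>" could never match. The sketch below imitates
that rule in plain Java with java.util.regex. The class name, the rule map and
the simplified character class (escapes ignored) are assumptions made for
illustration; this is not the real QueryParser.jj or JavaCC code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TieBreakSketch {
    // Simplified stand-in for <_TERM_START_CHAR> (<_TERM_CHAR>)*, escapes ignored.
    static final String TERM_RE = "[^ \\t+\\-!():^\\[\\]\"{}~*?]+";

    /** Returns the name of the rule that wins at the start of the input. */
    static String firstToken(Map<String, Pattern> rules, String input) {
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Pattern> e : rules.entrySet()) {
            Matcher m = e.getValue().matcher(input);
            // A strictly longer match wins; on a tie the earlier rule is kept.
            if (m.lookingAt() && m.end() > best) {
                best = m.end();
                winner = e.getKey();
            }
        }
        return winner;
    }

    static String demo() {
        // LinkedHashMap preserves declaration order, like the grammar file.
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("TERM", Pattern.compile(TERM_RE));
        rules.put("FIELDNAME", Pattern.compile(TERM_RE)); // identical, declared later
        return firstToken(rules, "to:tom*");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // TERM
    }
}
```

In this imitation the leading "to" is always labelled TERM, never FIELDNAME,
because both rules match the same two characters and TERM comes first.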
On Thu, 10 Jul 2003 03:24 am, Eric Isakson wrote:
> You left out the ~ character in your _FIELDNAME_START_CHAR production. That
> character tells the grammar that it should take all the characters except
> the ones you specified (the complement).
>
> Change:
> | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",
>
> To:
> | <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",
>
> and it should probably work.
>
> Eric
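
To illustrate the point about the complement, here is a rough analogy in plain
Java: a java.util.regex character class "[...]" plays the role of JavaCC's
[...], and "[^...]" plays the role of ~[...]. This is only a sketch of the
idea, not how JavaCC itself works internally, and the class and method names
are made up for the example. The character list is the one from the
_TERM_START_CHAR production quoted further down.

```java
import java.util.regex.Pattern;

public class ComplementSketch {
    // The special characters from the grammar's list, as a regex class body.
    static final String SPECIALS = " \\t+\\-!():^\\[\\]\"{}~*?";

    // Without "~": matches only the listed special characters themselves.
    static final Pattern LISTED = Pattern.compile("[" + SPECIALS + "]");
    // With "~" (regex "^"): matches every character except the listed ones.
    static final Pattern COMPLEMENT = Pattern.compile("[^" + SPECIALS + "]");

    /** True if the first character of s is accepted by the pattern. */
    static boolean startsWith(Pattern p, String s) {
        return p.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsWith(LISTED, "fieldname"));     // false
        System.out.println(startsWith(COMPLEMENT, "fieldname")); // true
        System.out.println(startsWith(LISTED, "-name"));         // true
        System.out.println(startsWith(COMPLEMENT, "-name"));     // false
    }
}
```

Without the complement, an ordinary name such as "fieldname" cannot even start
a match, while a leading special character can.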
>
> -----Original Message-----
> From: Victor Hadianto [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 09, 2003 4:53 AM
> To: Lucene Users List
> Subject: Re: '-' character not interpreted correctly in field names
>
>
> Hi Erik and others,
>
> I'm looking for a similar solution where I need QueryParser not to drop the
> "-" characters from the field name. However, outside the field name I do
> want the - sign interpreted as the "not" modifier.
>
> I'm definitely not an expert in JavaCC, and to be honest I only have a
> limited idea of how Erik's suggestion works.
>
> Anyway I followed the suggestion and added the following:
> | <#_WHITESPACE: ( " " | "\t" ) >
> | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
>                               "[", "]", "\"", "{", "}", "~", "*", "?" ]
>                            | <_ESCAPED_CHAR> ) >
> | <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >
>
> and again below I added:
> | <TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >
> | <FIELDNAME: <_FIELDNAME_START_CHAR> (<_FIELDNAME_CHAR>)* >
>
> And I changed:
>
> LOOKAHEAD(2)
> fieldToken=<TERM> <COLON> { field = fieldToken.image; }
>
> to: ...
>
> LOOKAHEAD(2)
> fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }
>
>
> Well, after making all these modifications, every query that involves a
> field name causes problems. For example, if I search for
>
> fieldname:hello
>
> the query is blank (yes, blank, nothing in it),
>
> and if the field name does contain a dash ("-"), for example:
> field-name:hello
>
> the query is: +field -name
>
> and hello is dropped.
>
>
> Does anyone have any idea? Help and suggestions will be much appreciated. I
> really need to get this dash working; changing the field name will be my
> last resort, which I won't explore until I really have to.
>
>
> Thanks,
>
> Victor
>
> On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> > I think the query parser changes would not be too bad. I've outlined a
> > couple of relevant lines you should look at so you don't have to try
> > and comprehend the productions for the entire QueryParser. I do not
> > think I would like to have to maintain one of those myself though.
> > Your other unmentioned alternative is to choose field names that match
> > the <TERM> production of QueryParser.jj without escapes.
> >
> > QueryParser.jj line 557:
> > fieldToken=<TERM> <COLON> { field = fieldToken.image; }
> >
> > and earlier...
> > <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
> >                         "[", "]", "\"", "{", "}", "~", "*", "?" ] >
> > | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
> >                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
> >                        | <_ESCAPED_CHAR> ) >
> > | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
> >
> > ...
> >
> > <TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >
> >
> > So the characters you need to avoid in your field names are the ones
> > from _ESCAPED_CHAR, [ "\\", "+", "-", "!", "(", ")", ":", "^", "[",
> > "]", "\"", "{", "}", "~", "*", "?" ]
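
As a small sketch of the two alternatives discussed in this thread (avoiding
these characters in field names, or escaping them with a backslash), the helper
below checks a field name against the _ESCAPED_CHAR list above and builds an
escaped form. The class and method names are hypothetical and not part of
Lucene's API. Whether the escaped form is then parsed and stored the way you
want would need testing against the QueryParser version in use.

```java
public class FieldNameSketch {
    // Characters from the _ESCAPED_CHAR list quoted above.
    static final String SPECIALS = "\\+-!():^[]\"{}~*?";

    /** True if the name contains none of the listed characters. */
    static boolean isPlain(String name) {
        for (int i = 0; i < name.length(); i++) {
            if (SPECIALS.indexOf(name.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    /** Prefixes each listed character with a backslash. */
    static String escape(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (SPECIALS.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(isPlain("fieldname"));        // true
        System.out.println(isPlain("field-name"));       // false
        System.out.println(escape("foo-bar") + ":foo");  // foo\-bar:foo
    }
}
```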
> >
> > If you need to modify the parser, you will probably want to add a
> > FIELDNAME token and other supporting productions that look really
> > similar to these lines I've copied but modify the complement, ~[...],
> > at the beginning of _FIELDNAME_START_CHAR (you would add this
> > production) so it will match the "-" that you are using in your field
> > names (and fix it to match any other characters you want to use in
> > field names that it doesn't allow right now).
> >
> > Eric
> >
> > -----Original Message-----
> > From: Jon Pipitone [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, May 14, 2003 2:26 PM
> > To: Lucene Users List
> > Subject: Re: '-' character not interpreted correctly in field names
> >
> > Eric Isakson wrote:
> > > I just looked at the QueryParser.jj code, your field names never get
> > > processed by the analyzer. It does look like the query parser will
> > > honor escapes though. I haven't tried this, but try a query like
> > > "foo\-bar:foo" and have a look at the QueryParser.jj file for how it
> > > handles field names when parsing your query.
> >
> > Hrm.. that's what I had found too. So, you're saying that, other than
> > escaping dashes, I'd have to change QueryParser.. ?
> >
> > I'm not too familiar just yet with JavaCC syntax, so reading through
> > QueryParser is a little tough going. Thanks Eric,
> >
> > jp
> >
> > > -----Original Message-----
> > > From: Jon Pipitone [mailto:[EMAIL PROTECTED]
> > > Sent: Monday, May 12, 2003 4:03 PM
> > > To: Lucene Users List
> > > Subject: Re: '-' character not interpreted correctly in field names
> > >
> > >
> > > Hi Otis, Terry,
> > >
> > > >>>You can write a custom Analyzer that does not remove dashes from
> > > >>>tokens, and use it for both indexing and searching.
> > > >>>
> > > >>>This is a frequent question and answer on this list.
> > >
> > > Sorry for the noise, but I haven't been able to find a solution in
> > > the mailing list archives, or by writing my own analyzer:
> > >
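> > > import java.io.Reader;
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.CharTokenizer;
> > > import org.apache.lucene.analysis.TokenStream;
> > >
> > > public class MyAnalyzer extends Analyzer {
> > >     public TokenStream tokenStream(String fieldName, Reader reader) {
> > >         // Treat letters and dashes as token characters; split on the rest.
> > >         return new CharTokenizer(reader) {
> > >             protected boolean isTokenChar(char c) {
> > >                 return Character.isLetter(c) || c == '-';
> > >             }
> > >         };
> > >     }
> > > }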
> > >
> > > I parse a query like this:
> > >
> > > String queryString = "foo-bar:foo";
> > > Query queryResult =
> > >     QueryParser.parse(queryString, "body", new MyAnalyzer());
> > >
> > > With the output:
> > > body:foo -bar:foo
> > >
> > > But I would expect the output:
> > > foo-bar:foo
> > >
> > > If I print out the tokens that MyAnalyzer produces I do get
> > > "foo-bar" and then "foo".
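
The token output described here can be reproduced without Lucene. The
following is a plain-Java imitation of what a character tokenizer with this
isTokenChar rule does: it collects runs of accepted characters and splits on
everything else. It is a sketch for illustration only and does not use the
real Lucene classes; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenCharSketch {
    // Same rule as MyAnalyzer.isTokenChar in the message above.
    static boolean isTokenChar(char c) {
        return Character.isLetter(c) || c == '-';
    }

    /** Splits text into maximal runs of token characters. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                cur.append(c);
            } else if (cur.length() > 0) {
                // A non-token character ends the current token.
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("foo-bar:foo")); // [foo-bar, foo]
    }
}
```

So the tokenizing rule on its own keeps "foo-bar" intact, which suggests the
split seen in the parsed query happens somewhere other than the analyzer.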
> > >
> > > Any pointers on what I'm doing wrong?
> > >
> > > jp
> > >
> > >>>>--- Jon Pipitone <[EMAIL PROTECTED]> wrote:
> > >>>>>Hi all,
> > >>>>>
> > >>>>>>I believe that the tokenizer treats a dash as a token separator.
> > >>>>>>Hence, the only way, as I recall, to eliminate this behavior is
> > >>>>>>to modify QueryParser.jj so it doesn't do this. However, doing
> > >>>>>>this can cause some other problems, like hyphenated words at a
> > >>>>>>line break and the like.
> > >>>>>
> > >>>>>I've recently started using lucene and I'm running into the same
> > >>>>>issue with the query parser. I'd like to use queries that contain
> > >>>>>dashes in the field name, but as far as I can tell it seems that
> > >>>>>the current query grammar treats field names as terms, and so, as
> > >>>>>Terry notes, a dash becomes a token separator.
> > >>>>>
> > >>>>>Terry suggests modifying the QueryParser.jj -- I would suspect by
> > >>>>>creating a separate non-terminal for field names.
> > >>>>>
> > >>>>>Has anyone done any work on this already? Is modifying
> > >>>>>QueryParser.jj the best approach?
> > >>>>>
> > >>>>>Thanks,
> > >>>>>jp
>