subject:"'\-' character not interpreted correctly in field names"

Re: '-' character not interpreted correctly in field names (solution)

2003-07-10 Thread Victor Hadianto

Okay, attached is the diff that lets t-shirt be interpreted as the single term
"t-shirt". Queries that start with a "-" character still behave as expected, or
at least as we expected.

For example, -shirt +pants is still parsed as -shirt +pants.

One thing I need to mention (I dug this up from an earlier discussion on this
list) is what Doug Cutting said about a similar change that someone else
proposed:

< cut --->
Lixin Meng wrote:
> Therefore, it would be preferable to treat all hyphens in the same way.
> Either as a delimiter or as part of the word (maybe with a flag at the API).

If we change StandardTokenizer in this way then we risk breaking all the 
applications that currently use it and depend on its current behaviour. 
So I'm reluctant to make this change.

From the StandardTokenizer documentation:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

"Many applications have specific tokenizer needs. If this tokenizer does 
not suit your application, please consider copying this source code 
directory to your project and maintaining your own grammar-based tokenizer."

Also, if you construct a tokenizer that you think is more generally 
useful than StandardTokenizer, please contribute it by mailing it to one 
of the Lucene mailing lists.

Thanks,

Doug

< cut >

So yes, this change _may_ break other existing applications.

cheers,
victor


On Thu, 10 Jul 2003 08:34 pm, Otis Gospodnetic wrote:
> I think this is a fine change, that others would welcome, too.
> No?
> Does your change work with queries that start with a '-' character?
> For example: -shirt +pants
> (note: no space before '-shirt')
>
> If so, I think we could include this change in QueryParser.jj if you
> send the diff, as I recall others wondering why queries like t-shirt
> get misinterpreted as +t -shirt.
>
> Thanks,
> Otis
>
> --- Victor Hadianto <[EMAIL PROTECTED]> wrote:
> > Eric and others,
> >
> > I finally found a solution for this problem, although it is really
> > specific to our needs.
> >
> > The simplest solution in the end is to redefine what a "Term" is. At the
> > moment, QueryParser parses the following:
> >
> > t-shirt as
> >
> > +t -shirt
> >
> > Which, in my opinion, is not really acceptable. A more sensible parser
> > would parse "t-shirt" as "t-shirt". If a user wants to query for "t"
> > without the word "shirt" in it, then the query should really be:
> >
> > t -shirt
> >  ^ space here.
> >
> > Similarly, a field query such as:
> >
> > model:t-shirt
> >
> > should really be interpreted as "model:t-shirt", not +model:t -shirt.
> > I think it really makes more sense to require a space before the "-"
> > to identify a NOT query.
> >
> > Onward to the code change. As I said earlier, it is specific to our
> > application and thus may not be relevant to most other people. Some of
> > our field names have the "-" sign in them. Changing the TERM_CHAR
> > definition to:
> >
> > <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >
> >
> > makes QueryParser compatible with our needs.
> >
> >
> > Cheers,
> >
> > Victor
> >
Index: src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj
===
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj,v
retrieving revision 1.3
diff -u -u -r1.3 StandardTokenizer.jj
--- src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj   5 Jun 2002 04:54:47 -   1.3
+++ src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj   11 Jul 2003 00:36:15 -
@@ -95,6 +95,9 @@
   // use a post-filter to remove possesives
 |  ("'" )+ >
 
+  // Include search string such as t-shirt, x-mailer or mail-client.
+|  ("-" )+ >
+
   // acronyms: U.S.A., I.B.M., etc.
   // use a post-filter to remove dots
 |  "." ( ".")+ >
@@ -177,6 +180,7 @@
 {
   ( token =  |
 token =  |
+token =  | 
 token =  |
 token =  |
 token =  |
Index: src/java/org/apache/lucene/queryParser/QueryParser.jj
===
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/queryParser/QueryParser.jj,v
retrieving revision 1.29
diff -u -u -r1.29 QueryParser.jj
--- src/java/org/apache/lucene/queryParser/QueryParser.jj   20 Mar 2003 18:28:13 -  1.29
+++ src/java/org/apache/lucene/queryParser/QueryParser.jj   11 Jul 2003 00:36:15 -
@@ -439,7 +439,7 @@
 | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
                        | <_ESCAPED_CHAR> ) >
-| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
+| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >
 | <#_WHITESPACE: ( " " | "\t" ) >
 }

Re: '-' character not interpreted correctly in field names (solution)

2003-07-10 Thread Jan Agermose

+1

- Original Message - 
From: "Eric Jain" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, July 10, 2003 12:53 PM
Subject: Re: '-' character not interpreted correctly in field names
(solution)


> > I think this is a fine change, that others would welcome, too.
> > No?
> > Does your change work with queries that start with a '-' character?
> > For example: -shirt +pants
> > (note: no space before '-shirt')
> >
> > If so, I think we could include this change in QueryParser.jj if you
> > send the diff, as I recall others wondering why queries like t-shirt
> > get misinterpreted as +t -shirt.
>
> +1
>
> --
> Eric Jain
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: '-' character not interpreted correctly in field names (solution)

2003-07-10 Thread Eric Jain

> I think this is a fine change, that others would welcome, too.
> No?
> Does your change work with queries that start with a '-' character?
> For example: -shirt +pants
> (note: no space before '-shirt')
> 
> If so, I think we could include this change in QueryParser.jj if you
> send the diff, as I recall others wondering why queries like t-shirt
> get misinterpreted as +t -shirt.

+1

--
Eric Jain



Re: '-' character not interpreted correctly in field names (solution)

2003-07-10 Thread Otis Gospodnetic

I think this is a fine change, that others would welcome, too.
No?
Does your change work with queries that start with a '-' character?
For example: -shirt +pants
(note: no space before '-shirt')

If so, I think we could include this change in QueryParser.jj if you
send the diff, as I recall others wondering why queries like t-shirt
get misinterpreted as +t -shirt.

Thanks,
Otis

--- Victor Hadianto <[EMAIL PROTECTED]> wrote:
> Eric and others,
> 
> I finally found a solution for this problem, although it is really
> specific to our needs.
> 
> The simplest solution in the end is to redefine what a "Term" is. At the
> moment, QueryParser parses the following:
> 
> t-shirt as
> 
> +t -shirt
> 
> Which, in my opinion, is not really acceptable. A more sensible parser
> would parse "t-shirt" as "t-shirt". If a user wants to query for "t"
> without the word "shirt" in it, then the query should really be:
> 
> t -shirt
>  ^ space here.
> 
> Similarly, a field query such as:
> 
> model:t-shirt
> 
> should really be interpreted as "model:t-shirt", not +model:t -shirt.
> I think it really makes more sense to require a space before the "-"
> to identify a NOT query.
> 
> Onward to the code change. As I said earlier, it is specific to our
> application and thus may not be relevant to most other people. Some of
> our field names have the "-" sign in them. Changing the TERM_CHAR
> definition to:
> 
> <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >
> 
> makes QueryParser compatible with our needs.
> 
> 
> Cheers,
> 
> Victor
> 
> 
> On Thu, 10 Jul 2003 10:11 am, Victor Hadianto wrote:
> > Yep tried that. Actually there is more to the creation of the field than
> > just in this line:
> >
> > fieldToken=  { field = fieldToken.image; }
> >
> >
> > Because I've created a  which is exactly the same with  which
> > looks like this:
> > |  (<_TERM_CHAR>)*  >
> >
> > and change fieldToken to:
> >
> > fieldToken=  { field = fieldToken.image; }
> >
> > And it doesn't work. A simple query such as to:tom* is parsed as a blank
> > query.
> >
> > I will continue looking at this problem and will post my solution if I get
> > it. In the meantime, I really do appreciate any help and suggestions.
> >
> > cheers,
> >
> > victor
> >
> > On Thu, 10 Jul 2003 03:24 am, Eric Isakson wrote:
> > > You left out the ~ character in your _FIELDNAME_START_CHAR production.
> > > That character tells the grammar that it should take all the characters
> > > except the ones you specified (the complement).
> > >
> > > Change:
> > > | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",
> > >
> > > To:
> > > | <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",
> > >
> > > and it should probably work.
> > >
> > > Eric
> > >
> > > -Original Message-
> > > From: Victor Hadianto [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, July 09, 2003 4:53 AM
> > > To: Lucene Users List
> > > Subject: Re: '-' character not interpreted correctly in field
> names
> > >
> > >
> > > Hi Eric and others,
> > >
> > > I'm looking for a similar solution where I need QueryParser not to drop
> > > the "-" characters from the field name. However, outside the field name
> > > I do want the - sign interpreted as the "not" modifier.
> > >
> > > I'm definitely not an expert in JavaCC, and to be honest I only have a
> > > limited idea of how Eric's suggestion works.
> > >
> > > Anyway, I followed the suggestion and added the following:
> > > | <#_WHITESPACE: ( " " | "\t" ) >
> > > | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
> > >    "[", "]", "\"", "{", "}", "~",
>

Re: '-' character not interpreted correctly in field names (solution)

2003-07-09 Thread Victor Hadianto

Eric and others,

I finally found a solution for this problem, although it is really specific to 
our needs.

The simplest solution in the end is to redefine what a "Term" is. At the 
moment, QueryParser parses the following:

t-shirt as

+t -shirt

Which, in my opinion, is not really acceptable. A more sensible parser would 
parse "t-shirt" as "t-shirt". If a user wants to query for "t" without 
the word "shirt" in it, then the query should really be:

t -shirt
 ^ space here.

Similarly, a field query such as:

model:t-shirt

should really be interpreted as "model:t-shirt", not +model:t -shirt. I think it 
really makes more sense to require a space before the "-" to identify a NOT 
query.

Onward to the code change. As I said earlier, it is specific to our 
application and thus may not be relevant to most other people. Some of 
our field names have the "-" sign in them. Changing the TERM_CHAR 
definition to:

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >

makes QueryParser compatible with our needs.
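
To illustrate the effect of that one-line change outside of JavaCC, here is a
minimal Java sketch that mimics the two TERM_CHAR definitions with
java.util.regex. The class TermCharSketch and its scan method are hypothetical
names made up for this illustration, escaped characters are left out, and this
is not the actual QueryParser code. It only shows why "t-shirt" stops being
split once "-" is allowed inside a term but still not at its start.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the generated QueryParser.
final class TermCharSketch {
    // Regex analogue of _TERM_START_CHAR: any character except the listed
    // special ones (escaped characters are left out of this sketch).
    private static final String START = "[^ \\t+\\-!():^\\[\\]\"{}~*?]";

    // Splits a query into terms and +/- modifiers. When hyphenInTerm is true,
    // "-" is also accepted after the first character of a term.
    static List<String> scan(String query, boolean hyphenInTerm) {
        String termChar = hyphenInTerm ? "(?:" + START + "|-)" : START;
        Pattern token = Pattern.compile(START + termChar + "*|[+\\-]");
        List<String> tokens = new ArrayList<>();
        Matcher m = token.matcher(query);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("t-shirt", false));
        System.out.println(scan("t-shirt", true));
        System.out.println(scan("-shirt +pants", true));
    }
}
```

With hyphenInTerm set to false, "t-shirt" comes out as the three tokens t, -,
shirt. With it set to true, "t-shirt" is a single token, while "-shirt +pants"
still starts with a separate "-" modifier.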


Cheers,

Victor


On Thu, 10 Jul 2003 10:11 am, Victor Hadianto wrote:
> Yep tried that. Actually there is more to the creation of the field than
> just in this line:
>
> fieldToken=  { field = fieldToken.image; }
>
>
> Because I've created a  which is exactly the same with 
> which
>
> looks like this:
> |  (<_TERM_CHAR>)*  >
>
> and change fieldToken to:
>
> fieldToken=  { field = fieldToken.image; }
>
> And it doesn't work. A simple query such as to:tom* is parsed as a blank
> query.
>
> I will continue looking at this problem and will post my solution if I get
> it. In the meantime, I really do appreciate any help and suggestions.
>
> cheers,
>
> victor
>
> On Thu, 10 Jul 2003 03:24 am, Eric Isakson wrote:
> > You left out the ~ character in your _FIELDNAME_START_CHAR production.
> > That character tells the grammar that it should take all the characters
> > except the ones you specified (the complement).
> >
> > Change:
> > | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",
> >
> > To:
> > | <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",
> >
> > and it should probably work.
> >
> > Eric
> >
> > -Original Message-
> > From: Victor Hadianto [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 09, 2003 4:53 AM
> > To: Lucene Users List
> > Subject: Re: '-' character not interpreted correctly in field names
> >
> >
> > Hi Eric and others,
> >
> > I'm looking for a similar solution where I need QueryParser not to drop
> > the "-" characters from the field name. However, outside the field name
> > I do want the - sign interpreted as the "not" modifier.
> >
> > I'm definitely not an expert in JavaCC, and to be honest I only have a
> > limited idea of how Eric's suggestion works.
> >
> > Anyway, I followed the suggestion and added the following:
> > | <#_WHITESPACE: ( " " | "\t" ) >
> > | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
> >    "[", "]", "\"", "{", "}", "~", "*", "?" ]
> >  | <_ESCAPED_CHAR> ) >
> > | <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >
> >
> > and again below I added:
> > |  (<_TERM_CHAR>)*  >
> > |  (<_FIELDNAME_CHAR>)*  >
> >
> > And I changed:
> >
> > LOOKAHEAD(2)
> > fieldToken=  { field = fieldToken.image; }
> >
> > to: ...
> >
> > LOOKAHEAD(2)
> > fieldToken=  { field = fieldToken.image; }
> >
> >
> > Well, after making all these modifications, every query that involves a
> > field name causes problems. For example, if I search for
> >
> > fieldname:hello
> >
> > The query is blank (yes blank, nothing in it)
> >
> > and if the field name does contain a dash ("-"), for example:
> > field-name:hello
> >
> > The query is: +field -name
> >
>

Re: '-' character not interpreted correctly in field names

2003-07-09 Thread Victor Hadianto

Yep tried that. Actually there is more to the creation of the field than just 
in this line:

fieldToken=  { field = fieldToken.image; }


Because I've created a  which is exactly the same with  which 
looks like this:
|  (<_TERM_CHAR>)*  >

and change fieldToken to:

fieldToken=  { field = fieldToken.image; }

And it doesn't work. A simple query such as to:tom* is parsed as a blank query.

I will continue looking at this problem and will post my solution if I get it. 
In the meantime, I really do appreciate any help and suggestions.

cheers,

victor


On Thu, 10 Jul 2003 03:24 am, Eric Isakson wrote:
> You left out the ~ character in your _FIELDNAME_START_CHAR production. That
> character tells the grammar that it should take all the characters except
> the ones you specified (the complement).
>
> Change:
> | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",
>
> To:
> | <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",
>
> and it should probably work.
>
> Eric
>
> -Original Message-
> From: Victor Hadianto [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 09, 2003 4:53 AM
> To: Lucene Users List
> Subject: Re: '-' character not interpreted correctly in field names
>
>
> Hi Eric and others,
>
> I'm looking for a similar solution where I need QueryParser not to drop the
> "-" characters from the field name. However, outside the field name I do
> want the - sign interpreted as the "not" modifier.
>
> I'm definitely not an expert in JavaCC, and to be honest I only have a
> limited idea of how Eric's suggestion works.
>
> Anyway, I followed the suggestion and added the following:
> | <#_WHITESPACE: ( " " | "\t" ) >
> | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
>    "[", "]", "\"", "{", "}", "~", "*", "?" ]
>  | <_ESCAPED_CHAR> ) >
> | <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >
>
> and again below I added:
> |  (<_TERM_CHAR>)*  >
> |  (<_FIELDNAME_CHAR>)*  >
>
> And I changed:
>
> LOOKAHEAD(2)
> fieldToken=  { field = fieldToken.image; }
>
> to: ...
>
> LOOKAHEAD(2)
> fieldToken=  { field = fieldToken.image; }
>
>
> Well, after making all these modifications, every query that involves a
> field name causes problems. For example, if I search for
>
> fieldname:hello
>
> The query is blank (yes blank, nothing in it)
>
> and if the field name does contain a dash ("-"), for example:
> field-name:hello
>
> The query is: +field -name
>
> hello is dropped.
>
>
> Does anyone have any idea? Help and suggestions would be much appreciated. I
> really need to get this dash working; changing the field names is my
> last resort, which I won't explore until I really have to.
>
>
> Thanks,
>
> Victor
>
> On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> > I think the query parser changes would not be too bad, I've outlined a
> > couple of relevant lines you should look at so you don't have to try
> > and comprehend the productions for the entire QueryParser. I do not
> > think I would like to have to maintain one of those myself though.
> > Your other unmentioned alternative is to choose field names that match
> > the  production of QueryParser.jj without escapes.
> >
> > QueryParser.jj line 557:
> > fieldToken=  { field = fieldToken.image; }
> >
> > and earlier...
> >  <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
> >   "[", "]", "\"", "{", "}", "~", "*", "?" ] >
> >
> > | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
> >                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
> >                        | <_ESCAPED_CHAR> ) >
> > | <#_

RE: '-' character not interpreted correctly in field names

2003-07-09 Thread Eric Isakson

You left out the ~ character in your _FIELDNAME_START_CHAR production. That character 
tells the grammar that it should take all the characters except the ones you specified 
(the complement).

Change:

| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", 

To:

| <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", 

and it should probably work.
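
The same distinction exists in ordinary regular expressions, where a leading ^
inside a character class plays the role of JavaCC's ~. The following Java
sketch is only an analogy under that assumption: ComplementSketch and
acceptsFirstChar are hypothetical names, and the character list is the one from
the _FIELDNAME_START_CHAR production quoted below, with escapes omitted. It
shows that without the complement the class accepts only the special
characters, so an ordinary name such as "fieldname" can never start a token.

```java
import java.util.regex.Pattern;

// Hypothetical analogy; JavaCC's ~[...] corresponds to [^...] in a regex.
final class ComplementSketch {
    // The special characters from the production, escapes omitted.
    private static final String LIST = " \\t+\\-!():^\\[\\]\"{}~*?";

    // Without the complement: matches ONLY the special characters.
    static final Pattern LISTED = Pattern.compile("[" + LIST + "]");
    // With the complement: matches everything EXCEPT the special characters.
    static final Pattern COMPLEMENT = Pattern.compile("[^" + LIST + "]");

    // True if the first character of s is accepted as a field-name start.
    static boolean acceptsFirstChar(Pattern startChar, String s) {
        return !s.isEmpty() && startChar.matcher(s.substring(0, 1)).matches();
    }

    public static void main(String[] args) {
        System.out.println(acceptsFirstChar(LISTED, "fieldname"));
        System.out.println(acceptsFirstChar(COMPLEMENT, "fieldname"));
    }
}
```

Here the LISTED pattern rejects the "f" of "fieldname" and accepts a leading
"-", while the COMPLEMENT pattern does the opposite.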

Eric

-----Original Message-----
From: Victor Hadianto [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 09, 2003 4:53 AM
To: Lucene Users List
Subject: Re: '-' character not interpreted correctly in field names


Hi Eric and others,

I'm looking for a similar solution where I need QueryParser not to drop the 
"-" characters from the field name. However, outside the field name I do want 
the - sign interpreted as the "not" modifier. 

I'm definitely not an expert in JavaCC, and to be honest I only have a limited 
idea of how Eric's suggestion works.

Anyway, I followed the suggestion and added the following:

| <#_WHITESPACE: ( " " | "\t" ) >
| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
   "[", "]", "\"", "{", "}", "~", "*", "?" ]
 | <_ESCAPED_CHAR> ) >
| <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >

and again below I added:


|  (<_TERM_CHAR>)*  >
|  (<_FIELDNAME_CHAR>)*  >

And I changed:

LOOKAHEAD(2)
fieldToken=  { field = fieldToken.image; }

to: ...

LOOKAHEAD(2)
fieldToken=  { field = fieldToken.image; }


Well, after making all these modifications, every query that involves a field 
name causes problems. For example, if I search for

fieldname:hello

The query is blank (yes blank, nothing in it)

and if the field name does contain a dash ("-"), for example: field-name:hello

The query is: +field -name

hello is dropped.


Does anyone have any idea? Help and suggestions would be much appreciated. I 
really need to get this dash working; changing the field names is my last 
resort, which I won't explore until I really have to.


Thanks,

Victor


On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> I think the query parser changes would not be too bad, I've outlined a 
> couple of relevant lines you should look at so you don't have to try 
> and comprehend the productions for the entire QueryParser. I do not 
> think I would like to have to maintain one of those myself though. 
> Your other unmentioned alternative is to choose field names that match 
> the  production of QueryParser.jj without escapes.
>
> QueryParser.jj line 557:
> fieldToken=  { field = fieldToken.image; }
>
> and earlier...
>  <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
>   "[", "]", "\"", "{", "}", "~", "*", "?" ] >
>
> | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
>                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
>                        | <_ESCAPED_CHAR> ) >
> | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
>
> ...
>
>  (<_TERM_CHAR>)*  >
>
> So the characters you need to avoid in your field names are the ones 
> from _ESCAPED_CHAR, [ "\\", "+", "-", "!", "(", ")", ":", "^", "[", 
> "]", "\"", "{", "}", "~", "*", "?" ]
>
> If you need to modify the parser, you will probably want to add a 
> FIELDNAME token and other supporting productions that look really 
> similar to these lines I've copied but modify the complement, ~[...], 
> at the beginning of _FIELDNAME_START_CHAR (you would add this 
> production) so it will match the "-" that you are using in your field 
> names (and fix it to match any other characters you want to use in 
> field names that it doesn't allow right now).
>
> Eric
>
> -Original Message-
> From: Jon Pipitone [mailto:[EMAIL

Re: '-' character not interpreted correctly in field names

2003-07-09 Thread Victor Hadianto

Hi Eric and others,

I'm looking for a similar solution where I need QueryParser not to drop the 
"-" characters from the field name. However, outside the field name I do want 
the - sign interpreted as the "not" modifier. 

I'm definitely not an expert in JavaCC, and to be honest I only have a limited 
idea of how Eric's suggestion works.

Anyway, I followed the suggestion and added the following:

| <#_WHITESPACE: ( " " | "\t" ) >
| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
   "[", "]", "\"", "{", "}", "~", "*", "?" ]
 | <_ESCAPED_CHAR> ) >
| <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >

and again below I added:


|  (<_TERM_CHAR>)*  >
|  (<_FIELDNAME_CHAR>)*  >

And I changed:

LOOKAHEAD(2)
fieldToken=  { field = fieldToken.image; }

to: ...

LOOKAHEAD(2)
fieldToken=  { field = fieldToken.image; }


Well, after making all these modifications, every query that involves a field 
name causes problems. For example, if I search for

fieldname:hello

The query is blank (yes blank, nothing in it)

and if the field name does contain a dash ("-"), for example: field-name:hello

The query is: +field -name

hello is dropped.


Does anyone have any idea? Help and suggestions would be much appreciated. I 
really need to get this dash working; changing the field names is my last 
resort, which I won't explore until I really have to.


Thanks,

Victor


On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> I think the query parser changes would not be too bad, I've outlined a
> couple of relevant lines you should look at so you don't have to try and
> comprehend the productions for the entire QueryParser. I do not think I
> would like to have to maintain one of those myself though. Your other
> unmentioned alternative is to choose field names that match the 
> production of QueryParser.jj without escapes.
>
> QueryParser.jj line 557:
> fieldToken=  { field = fieldToken.image; }
>
> and earlier...
>  <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
>   "[", "]", "\"", "{", "}", "~", "*", "?" ] >
>
> | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
>                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
>                        | <_ESCAPED_CHAR> ) >
> | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
>
> ...
>
>  (<_TERM_CHAR>)*  >
>
> So the characters you need to avoid in your field names are the ones from
> _ESCAPED_CHAR, [ "\\", "+", "-", "!", "(", ")", ":", "^", "[", "]", "\"",
> "{", "}", "~", "*", "?" ]
>
> If you need to modify the parser, you will probably want to add a FIELDNAME
> token and other supporting productions that look really similar to these
> lines I've copied but modify the complement, ~[...], at the beginning of
> _FIELDNAME_START_CHAR (you would add this production) so it will match the
> "-" that you are using in your field names (and fix it to match any other
> characters you want to use in field names that it doesn't allow right now).
>
> Eric
>
> -Original Message-
> From: Jon Pipitone [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 14, 2003 2:26 PM
> To: Lucene Users List
> Subject: Re: '-' character not interpreted correctly in field names
>
> Eric Isakson wrote:
> > I just looked at the QueryParser.jj code, your field names
> > never get processed by the analyzer. It does look like the
> > query parser will honor escapes though. I haven't tried
> > this, but try a query like "foo\-bar:foo" and have
> > a look at the QueryParser.jj file for how it handles field
> > names when parsing your query.
>
> Hrm.. that's what I had found too.  So, you're saying that, other than
> escaping dashes, I'd have to change QueryParser.. ?
>
> I'm not too familiar ju

Re: '-' character not interpreted correctly in field names

2003-02-03 Thread Tatu Saloranta

On Monday 03 February 2003 07:19, Terry Steichen wrote:
> I believe that the tokenizer treats a dash as a token separator.  Hence,
> the only way, as I recall, to eliminate this behavior is to modify
> QueryParser.jj so it doesn't do this.  However, doing this can cause some
> other problems, like hyphenated words at a line break and the like.

Might it be enough to just replace the analyzer passed in to QueryParser?
That would work if QueryParser only handles modifiers outside terms and passes
the terms themselves to the analyzer. I think this is the case (QueryParser does
call the analyzer in a couple of places, and one word may actually expand to a
phrase or vice versa).

Still, it seems like using a hyphen as a separator shouldn't necessarily cause 
big problems when the indexer does the same; queries against "2 - 5" would be 
phrase queries for "2 5", which is still reasonably specific (and should 
match the content).

On the other hand, the simple analyzer and the standard analyzer have quite
different tokenization rules, so it's important to make sure the same analyzer
is used for both indexing and searching (a mismatch can easily prevent matches).
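
A rough way to see the mismatch problem is the toy Java sketch below. The two
splitting rules are stand-ins invented for this example, not Lucene analyzers,
and AnalyzerMismatchSketch, tokens and matches are hypothetical names. The
"match" is simplified to checking that every query token occurs among the
indexed tokens, ignoring word order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy tokenizers, not Lucene analyzers; they only illustrate the mismatch.
final class AnalyzerMismatchSketch {
    static final String SPLIT_ON_PUNCTUATION = "[^a-z0-9]+"; // hyphen separates tokens
    static final String SPLIT_ON_WHITESPACE = "\\s+";        // hyphen stays in the token

    // Lower-cases the text and splits it with the given delimiter rule.
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split(delimiterRegex)) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Simplified match: every query token must occur among the indexed tokens
    // (word order is ignored, unlike a real phrase query).
    static boolean matches(String indexed, String indexRule, String query, String queryRule) {
        return tokens(indexed, indexRule).containsAll(tokens(query, queryRule));
    }

    public static void main(String[] args) {
        System.out.println(matches("blue t-shirt", SPLIT_ON_PUNCTUATION, "t-shirt", SPLIT_ON_PUNCTUATION));
        System.out.println(matches("blue t-shirt", SPLIT_ON_PUNCTUATION, "t-shirt", SPLIT_ON_WHITESPACE));
    }
}
```

When both sides use the same rule, the query for "t-shirt" finds the indexed
text. When the index splits on the hyphen but the query side keeps it, the
query token "t-shirt" has nothing to match.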

-+ Tatu +-


Re: '-' character not interpreted correctly in field names

2003-02-03 Thread Terry Steichen

I believe that the tokenizer treats a dash as a token separator.  Hence, the
only way, as I recall, to eliminate this behavior is to modify
QueryParser.jj so it doesn't do this.  However, doing this can cause some
other problems, like hyphenated words at a line break and the like.

(Of course, if you do make such a change, you'll have to go back and reindex
afterwards.)

I've run into this problem myself and I've 'punted': on certain fields,
when I index, I replace the dash with an underscore.  This isn't a really good
solution, and it requires me to remember which fields need the same
substitution at search time.  But for the moment it works.  I'll
probably go back and make some kind of change later, when I have more time.
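
One way to avoid having to remember the substitution is to route both the
indexing code and the search code through a single helper. The Java sketch
below assumes exactly that; DashFieldSketch, normalize and the field names
"model" and "partNumber" are made up for illustration and are not part of
Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; the field names below are made up for illustration.
final class DashFieldSketch {
    // Fields whose values get "-" replaced by "_" (assumed example names).
    private static final Set<String> DASH_FIELDS =
            new HashSet<>(Arrays.asList("model", "partNumber"));

    // Apply the same substitution when indexing a value and when building a query.
    static String normalize(String field, String value) {
        return DASH_FIELDS.contains(field) ? value.replace('-', '_') : value;
    }

    public static void main(String[] args) {
        System.out.println(normalize("model", "t-shirt"));
        System.out.println(normalize("title", "t-shirt"));
    }
}
```

Values in the listed fields come back with underscores, and values in any
other field are returned unchanged.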

HTH,

Terry

- Original Message -
From: "hermit" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, February 03, 2003 2:39 AM
Subject: '-' character not interpreted correctly in field names


> Hello!
>
> I have a problem, a big one. I have successfully indexed 600 MB of XML
> data, but the search returns no results if the field name contains any
> '-' characters.
> For example: compound@cgx-code:[2 - 5] must match at least two results
> based on my XML data, but it returns nothing.
>
> Can you advise me of a simple solution? Or is it a bug?
>
> The Hermit
>
>

'-' character not interpreted correctly in field names

2003-02-02 Thread hermit

Hello!

I have a problem, a big one. I have successfully indexed 600 MB of XML 
data, but the search returns no results if the field name contains any 
'-' characters.
For example: compound@cgx-code:[2 - 5] must match at least two results 
based on my XML data, but it returns nothing.

Can you advise me of a simple solution? Or is it a bug?

   The Hermit

