Re: '-' character not interpreted correctly in field names (solution)
Okay, attached is the diff file that allows t-shirt to be interpreted as "t-shirt". Queries that start with a "-" character behave as expected, well, at least as we expected. For example:

-shirt +pants

is parsed as

-shirt +pants

One thing I need to mention (I dug this up from an earlier discussion on this list) is that Doug Cutting said the following about a similar change someone else proposed:

< cut --->
Lixin Meng wrote:
> Therefore, it would be preferable to treat all hyphen in the same way.
> Either as a delimiter or as part of the word (maybe with a flag at the API).

If we change StandardTokenizer in this way then we risk breaking all the applications that currently use it and depend on its current behaviour. So I'm reluctant to make this change.

From the StandardTokenizer documentation:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

"Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer."

Also, if you construct a tokenizer that you think is more generally useful than StandardTokenizer, please contribute it by mailing it to one of the Lucene mailing lists.

Thanks,

Doug
< cut >

So yes, this change _may_ break other existing applications.

cheers,

victor

On Thu, 10 Jul 2003 08:34 pm, Otis Gospodnetic wrote:
> I think this is a fine change, that others would welcome, too.
> No?
> Does your change work with queries that start with a '-' character?
> For example: -shirt +pants
> (note: no space before '-shirt')
>
> If so, I think we could include this change in QueryParser.jj if you
> send the diff, as I recall others wondering why queries like t-shirt
> get misinterpreted as +t -shirt.
>
> Thanks,
> Otis

Index: src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj
===
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj,v
retrieving revision 1.3
diff -u -u -r1.3 StandardTokenizer.jj
--- src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj 5 Jun 2002 04:54:47 - 1.3
+++ src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj 11 Jul 2003 00:36:15 -
@@ -95,6 +95,9 @@
   // use a post-filter to remove possesives
 | ("'" )+ >
 
+  // Include search string such as t-shirt, x-mailer or mail-client.
+| ("-" )+ >
+
   // acronyms: U.S.A., I.B.M., etc.
   // use a post-filter to remove dots
 | "." ( ".")+ >
@@ -177,6 +180,7 @@
 {
   ( token = |
     token = |
+    token = |
     token = |
     token = |
     token = |
Index: src/java/org/apache/lucene/queryParser/QueryParser.jj
===
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/queryParser/QueryParser.jj,v
retrieving revision 1.29
diff -u -u -r1.29 QueryParser.jj
--- src/java/org/apache/lucene/queryParser/QueryParser.jj 20 Mar 2003 18:28:13 - 1.29
+++ src/java/org/apache/lucene/queryParser/QueryParser.jj 11 Jul 2003 00:36:15 -
@@ -439,7 +439,7 @@
 | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
                        | <_ESCAPED_CHAR> ) >
-| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
+| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >
 | <#_WHITESPACE: ( " " | "\t" ) >
 }
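For anyone who wants to see the effect of the _TERM_CHAR change without regenerating the parser, here is a rough sketch in plain Java. It is not the JavaCC-generated QueryParser and does not use any Lucene class; the class name HyphenTermSketch and the method clauses are made up for illustration, and field syntax (name:value), escapes and phrases are not modelled. It only shows the rule the patch introduces: "-" may continue a term but may not start one, so a leading "-" is still read as the NOT modifier.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the term rule; NOT Lucene's generated parser. */
final class HyphenTermSketch {
    // Characters the grammar excludes from the start of a term.
    private static final String SPECIAL = " \t+-!():^[]\"{}~*?";

    /**
     * Splits a query into clauses. With hyphenInTerm == true, '-' may continue
     * a term (the patched _TERM_CHAR); with false it always ends the term.
     */
    static List<String> clauses(String query, boolean hyphenInTerm) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < query.length()) {
            char c = query.charAt(i);
            if (c == ' ' || c == '\t') { i++; continue; }
            StringBuilder clause = new StringBuilder();
            // A leading '+' or '-' is kept as the clause modifier.
            if (c == '+' || c == '-') { clause.append(c); i++; }
            int start = i;
            while (i < query.length()) {
                char t = query.charAt(i);
                boolean termChar = SPECIAL.indexOf(t) < 0
                        || (t == '-' && hyphenInTerm && i > start);
                if (!termChar) break;
                clause.append(t);
                i++;
            }
            // Skip any character this sketch does not model (':' and friends).
            if (i == start) { i++; continue; }
            out.add(clause.toString());
        }
        return out;
    }
}
```

Calling HyphenTermSketch.clauses("t-shirt", false) splits the input into the two clauses t and -shirt, which mirrors the old behaviour, while passing true keeps t-shirt as a single term. A query such as "t -shirt" still yields a separate -shirt clause in both modes because the space ends the first term.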
Re: '-' character not interpreted correctly in field names (solution)
+1

----- Original Message -----
From: "Eric Jain" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, July 10, 2003 12:53 PM
Subject: Re: '-' character not interpreted correctly in field names (solution)

> > I think this is a fine change, that others would welcome, too.
> > No?
> > Does your change work with queries that start with a '-' character?
> > For example: -shirt +pants
> > (note: no space before '-shirt')
> >
> > If so, I think we could include this change in QueryParser.jj if you
> > send the diff, as I recall others wondering why queries like t-shirt
> > get misinterpreted as +t -shirt.
>
> +1
>
> --
> Eric Jain

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: '-' character not interpreted correctly in field names (solution)
> I think this is a fine change, that others would welcome, too.
> No?
> Does your change work with queries that start with a '-' character?
> For example: -shirt +pants
> (note: no space before '-shirt')
>
> If so, I think we could include this change in QueryParser.jj if you
> send the diff, as I recall others wondering why queries like t-shirt
> get misinterpreted as +t -shirt.

+1

--
Eric Jain
Re: '-' character not interpreted correctly in field names (solution)
I think this is a fine change, that others would welcome, too.
No?
Does your change work with queries that start with a '-' character?
For example: -shirt +pants
(note: no space before '-shirt')

If so, I think we could include this change in QueryParser.jj if you send the diff, as I recall others wondering why queries like t-shirt get misinterpreted as +t -shirt.

Thanks,
Otis
Re: '-' character not interpreted correctly in field names (solution)
Eric and others,

I finally found a solution for this problem, although it is really specific to our needs.

The simplest solution in the end is redefining what a "Term" is. At the moment, QueryParser will parse the following:

t-shirt

as

+t -shirt

which, in my opinion, is not really acceptable. A more sensible parsing would parse "t-shirt" as "t-shirt". If a user wants to query for "t" without the word "shirt" in it, then the query should really be:

t -shirt
 ^ space here.

Similarly, a field query such as:

model:t-shirt

should really be interpreted as "model:t-shirt", not +model:t -shirt. I think it really makes more sense to require a space before the "-" to identify a NOT query.

Onward to the code change. As I have said earlier, it is specific to our application's use and thus may not be relevant to most other people. Some of our field names have the "-" sign in them. Changing the TERM_CHAR definition to:

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >

makes QueryParser compatible with our needs.

Cheers,

Victor
Re: '-' character not interpreted correctly in field names
Yep, tried that. Actually there is more to the creation of the field than just this line:

fieldToken= { field = fieldToken.image; }

Because I've created a FIELDNAME token which is exactly the same as the existing term token, which looks like this:

| (<_TERM_CHAR>)* >

and changed fieldToken to:

fieldToken= { field = fieldToken.image; }

And it doesn't work. A simple query such as to:tom* is parsed as a blank query.

I will continue looking at this problem and will post my solution if I get it. In the meantime I really do appreciate any help and suggestions.

cheers,

victor
RE: '-' character not interpreted correctly in field names
You left out the ~ character in your _FIELDNAME_START_CHAR production. That character tells the grammar that it should take all the characters except the ones you specified (the complement).

Change:
| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",

To:
| <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",

and it should probably work.

Eric
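To make the point about ~ concrete, here is a small sketch in plain Java rather than JavaCC. It only imitates the two kinds of character set: a list [ ... ] accepts the listed characters, and a complement ~[ ... ] accepts everything else. The class ComplementSketch and its methods are invented for this illustration and are not part of Lucene or JavaCC.

```java
/** Hypothetical illustration of a JavaCC character list versus its complement. */
final class ComplementSketch {
    // The characters listed in the start-character production.
    private static final String LISTED = " \t+-!():^[]\"{}~*?";

    /** Like the set [ ... ]: accepts only the listed characters. */
    static boolean inList(char c) {
        return LISTED.indexOf(c) >= 0;
    }

    /** Like the complement ~[ ... ]: accepts everything except the listed characters. */
    static boolean inComplement(char c) {
        return !inList(c);
    }

    /** Length of the longest prefix whose characters all belong to the chosen set. */
    static int prefixLength(String s, boolean complement) {
        int n = 0;
        while (n < s.length()
                && (complement ? inComplement(s.charAt(n)) : inList(s.charAt(n)))) {
            n++;
        }
        return n;
    }
}
```

Without the complement, prefixLength("fieldname", false) is 0, because no ordinary letter is in the list, so a token built on that set can never match a normal field name. With the complement the whole word matches, and "field-name" stops after "field" since "-" is still excluded, which is consistent with the +field -name result reported in this thread.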
Re: '-' character not interpreted correctly in field names
Hi Erik and others,

I'm looking for a similar solution where I need QueryParser not to drop the "-" characters from the field name. However, outside the field I do want the - sign interpreted as the "not" modifier.

I'm definitely not an expert in JavaCC and to be honest I only have a limited idea of how Erik's suggestion works.

Anyway, I followed the suggestion and added the following:

| <#_WHITESPACE: ( " " | "\t" ) >
| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                             "[", "]", "\"", "{", "}", "~", "*", "?" ]
                           | <_ESCAPED_CHAR> ) >
| <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >

and again below I added:

| (<_TERM_CHAR>)* >
| (<_FIELDNAME_CHAR>)* >

And I changed:

LOOKAHEAD(2)
fieldToken= { field = fieldToken.image; }

to:

...
LOOKAHEAD(2)
fieldToken= { field = fieldToken.image; }

Well, after making all these mods, every query that involves field names causes problems. For example, if I search for

fieldname:hello

the query is blank (yes blank, nothing in it),

and if the fieldname does contain a dash ("-"), for example:

field-name:hello

the query is: +field -name

hello is dropped.

Does anyone have any idea? Help and suggestions will be much appreciated. I really need to get this dash working; changing the field name will be my last resort, which I won't explore until I really have to.

Thanks,

Victor

On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> I think the query parser changes would not be too bad, I've outlined a
> couple of relevant lines you should look at so you don't have to try and
> comprehend the productions for the entire QueryParser. I do not think I
> would like to have to maintain one of those myself though. Your other
> unmentioned alternative is to choose field names that match the
> production of QueryParser.jj without escapes.
>
> QueryParser.jj line 557:
> fieldToken= { field = fieldToken.image; }
>
> and earlier...
> <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
>                         "[", "]", "\"", "{", "}", "~", "*", "?" ] >
> | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
>                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
>                        | <_ESCAPED_CHAR> ) >
> | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
> ...
> (<_TERM_CHAR>)* >
>
> So the characters you need to avoid in your field names are the ones from
> _ESCAPED_CHAR, [ "\\", "+", "-", "!", "(", ")", ":", "^", "[", "]", "\"",
> "{", "}", "~", "*", "?" ]
>
> If you need to modify the parser, you will probably want to add a FIELDNAME
> token and other supporting productions that look really similar to these
> lines I've copied but modify the complement, ~[...], at the beginning of
> _FIELDNAME_START_CHAR (you would add this production) so it will match the
> "-" that you are using in your field names (and fix it to match any other
> characters you want to use in field names that it doesn't allow right now).
>
> Eric
>
> -----Original Message-----
> From: Jon Pipitone [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 14, 2003 2:26 PM
> To: Lucene Users List
> Subject: Re: '-' character not interpreted correctly in field names
>
> Eric Isakson wrote:
> > I just looked at the QueryParser.jj code, your field names
> > never get processed by the analyzer. It does look like the
> > query parser will honor escapes though. I haven't tried
> > this, but try a query like "foo\-bar:foo" and have
> > a look at the QueryParser.jj file for how it handles field
> > names when parsing your query.
>
> Hrm.. that's what I had found too. So, you're saying that, other than
> escaping dashes, I'd have to change QueryParser.. ?
>
> I'm not too familiar ju
Re: '-' character not interpreted correctly in field names
On Monday 03 February 2003 07:19, Terry Steichen wrote:
> I believe that the tokenizer treats a dash as a token separator. Hence,
> the only way, as I recall, to eliminate this behavior is to modify
> QueryParser.jj so it doesn't do this. However, doing this can cause some
> other problems, like hyphenated words at a line break and the like.

It might be enough to just replace the analyzer passed in to QueryParser to do this? This is the case if QueryParser only handles modifiers outside terms, and terms are passed to the analyzer. I think this is the case (QueryParser does call the analyzer in a couple of places, and one word may actually expand to a phrase or vice versa)?

Still, it seems like using a hyphen as separator shouldn't necessarily cause big problems when the indexer does the same; queries against "2 - 5" would be phrase queries for "2 5", which is still reasonably specific (and should match the content).

On the other hand, the simple analyzer and the standard analyzer have pretty different tokenization rules, so it's important to make sure the same analyzer is used for both indexing and searching (that mismatch can prevent matches easily).

-+ Tatu +-
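As a rough illustration of the last two paragraphs, the sketch below uses two made-up tokenizers in plain Java. Neither is Lucene's SimpleAnalyzer or StandardAnalyzer; the class AnalyzerMismatchSketch and its methods are assumptions for this example only. It shows that "2 - 5" still matches as the phrase "2 5" when both sides drop the hyphen, and stops matching when the query side is tokenized differently from the index side.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical tokenizers; they only imitate two different analysis rules. */
final class AnalyzerMismatchSketch {
    /** Rule A: letters and digits form tokens, everything else separates. */
    static List<String> splitOnNonAlnum(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Rule B: only whitespace separates, so "-" survives as its own token. */
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** True if the query tokens occur as a contiguous run in the indexed tokens. */
    static boolean phraseMatches(List<String> indexed, List<String> query) {
        return !query.isEmpty() && Collections.indexOfSubList(indexed, query) >= 0;
    }
}
```

With rule A on both sides, the indexed text "range 2 - 5 units" and the query "2 - 5" both lose the hyphen and the phrase matches. If the query is tokenized with rule B instead, it keeps the "-" token and no longer lines up with the indexed tokens.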
Re: '-' character not interpreted correctly in field names
I believe that the tokenizer treats a dash as a token separator. Hence, the only way, as I recall, to eliminate this behavior is to modify QueryParser.jj so it doesn't do this. However, doing this can cause some other problems, like hyphenated words at a line break and the like. (Of course, if you do make such a change, you'll have to go back and reindex afterwards.)

I've run into this problem myself and I've 'punted': on certain fields, when I index, I replace the dash with an underscore. This isn't a real good solution, and it does require me to keep remembering in which fields I have to do this substitution in the search. But, for the moment it works. I'll probably go back and make some kind of change later, when I have more time.

HTH,

Terry

----- Original Message -----
From: "hermit" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, February 03, 2003 2:39 AM
Subject: '-' character not interpreted correctly in field names

> Hello!
>
> I have a problem, a big one. I have successfully indexed 600 MB of XML
> data, but the search can't give any results if the field contains any
> '-' characters.
> For example: compound@cgx-code:[2 - 5] must match at least two results
> based on my XML data but it gives nothing.
>
> Can you advice me a simple solution? Or is it a bug?
>
> The Hermit
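One way to avoid having to remember which fields need the substitution is to keep the list in a single helper and call it from both the indexing code and the search code. The following is only a sketch of that idea in plain Java; the class DashFieldNormalizer and the field name "partno" used below are hypothetical, and whether an underscore survives tokenization depends on the analyzer in use.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper that applies the dash-to-underscore workaround in one place. */
final class DashFieldNormalizer {
    private final Set<String> dashFields;

    /** @param fieldsWithDashes names of the fields whose values need the substitution */
    DashFieldNormalizer(String... fieldsWithDashes) {
        this.dashFields = new HashSet<>(Arrays.asList(fieldsWithDashes));
    }

    /** Call with the same arguments at index time and at search time. */
    String normalize(String field, String value) {
        return dashFields.contains(field) ? value.replace('-', '_') : value;
    }
}
```

For example, new DashFieldNormalizer("partno").normalize("partno", "ab-12") turns the value into ab_12, while a value in any other field is returned unchanged.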
'-' character not interpreted correctly in field names
Hello!

I have a problem, a big one. I have successfully indexed 600 MB of XML data, but the search can't give any results if the field contains any '-' characters.
For example: compound@cgx-code:[2 - 5] must match at least two results based on my XML data but it gives nothing.

Can you advise me of a simple solution? Or is it a bug?

The Hermit