[jira] [Commented] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909145#comment-16909145 ]

Itamar Syn-Hershko commented on LUCENE-8565:
--------------------------------------------

Heya - is this waiting on anything in particular that I can help finalize? I would really like to see this merged in. Thanks.

> SimpleQueryParser to support field filtering (aka Add field:text operator)
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-8565
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8565
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/queryparser
>            Reporter: Itamar Syn-Hershko
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> SimpleQueryParser lacks support for the `field:` operator for creating
> queries that operate on fields other than the default field. It seems one
> can either have the parsed query operate on a single field, or on ALL
> defined fields (+ weights); there is no support for specifying `field:value`
> in the query.
> It probably wasn't forgotten, but rather left out for simplicity. Since we
> are using this QP implementation more and more (mostly through
> Elasticsearch), we thought it would be useful to have it in.
> This seems not too hard to pull off, and I'll be happy to contribute a
> patch for it.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
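[Editorial note: a minimal sketch of the semantics the proposed `field:value` operator would add. This is not the actual patch or SimpleQueryParser code; the class and method names below are hypothetical, and the sketch only shows how a clause with a field prefix would be routed to a named field rather than the default one.]

```java
import java.util.Map;

// Hypothetical illustration (not the LUCENE-8565 patch itself): a clause like
// "title:lucene" is routed to the named field; a clause with no prefix falls
// back to the parser's default field.
public class FieldClauseSketch {

    // Splits a single clause into a (field, term) pair.
    public static Map.Entry<String, String> parseClause(String clause, String defaultField) {
        int colon = clause.indexOf(':');
        if (colon > 0 && colon < clause.length() - 1) {
            return Map.entry(clause.substring(0, colon), clause.substring(colon + 1));
        }
        // No usable field prefix: query the default field.
        return Map.entry(defaultField, clause);
    }

    public static void main(String[] args) {
        System.out.println(parseClause("title:lucene", "body")); // title=lucene
        System.out.println(parseClause("lucene", "body"));       // body=lucene
    }
}
```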
[jira] [Commented] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771783#comment-16771783 ]

Itamar Syn-Hershko commented on LUCENE-8565:
--------------------------------------------

I'm not sure what the Lucene versioning policy on that would be, but we can always change the default flag to turn field filtering support off.
[jira] [Updated] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Itamar Syn-Hershko updated LUCENE-8565:
---------------------------------------
    Summary: SimpleQueryParser to support field filtering (aka Add field:text operator)  (was: SimpleQueryString to support field filtering (aka Add field:text operator))
[jira] [Updated] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Itamar Syn-Hershko updated LUCENE-8565:
---------------------------------------
    Description:
SimpleQueryParser lacks support for the `field:` operator for creating queries that operate on fields other than the default field. It seems one can either have the parsed query operate on a single field, or on ALL defined fields (+ weights); there is no support for specifying `field:value` in the query.
It probably wasn't forgotten, but rather left out for simplicity. Since we are using this QP implementation more and more (mostly through Elasticsearch), we thought it would be useful to have it in.
This seems not too hard to pull off, and I'll be happy to contribute a patch for it.

  was: (the same text, with "SimpleQueryString" in place of "SimpleQueryParser")
[jira] [Commented] (LUCENE-8565) SimpleQueryString to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686301#comment-16686301 ]

Itamar Syn-Hershko commented on LUCENE-8565:
--------------------------------------------

PR submitted on GitHub: https://github.com/apache/lucene-solr/pull/498 - reviews appreciated.
[jira] [Updated] (LUCENE-8565) SimpleQueryString to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Itamar Syn-Hershko updated LUCENE-8565:
---------------------------------------
    Description:
SimpleQueryString lacks support for the `field:` operator for creating queries that operate on fields other than the default field. It seems one can either have the parsed query operate on a single field, or on ALL defined fields (+ weights); there is no support for specifying `field:value` in the query.
It probably wasn't forgotten, but rather left out for simplicity. Since we are using this QP implementation more and more (mostly through Elasticsearch), we thought it would be useful to have it in.
This seems not too hard to pull off, and I'll be happy to contribute a patch for it.

  was: (the same text, except the second paragraph ended abruptly after "we thought it would be")
[jira] [Created] (LUCENE-8565) SimpleQueryString to support field filtering (aka Add field:text operator)
Itamar Syn-Hershko created LUCENE-8565:
------------------------------------------

             Summary: SimpleQueryString to support field filtering (aka Add field:text operator)
                 Key: LUCENE-8565
                 URL: https://issues.apache.org/jira/browse/LUCENE-8565
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/queryparser
            Reporter: Itamar Syn-Hershko


SimpleQueryString lacks support for the `field:` operator for creating queries which operate on fields other than the default field. Seems like one can either get the parsed query to operate on a single field, or on ALL defined fields (+ weights). No support for specifying `field:value` in the query.

It probably wasn't forgotten, but rather left out for simplicity, but since we are using this QP implementation more and more (mostly through Elasticsearch) we thought it would be

Seems like this is not too hard to pull off and I'll be happy to contribute a patch for it.
[jira] [Commented] (LUCENE-6302) Adding Date Math support to Lucene Expressions module
[ https://issues.apache.org/jira/browse/LUCENE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338677#comment-14338677 ]

Itamar Syn-Hershko commented on LUCENE-6302:
--------------------------------------------

Sent a PR for the latter: https://github.com/apache/lucene-solr/pull/129

> Adding Date Math support to Lucene Expressions module
> -----------------------------------------------------
>
>                 Key: LUCENE-6302
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6302
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/expressions
>    Affects Versions: 4.10.3
>            Reporter: Itamar Syn-Hershko
>
> Lucene Expressions are great, but they don't allow for date math. More
> specifically, they don't allow inferring date parts from a numeric
> representation of a date stamp, nor do they allow parsing string
> representations into dates.
> Some of the features requested here are easy to implement via a ValueSource
> implementation (and potentially minor changes to the lexer definition);
> some are more involved. I'll be happy if we could get half of those in, and
> will be happy to work on a PR for the parts we can agree on.
> The items we would be happy to have:
> - A now() function (with or without TZ support) returning the current
> date/time as a numeric long, which we could use against indexed datetime
> fields (which are in fact numerics)
> - Parsing methods - to allow expressing datetimes as strings, and/or
> reading them from stored fields and parsing them from there. Parse errors
> would yield a value of zero.
> - Given a numeric value, allow specifying that it is a date value and then
> inferring date parts - e.g. Date(1424963520).Year == 2015, or Date(now()) -
> Date(1424963520).Year. Basically, methods which return numerics but
> internally create and use Date objects.
[jira] [Commented] (LUCENE-6302) Adding Date Math support to Lucene Expressions module
[ https://issues.apache.org/jira/browse/LUCENE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338563#comment-14338563 ]

Itamar Syn-Hershko commented on LUCENE-6302:
--------------------------------------------

I actually expected the main objection would be to adding date parsing methods :) Maybe it would make sense to explain the use cases this is trying to solve. We are using Elasticsearch & Kibana, and since the latest version switched to Lucene Expressions (from Groovy), we found ourselves blocked in what we can do with Kibana's scripted fields.

For example, given a user's DOB, how can we aggregate on their age? Or compute how many years (or days) have passed between two given dates? Yes, we can subtract the epochs (except that it doesn't seem to work: https://github.com/elasticsearch/elasticsearch/issues/9884), but translating the result into days, hours or years is even uglier using an expression.

I think introducing ValueSources to do this should be enough, but if changing the lexer is the preferred way, I can go and do that as well. With regards to syntax - I'm not locked on any particular syntax. Either way, adding a now() function seems like the easiest change, and I can send a PR with that change alone to start with.
[jira] [Created] (LUCENE-6302) Adding Date Math support to Lucene Expressions module
Itamar Syn-Hershko created LUCENE-6302:
------------------------------------------

             Summary: Adding Date Math support to Lucene Expressions module
                 Key: LUCENE-6302
                 URL: https://issues.apache.org/jira/browse/LUCENE-6302
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/expressions
    Affects Versions: 4.10.3
            Reporter: Itamar Syn-Hershko


Lucene Expressions are great, but they don't allow for date math. More specifically, they don't allow inferring date parts from a numeric representation of a date stamp, nor do they allow parsing string representations into dates.

Some of the features requested here are easy to implement via a ValueSource implementation (and potentially minor changes to the lexer definition); some are more involved. I'll be happy if we could get half of those in, and will be happy to work on a PR for the parts we can agree on.

The items we would be happy to have:
- A now() function (with or without TZ support) returning the current date/time as a numeric long, which we could use against indexed datetime fields (which are in fact numerics)
- Parsing methods - to allow expressing datetimes as strings, and/or reading them from stored fields and parsing them from there. Parse errors would yield a value of zero.
- Given a numeric value, allow specifying that it is a date value and then inferring date parts - e.g. Date(1424963520).Year == 2015, or Date(now()) - Date(1424963520).Year. Basically, methods which return numerics but internally create and use Date objects.
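[Editorial note: the computations the requested expression functions would perform can be sketched with plain java.time, independent of the Expressions lexer. The `now()`/`yearOf()` names below mirror the proposed `now()` and `Date(x).Year` from the issue but are hypothetical, not part of any Lucene API.]

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Plain-Java illustration of the proposed date-math primitives.
public class DateMathSketch {

    // Equivalent of the proposed now(): current time as epoch milliseconds,
    // comparable against indexed numeric datetime fields.
    public static long now() {
        return Instant.now().toEpochMilli();
    }

    // Equivalent of the proposed Date(epochSeconds).Year: expose a date part
    // of a numeric value as another numeric.
    public static int yearOf(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        // The example from the issue: Date(1424963520).Year == 2015
        System.out.println(yearOf(1424963520L)); // 2015
    }
}
```

Note that the issue's example epoch (1424963520 seconds) indeed falls in February 2015, so the `Date(1424963520).Year == 2015` claim in the description checks out.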
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241306#comment-14241306 ]

Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

Sent them a request. I'll buy Robert beers if that helps push this forward!

> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
>                 Key: LUCENE-6103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6103
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
>
> StandardTokenizer (and, as a result, most default analyzers) will not
> tokenize word:word, preserving it as one token. This can easily be seen
> using Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic
> behind it.
> If not, I'll be happy to join the effort of fixing this.
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241214#comment-14241214 ]

Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

Maybe out of scope for this ticket, but how do we go about #2? I'll be happy to take this discussion offline as well.
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240392#comment-14240392 ]

Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

0. You mean it implements UAX#29 version 6.3 :)

1. I'll likely be sending a PR for #1 sometime soon. Would you recommend using UAX#29 minus specific non-English tweaks, falling back to ClassicStandardTokenizer which is English-specific, or something else?

2. Here's the thing: the standard is wrong, or buggy. Ask any Swede and they will tell you, and any non-Swedish corpus wouldn't care. Basically, this is a bug in every Lucene-based system today because of the word:word scenario; it's a bit of an edge case, but I bet I can find multiple occurrences in every big enough system. What can we do about that?

We already solved this using char filters, converting colons to commas. It feels a bit hacky though, and again - this _is_ a flaw in Lucene's analysis even though it conforms to a Unicode standard.
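[Editorial note: a sketch of the char-filter workaround mentioned in the comment above - rewriting colons before tokenization so the standard grammar never sees "word:word" as a single token. In Lucene this kind of pre-tokenization rewrite is typically done with a char filter such as MappingCharFilter; the plain-Java class below only illustrates the bare transformation and is not actual analyzer code.]

```java
// Illustration of the "colon to comma" workaround: a comma is a tokenizing
// character for StandardTokenizer, so after this rewrite the two words end
// up as separate tokens.
public class ColonMapSketch {

    public static String mapColons(String input) {
        return input.replace(':', ',');
    }

    public static void main(String[] args) {
        System.out.println(mapColons("word:word")); // word,word
    }
}
```

As the comment notes, this is a blunt instrument: it also rewrites colons inside tokens where keeping them might be desired (e.g. the Swedish "c:a" discussed below in this thread).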
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240133#comment-14240133 ]

Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

Ok, so I did some homework. In Swedish, the colon is used as a way to shorten the writing of words. So "c:a" is in fact "cirka", which means "approximately". I guess it can be thought of as similar to English acronyms, only apparently it's far less commonly used in Swedish (my source says "very very seldomly used; old style and not used in modern Swedish at all"). Not only is it hardly used, apparently it's only legal in 3-letter combinations (c:a but not c:ka). Also, the effects it has are quite severe at the moment - two words with a colon in between and no space will be output as one token even when it's 100% not applicable to Swedish, since each word has more than 2 characters.

I'm not aiming at changing the Unicode standard, that's way beyond my limited powers, but:

1. Given the above, does it really make sense to use this tokenizer in all language-specific analyzers as well? E.g. https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L105 - I'd think that for language-specific analyzers we'd want tokenizers aimed at that language, with limited support for others. So, in this case, the colon would always be considered a tokenizing char.

2. Can we change the jflex definition to at least limit the effects of this, e.g. only support colon as MidLetter if the overall token length == 3, so that c:a is a valid token and word:word is not?
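[Editorial note: a sketch of the rule proposed in point 2 above - treat a colon as a token-internal character only for three-character forms like the Swedish "c:a", and as a break otherwise. This illustrates the proposal's behavior in plain Java; it is not the jflex grammar change itself, and `ColonRuleSketch` is a hypothetical name.]

```java
import java.util.ArrayList;
import java.util.List;

// Proposed rule: colon counts as MidLetter only when the whole token is
// exactly three characters long (letter, colon, letter), as in "c:a".
public class ColonRuleSketch {

    public static List<String> tokenize(String token) {
        List<String> out = new ArrayList<>();
        int colon = token.indexOf(':');
        if (colon >= 0 && token.length() != 3) {
            // word:word -> two separate tokens
            out.add(token.substring(0, colon));
            out.add(token.substring(colon + 1));
        } else {
            // c:a (or a colon-free token) stays intact
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("word:word")); // [word, word]
        System.out.println(tokenize("c:a"));       // [c:a]
    }
}
```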
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240090#comment-14240090 ]

Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

Good stuff, thanks Steve. I'm going to see if the rest of the UAX is good for us, and if so, see if I can comply with the 6.2.5 version of the specs. It's a good thing StandardTokenizer is no longer English-centric, but I cannot imagine what use the colon has, especially since in most cases it is not "something reasonable" :)
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239784#comment-14239784 ]

Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

Yes, I figured it would come down to some Unicode rules. Can you give a rationale for this, mainly out of curiosity? I'm not a Unicode expert, but I'd assume that just as you wouldn't want English words to not break on the Hebrew Punctuation Gershayim (e.g. Test"Word is actually 2 tokens, while מנכ"לים is one), maybe this rule is meant for specific scenarios and not for the general use case?

On another note, any type of Gershayim should be preserved within Hebrew words, not only U+05F4. This is mainly because the keyboards and editors in use produce the standard " character in most cases. I had a chat with Robert a while back where he said that's the case; I'm just making sure you didn't follow the specs to the letter in that regard...
[jira] [Commented] (LUCENE-5723) Performance improvements for FastCharStream
[ https://issues.apache.org/jira/browse/LUCENE-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239728#comment-14239728 ]

Itamar Syn-Hershko commented on LUCENE-5723:
--------------------------------------------

Reported as https://java.net/jira/browse/JAVACC-285

> Performance improvements for FastCharStream
> -------------------------------------------
>
>                 Key: LUCENE-5723
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5723
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/queryparser
>            Reporter: Itamar Syn-Hershko
>            Priority: Minor
>
> Hello from the .NET land,
> A user of ours has identified an optimization opportunity, and although
> minor, I think it points to a valid principle - we should avoid using
> exceptions for controlling flow when possible.
> Here are the original ticket + commits to our codebase. If this looks
> valid to you too, I can go ahead and prepare a PR.
> https://issues.apache.org/jira/browse/LUCENENET-541
> https://github.com/apache/lucene.net/commit/ac8c9fa809110ddb180bf7b2ce93e86270b39ff6
> https://git-wip-us.apache.org/repos/asf?p=lucenenet.git;a=blobdiff;f=src/core/QueryParser/QueryParserTokenManager.cs;h=ec09c8e451f7a7d1572fbdce4c7598e362526a7c;hp=17583d20f660fdb6e4aa86105c7574383f965ebe;hb=41ebbc2d;hpb=ac8c9fa809110ddb180bf7b2ce93e86270b39ff6
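[Editorial note: a sketch of the "don't use exceptions for control flow" idea the issue advocates. FastCharStream historically signals end-of-input by throwing, which is costly when hit on every parse; the alternative is a cheap sentinel check, as `java.io.Reader.read()` does with -1. The class below is illustrative, not the actual FastCharStream code.]

```java
// Exception-free end-of-input handling: return a sentinel (-1) instead of
// throwing and catching an exception on every EOF, since constructing and
// unwinding exceptions is expensive on a hot parsing path.
public class ReadCharSketch {

    private final char[] buffer;
    private int pos = 0;

    public ReadCharSketch(String input) {
        this.buffer = input.toCharArray();
    }

    // Returns the next char as an int, or -1 at end of input.
    public int readChar() {
        if (pos >= buffer.length) {
            return -1; // cheap bounds check instead of throw/catch
        }
        return buffer[pos++];
    }

    public static void main(String[] args) {
        ReadCharSketch s = new ReadCharSketch("ab");
        System.out.println(s.readChar()); // 97 ('a')
        System.out.println(s.readChar()); // 98 ('b')
        System.out.println(s.readChar()); // -1 (end of input)
    }
}
```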
[jira] [Commented] (LUCENE-5997) StandardFilter redundant
[ https://issues.apache.org/jira/browse/LUCENE-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239697#comment-14239697 ]

Itamar Syn-Hershko commented on LUCENE-5997:
--------------------------------------------

Sounds good!

> StandardFilter redundant
> ------------------------
>
>                 Key: LUCENE-5997
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5997
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 4.10.1
>            Reporter: Itamar Syn-Hershko
>            Priority: Trivial
>
> Any reason why StandardFilter is still around? It's just a no-op class now:
>
>   @Override
>   public final boolean incrementToken() throws IOException {
>     return input.incrementToken(); // TODO: add some niceties for the new grammar
>   }
>
> https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardFilter.java
[jira] [Updated] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Itamar Syn-Hershko updated LUCENE-6103: --- Summary: StandardTokenizer doesn't tokenize word:word (was: StandardTokenizer doesn't tokenizer word:word) > StandardTokenizer doesn't tokenize word:word > > > Key: LUCENE-6103 > URL: https://issues.apache.org/jira/browse/LUCENE-6103 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 4.9 >Reporter: Itamar Syn-Hershko > > StandardTokenizer (and by result most default analyzers) will not tokenize > word:word and will preserve it as one token. This can be easily seen using > Elasticsearch's analyze API: > localhost:9200/_analyze?tokenizer=standard&text=word%20word:word > If this is the intended behavior, then why? I can't really see the logic > behind it. > If not, I'll be happy to join in the effort of fixing this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-6103) StandardTokenizer doesn't tokenizer word:word
Itamar Syn-Hershko created LUCENE-6103: -- Summary: StandardTokenizer doesn't tokenizer word:word Key: LUCENE-6103 URL: https://issues.apache.org/jira/browse/LUCENE-6103 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Affects Versions: 4.9 Reporter: Itamar Syn-Hershko StandardTokenizer (and by result most default analyzers) will not tokenize word:word and will preserve it as one token. This can be easily seen using Elasticsearch's analyze API: localhost:9200/_analyze?tokenizer=standard&text=word%20word:word If this is the intended behavior, then why? I can't really see the logic behind it. If not, I'll be happy to join in the effort of fixing this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
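There is logic behind the behavior reported above: in the Unicode UAX#29 word-break rules that StandardTokenizer implements, ':' has the Word_Break=MidLetter property, so a colon flanked by letters does not break the token. The sketch below mimics that rule with an illustrative regex; it is not the real JFlex grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of UAX#29's MidLetter behavior (illustrative regex, not the
// real StandardTokenizer grammar): letters joined by ':' stay one token.
final class MidLetterSketch {
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+(?::[A-Za-z]+)*");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("word word:word"));  // [word, word:word]
    }
}
```

Changing this would mean tailoring the MidLetter set, which is a deliberate deviation from the Unicode default rather than a bug fix.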
[jira] [Created] (LUCENE-5997) StandardFilter redundant
Itamar Syn-Hershko created LUCENE-5997: -- Summary: StandardFilter redundant Key: LUCENE-5997 URL: https://issues.apache.org/jira/browse/LUCENE-5997 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 4.10.1 Reporter: Itamar Syn-Hershko Priority: Trivial Any reason why StandardFilter is still around? its just a no-op class now: @Override public final boolean incrementToken() throws IOException { return input.incrementToken(); // TODO: add some niceties for the new grammar } https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardFilter.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
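The quoted snippet can be restated with minimal stand-in types (not Lucene's actual classes) to show the point of the report: the filter's incrementToken() only delegates to the wrapped stream, adding no behavior of its own.

```java
// Stand-in types, not Lucene's: a filter whose incrementToken() is a pure
// pass-through, which is all StandardFilter does today per the report.
interface TokenStream {
    boolean incrementToken();
}

final class NoOpFilter implements TokenStream {
    private final TokenStream input;

    NoOpFilter(TokenStream input) { this.input = input; }

    @Override
    public boolean incrementToken() {
        return input.incrementToken();  // delegates and nothing else
    }
}

class NoOpDemo {
    public static void main(String[] args) {
        final int[] calls = {0};
        TokenStream source = () -> ++calls[0] < 3;  // yields two tokens, then ends
        TokenStream filtered = new NoOpFilter(source);
        int n = 0;
        while (filtered.incrementToken()) n++;
        System.out.println(n);  // 2
    }
}
```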
[jira] [Commented] (LUCENE-2841) CommonGramsFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035978#comment-14035978 ] Itamar Syn-Hershko commented on LUCENE-2841: Can anyone review and comment? > CommonGramsFilter improvements > -- > > Key: LUCENE-2841 > URL: https://issues.apache.org/jira/browse/LUCENE-2841 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 3.1, 4.0-ALPHA >Reporter: Steve Rowe >Priority: Minor > Fix For: 4.9, 5.0 > > Attachments: commit-6402a55.patch > > > Currently CommonGramsFilter expects users to remove the common words around > which output token ngrams are formed, by appending a StopFilter to the > analysis pipeline. This is inefficient in two ways: captureState() is called > on (trailing) stopwords, and then the whole stream has to be re-examined by > the following StopFilter. > The current ctor should be deprecated, and another ctor added with a boolean > option controlling whether the common words should be output as unigrams. > If common words *are* configured to be output as unigrams, captureState() > will still need to be called, as it is now. > If the common words are *not* configured to be output as unigrams, rather > than calling captureState() for the trailing token in each output token > ngram, the term text, position and offset can be maintained in the same way > as they are now for the leading token: using a System.arrayCopy()'d term > buffer and a few ints for positionIncrement and offsetd. The user then no > longer would need to append a StopFilter to the analysis chain. > An example illustrating both possibilities should also be added. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
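The proposed boolean option can be sketched outside the TokenStream machinery. This is an illustrative simplification, not the real CommonGramsFilter API: common-word bigrams are always emitted, and the flag controls whether the common words themselves also survive as unigrams, which is what currently requires a trailing StopFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch (not the real CommonGramsFilter API) of the proposed
// ctor option: emit bigrams around common words, optionally keeping the
// common words as unigrams.
final class CommonGramsSketch {
    static List<String> grams(List<String> tokens, Set<String> common, boolean keepUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (keepUnigrams || !common.contains(t)) out.add(t);  // drop common unigrams if asked
            if (i + 1 < tokens.size()
                    && (common.contains(t) || common.contains(tokens.get(i + 1)))) {
                out.add(t + "_" + tokens.get(i + 1));  // bigram touching a common word
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "fox");
        Set<String> common = new HashSet<>(Arrays.asList("the"));
        System.out.println(grams(tokens, common, false));  // [the_quick, quick, fox]
        System.out.println(grams(tokens, common, true));   // [the, the_quick, quick, fox]
    }
}
```

With keepUnigrams=false the output already excludes the stopwords, so no second pass over the stream is needed.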
[jira] [Commented] (LUCENE-4601) ivy availability check isn't quite right
[ https://issues.apache.org/jira/browse/LUCENE-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035885#comment-14035885 ] Itamar Syn-Hershko commented on LUCENE-4601: May not be directly related, but I just tried running this: http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ on OSX Mavericks, with ant and ivy both installed via homebrew. Ivy was not found by ant and IDEA even when I placed a manually downloaded jar locally myself. I had to run ivy-bootstrap to get off the ground - maybe it's worth adding that to the docs > ivy availability check isn't quite right > > > Key: LUCENE-4601 > URL: https://issues.apache.org/jira/browse/LUCENE-4601 > Project: Lucene - Core > Issue Type: Bug > Components: general/build >Reporter: Robert Muir > Fix For: 4.1, 5.0 > > Attachments: LUCENE-4601.patch > > > remove ivy from your .ant/lib but load it up on a build file like so: > You have to lie to lucene's build, overriding ivy.available, because for some > reason the detection is wrong and will tell you ivy is not available, when it > actually is. > I tried changing the detector to use available classname=some.ivy.class and > this didn't work either... so I don't actually know what the correct fix is. > {noformat} > > > > >uri="antlib:org.apache.ivy.ant" classpathref="ivy.lib.path" /> > > failonerror="true"> > > > > > > > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5723) Performance improvements for FastCharStream
Itamar Syn-Hershko created LUCENE-5723: -- Summary: Performance improvements for FastCharStream Key: LUCENE-5723 URL: https://issues.apache.org/jira/browse/LUCENE-5723 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser Reporter: Itamar Syn-Hershko Priority: Minor Hello from the .NET land, A user of ours has identified an optimization opportunity, although minor I think it points to a valid concern - we should avoid using exceptions for controlling flow when possible. Here's the original ticket + commits to our codebase. If this looks valid to you too I can go ahead and prepare a PR. https://issues.apache.org/jira/browse/LUCENENET-541 https://github.com/apache/lucene.net/commit/ac8c9fa809110ddb180bf7b2ce93e86270b39ff6 https://git-wip-us.apache.org/repos/asf?p=lucenenet.git;a=blobdiff;f=src/core/QueryParser/QueryParserTokenManager.cs;h=ec09c8e451f7a7d1572fbdce4c7598e362526a7c;hp=17583d20f660fdb6e4aa86105c7574383f965ebe;hb=41ebbc2d;hpb=ac8c9fa809110ddb180bf7b2ce93e86270b39ff6 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5358) Code cleanup on KStemmer
Itamar Syn-Hershko created LUCENE-5358: -- Summary: Code cleanup on KStemmer Key: LUCENE-5358 URL: https://issues.apache.org/jira/browse/LUCENE-5358 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.6, 4.5.1, 4.5, 3.0 Reporter: Itamar Syn-Hershko Priority: Minor This affects all versions with KStemmer in them The code of KStemmer needs some intensive cleanup, just to give you some idea on something that immediately popped up: https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/KStemmer.java#L283-286 I'll be happy to do this myself, just wanted to check in advance to see if this is something you'd consider accepting in -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5011) MemoryIndex and FVH don't play along with multi-value fields
[ https://issues.apache.org/jira/browse/LUCENE-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662950#comment-13662950 ] Itamar Syn-Hershko commented on LUCENE-5011: The actual test case we have now is very tightly coupled with ElasticSearch and our custom analysis chain, it may take me some time to decouple it into a stand-alone Lucene test. Alternatively, I'll be happy to work this out with you via Skype using our existing test case. > MemoryIndex and FVH don't play along with multi-value fields > > > Key: LUCENE-5011 > URL: https://issues.apache.org/jira/browse/LUCENE-5011 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 4.3 >Reporter: Itamar Syn-Hershko > > When multi-value fields are indexed to a MemoryIndex, positions are computed > correctly on search but the start and end offsets and the values array index > aren't correct. > Comparing the same execution path for IndexReader on a Directory impl and > MemoryIndex (same document, same query, same analyzer, different Index impl), > the difference first shows in FieldTermStack.java line 125: > termList.add( new TermInfo( term, dpEnum.startOffset(), dpEnum.endOffset(), > pos, weight ) ); > dpEnum.startOffset() and dpEnum.endOffset don't match between implementations. > This looks like a bug in MemoryIndex, which doesn't seem to handle tokenized > multi-value fields all too well when positions and offsets are required. > I should also mention we are using an Analyzer which outputs several tokens > at a position (a la SynonymFilter), but I don't believe this is related. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5011) MemoryIndex and FVH don't play along with multi-value fields
Itamar Syn-Hershko created LUCENE-5011: -- Summary: MemoryIndex and FVH don't play along with multi-value fields Key: LUCENE-5011 URL: https://issues.apache.org/jira/browse/LUCENE-5011 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.3 Reporter: Itamar Syn-Hershko When multi-value fields are indexed to a MemoryIndex, positions are computed correctly on search but the start and end offsets and the values array index aren't correct. Comparing the same execution path for IndexReader on a Directory impl and MemoryIndex (same document, same query, same analyzer, different Index impl), the difference first shows in FieldTermStack.java line 125: termList.add( new TermInfo( term, dpEnum.startOffset(), dpEnum.endOffset(), pos, weight ) ); dpEnum.startOffset() and dpEnum.endOffset don't match between implementations. This looks like a bug in MemoryIndex, which doesn't seem to handle tokenized multi-value fields all too well when positions and offsets are required. I should also mention we are using an Analyzer which outputs several tokens at a position (a la SynonymFilter), but I don't believe this is related. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
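The invariant the report says MemoryIndex breaks can be sketched in plain Java. Each value of a multi-valued field is analyzed from local offset 0, so the indexer must add the accumulated length of earlier values (plus an offset gap, assumed to be 1 here purely for illustration) to produce absolute start/end offsets; when that accumulation is skipped, highlighters like the FVH read wrong offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hedged sketch of multi-valued offset accounting; the gap of 1 and the
// whitespace tokenization are assumptions for illustration only.
final class MultiValueOffsets {
    static final int OFFSET_GAP = 1;  // assumed gap between values

    static List<int[]> absoluteOffsets(List<String> values) {
        List<int[]> offsets = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            int local = 0;  // each value's analyzer restarts at local offset 0
            for (String token : value.split(" ")) {
                offsets.add(new int[] { base + local, base + local + token.length() });
                local += token.length() + 1;  // skip token plus single space
            }
            base += value.length() + OFFSET_GAP;  // advance past this value
        }
        return offsets;
    }

    public static void main(String[] args) {
        for (int[] o : absoluteOffsets(Arrays.asList("aa bb", "cc")))
            System.out.println(o[0] + "-" + o[1]);  // second value starts at 6, not 0
    }
}
```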
[jira] [Commented] (LUCENE-4673) TermQuery.toString() doesn't play nicely with whitespace
[ https://issues.apache.org/jira/browse/LUCENE-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13548874#comment-13548874 ] Itamar Syn-Hershko commented on LUCENE-4673: I figured as much, yet we would definitely like to have this behavior built-in. Are there any plans for such an interface to perform a proper Query -> String conversion? > TermQuery.toString() doesn't play nicely with whitespace > > > Key: LUCENE-4673 > URL: https://issues.apache.org/jira/browse/LUCENE-4673 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 4.0-BETA, 4.1, 3.6.2 >Reporter: Itamar Syn-Hershko > > A TermQuery where term.text() contains whitespace outputs incorrect string > representation: field:foo bar instead of field:"foo bar". A "correct" > representation is such that could be parsed again to the correct Query object > (using the correct analyzer, yes, but still). > This may not be so critical, but in our system we use Lucene's QP to parse > and then pre-process and optimize user queries. To do that we use > Query.toString on some clauses to rebuild the query string. > This can be easily resolved by always adding quote marks before and after the > term text in TermQuery.toString. Testing to see if they are required or not > is too much work and TermQuery is ignorant of quote marks anyway. > Some other scenarios which could benefit from this change is places where > escaped characters are used, such as URLs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4673) TermQuery.toString() doesn't play nicely with whitespace
Itamar Syn-Hershko created LUCENE-4673: -- Summary: TermQuery.toString() doesn't play nicely with whitespace Key: LUCENE-4673 URL: https://issues.apache.org/jira/browse/LUCENE-4673 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 3.6.2, 4.0-BETA, 4.1 Reporter: Itamar Syn-Hershko A TermQuery where term.text() contains whitespace outputs incorrect string representation: field:foo bar instead of field:"foo bar". A "correct" representation is such that could be parsed again to the correct Query object (using the correct analyzer, yes, but still). This may not be so critical, but in our system we use Lucene's QP to parse and then pre-process and optimize user queries. To do that we use Query.toString on some clauses to rebuild the query string. This can be easily resolved by always adding quote marks before and after the term text in TermQuery.toString. Testing to see if they are required or not is too much work and TermQuery is ignorant of quote marks anyway. Some other scenarios which could benefit from this change is places where escaped characters are used, such as URLs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
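The fix proposed in the description is small enough to sketch directly. This is an illustration of the suggested behavior, not the committed TermQuery code: quote the term text unconditionally so a term containing whitespace round-trips through the query parser.

```java
// Sketch of the proposal: always quote term text in toString(), so
// field:foo bar becomes field:"foo bar". Illustrative, not Lucene's code.
final class TermToString {
    static String termToString(String field, String text) {
        return field + ":\"" + text + "\"";  // quotes added unconditionally
    }

    public static void main(String[] args) {
        System.out.println(termToString("field", "foo bar"));  // field:"foo bar"
    }
}
```

As the description notes, checking whether the quotes are actually needed costs more than always emitting them, since TermQuery itself is ignorant of quoting.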
[jira] [Commented] (LUCENE-2841) CommonGramsFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539310#comment-13539310 ] Itamar Syn-Hershko commented on LUCENE-2841: Attached is a patch to fix this, including tests. There is no regression, and the new behavior when keepOrig is set to true is as described in the comments here. The only thing I wasn't sure about was CommonGramsQueryFilter - should it be deprecated? or how should it be made to work with this change? > CommonGramsFilter improvements > -- > > Key: LUCENE-2841 > URL: https://issues.apache.org/jira/browse/LUCENE-2841 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 3.1, 4.0-ALPHA >Reporter: Steven Rowe >Priority: Minor > Fix For: 4.1 > > Attachments: commit-6402a55.patch > > > Currently CommonGramsFilter expects users to remove the common words around > which output token ngrams are formed, by appending a StopFilter to the > analysis pipeline. This is inefficient in two ways: captureState() is called > on (trailing) stopwords, and then the whole stream has to be re-examined by > the following StopFilter. > The current ctor should be deprecated, and another ctor added with a boolean > option controlling whether the common words should be output as unigrams. > If common words *are* configured to be output as unigrams, captureState() > will still need to be called, as it is now. > If the common words are *not* configured to be output as unigrams, rather > than calling captureState() for the trailing token in each output token > ngram, the term text, position and offset can be maintained in the same way > as they are now for the leading token: using a System.arrayCopy()'d term > buffer and a few ints for positionIncrement and offsetd. The user then no > longer would need to append a StopFilter to the analysis chain. > An example illustrating both possibilities should also be added. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2841) CommonGramsFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Itamar Syn-Hershko updated LUCENE-2841: --- Attachment: commit-6402a55.patch Adding option to CommonGramsFilter to not unigram common words > CommonGramsFilter improvements > -- > > Key: LUCENE-2841 > URL: https://issues.apache.org/jira/browse/LUCENE-2841 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 3.1, 4.0-ALPHA >Reporter: Steven Rowe >Priority: Minor > Fix For: 4.1 > > Attachments: commit-6402a55.patch > > > Currently CommonGramsFilter expects users to remove the common words around > which output token ngrams are formed, by appending a StopFilter to the > analysis pipeline. This is inefficient in two ways: captureState() is called > on (trailing) stopwords, and then the whole stream has to be re-examined by > the following StopFilter. > The current ctor should be deprecated, and another ctor added with a boolean > option controlling whether the common words should be output as unigrams. > If common words *are* configured to be output as unigrams, captureState() > will still need to be called, as it is now. > If the common words are *not* configured to be output as unigrams, rather > than calling captureState() for the trailing token in each output token > ngram, the term text, position and offset can be maintained in the same way > as they are now for the leading token: using a System.arrayCopy()'d term > buffer and a few ints for positionIncrement and offsetd. The user then no > longer would need to append a StopFilter to the analysis chain. > An example illustrating both possibilities should also be added. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4208) Spatial distance relevancy should use score of 1/distance
[ https://issues.apache.org/jira/browse/LUCENE-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451430#comment-13451430 ] Itamar Syn-Hershko commented on LUCENE-4208: What's the status of this? are query results being properly sorted based on distance? > Spatial distance relevancy should use score of 1/distance > - > > Key: LUCENE-4208 > URL: https://issues.apache.org/jira/browse/LUCENE-4208 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/spatial >Reporter: David Smiley > Fix For: 4.0 > > > The SpatialStrategy.makeQuery() at the moment uses the distance as the score > (although some strategies -- TwoDoubles if I recall might not do anything > which would be a bug). The distance is a poor value to use as the score > because the score should be related to relevancy, and the distance itself is > inversely related to that. A score of 1/distance would be nice. Another > alternative is earthCircumference/2 - distance, although I like 1/distance > better. Maybe use a different constant than 1. > Credit: this is Chris Male's idea. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
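The relevancy idea in the issue is that score should decrease as distance grows. A minimal sketch follows; the +1.0 guard against division by zero is an assumption of this illustration, not necessarily what the committed code does (the issue itself leaves the constant open).

```java
// Sketch of inverse-distance relevancy: nearer documents score higher.
// The 1.0 constants are assumptions for illustration.
final class DistanceScore {
    static double score(double distance) {
        return 1.0 / (1.0 + distance);  // monotonically decreasing in distance
    }

    public static void main(String[] args) {
        System.out.println(score(0.0) > score(10.0));  // true
    }
}
```

Sorting results by this score descending is equivalent to sorting by distance ascending, which is what the comment above asks about.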
[jira] [Commented] (LUCENE-4186) Lucene spatial's "distErrPct" is treated as a fraction, not a percent.
[ https://issues.apache.org/jira/browse/LUCENE-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447037#comment-13447037 ] Itamar Syn-Hershko commented on LUCENE-4186: distErrPct makes sense to me - it makes more sense to talk about the expected error rate rather than the actual given precision. Hence the name "Distance Error Percentage" makes perfect sense, although it is tough to make an acronym of... And while at it, a bug fix: SpatialArgs.toString should multiply distPrecision by 100, not divide it. > Lucene spatial's "distErrPct" is treated as a fraction, not a percent. > -- > > Key: LUCENE-4186 > URL: https://issues.apache.org/jira/browse/LUCENE-4186 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial >Reporter: David Smiley >Assignee: David Smiley >Priority: Critical > Fix For: 4.0 > > > The distance-error-percent of a query shape in Lucene spatial is, in a > nutshell, the percent of the shape's area that is an error epsilon when > considering search detail at its edges. The default is 2.5%, for reference. > However, as configured, it is read in as a fraction: > {code:xml} > class="solr.SpatialRecursivePrefixTreeFieldType" >distErrPct="0.025" maxDetailDist="0.001" /> > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
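The naming point above comes down to one conversion: distErrPct is stored as a fraction (0.025 means 2.5%), so rendering it as a percent requires multiplying by 100 - the SpatialArgs.toString fix suggested in the comment. A trivial sketch:

```java
// distErrPct is a fraction; displaying it as a percent means multiplying
// by 100, not dividing. Illustrative helper, not Lucene's SpatialArgs code.
final class DistErr {
    static double asPercent(double distErrFraction) {
        return distErrFraction * 100.0;
    }

    public static void main(String[] args) {
        System.out.println(asPercent(0.025));
    }
}
```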
[jira] [Commented] (LUCENE-4342) Issues with prefix tree's Distance Error Percentage
[ https://issues.apache.org/jira/browse/LUCENE-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445807#comment-13445807 ] Itamar Syn-Hershko commented on LUCENE-4342: I can confirm this is fixed now. Thanks for the fast turnaround! > Issues with prefix tree's Distance Error Percentage > > > Key: LUCENE-4342 > URL: https://issues.apache.org/jira/browse/LUCENE-4342 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial >Affects Versions: 4.0-ALPHA, 4.0-BETA >Reporter: Itamar Syn-Hershko >Assignee: David Smiley > Fix For: 4.0 > > Attachments: > LUCENE-4342_fix_distance_precision_lookup_for_prefix_trees,_and_modify_the_default_algorit.patch, > unnamed.patch > > > See attached patch for a failing test > Basically, it's a simple point and radius scenario that works great as long > as args.setDistPrecision(0.0); is called. Once the default precision is used > (2.5%), it doesn't work as expected. > The distance between the 2 points in the patch is 35.75 KM. Taking into > account the 2.5% error the effective radius without false negatives/positives > should be around 34.8 KM. This test fails with a radius of 33 KM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4342) Issues with prefix tree's Distance Error Percentage
[ https://issues.apache.org/jira/browse/LUCENE-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Itamar Syn-Hershko updated LUCENE-4342: --- Attachment: unnamed.patch A failing test > Issues with prefix tree's Distance Error Percentage > > > Key: LUCENE-4342 > URL: https://issues.apache.org/jira/browse/LUCENE-4342 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial >Affects Versions: 4.0-ALPHA, 4.0-BETA >Reporter: Itamar Syn-Hershko > Attachments: unnamed.patch > > > See attached patch for a failing test > Basically, it's a simple point and radius scenario that works great as long > as args.setDistPrecision(0.0); is called. Once the default precision is used > (2.5%), it doesn't work as expected. > The distance between the 2 points in the patch is 35.75 KM. Taking into > account the 2.5% error the effective radius without false negatives/positives > should be around 34.8 KM. This test fails with a radius of 33 KM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4342) Issues with prefix tree's Distance Error Percentage
Itamar Syn-Hershko created LUCENE-4342: -- Summary: Issues with prefix tree's Distance Error Percentage Key: LUCENE-4342 URL: https://issues.apache.org/jira/browse/LUCENE-4342 Project: Lucene - Core Issue Type: Bug Components: modules/spatial Affects Versions: 4.0-BETA, 4.0-ALPHA Reporter: Itamar Syn-Hershko Attachments: unnamed.patch See attached patch for a failing test Basically, it's a simple point and radius scenario that works great as long as args.setDistPrecision(0.0); is called. Once the default precision is used (2.5%), it doesn't work as expected. The distance between the 2 points in the patch is 35.75 KM. Taking into account the 2.5% error the effective radius without false negatives/positives should be around 34.8 KM. This test fails with a radius of 33 KM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
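The arithmetic behind the report is worth making explicit: with distErrPct = 0.025, the inner radius that should be free of false negatives is radius × (1 − 0.025). The sketch below reproduces the ~34.8 KM figure from the description.

```java
// Arithmetic from the report: effective inner radius after applying the
// distance-error fraction. Plain math, no Lucene spatial code involved.
final class EffectiveRadius {
    static double inner(double radiusKm, double distErrPct) {
        return radiusKm * (1.0 - distErrPct);
    }

    public static void main(String[] args) {
        // 35.75 KM with 2.5% error -> about 34.86 KM, matching the ~34.8 figure
        System.out.println(inner(35.75, 0.025));
    }
}
```

The bug as reported is that the test still fails at 33 KM, well inside that 34.86 KM bound.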
[jira] [Commented] (LUCENENET-483) Spatial Search skipping records when one location is close to origin, another one is away and radius is wider
[ https://issues.apache.org/jira/browse/LUCENENET-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280179#comment-13280179 ] Itamar Syn-Hershko commented on LUCENENET-483: -- Here is a passing test https://github.com/synhershko/lucene.net/commit/41e745a2aff596f3f7b0e2842a7b5fa7b45d88d3 You can grab a compiled version of Spatial4n.core and Lucene.Net.Contrib.Spatial.dll from https://github.com/synhershko/ravendb/tree/spatial/SharedLibs > Spatial Search skipping records when one location is close to origin, another > one is away and radius is wider > - > > Key: LUCENENET-483 > URL: https://issues.apache.org/jira/browse/LUCENENET-483 > Project: Lucene.Net > Issue Type: Bug > Components: Lucene.Net Contrib >Affects Versions: Lucene.Net 2.9.4, Lucene.Net 2.9.4g > Environment: .Net framework 4.0 >Reporter: Aleksandar Panov > Labels: lucene, spatialsearch > Fix For: Lucene.Net 3.0.3 > > > Running a spatial query against two locations where one location is close to > origin (less than a mile), another one is away (24 miles) and radius is wider > (52 miles) returns only one result. Running query with a bit wider radius > (53.8) returns 2 results. > IMPORTANT UPDATE: Problem can't be reproduced in Java, with using original > Lucene.Spatial (2.9.4 version) library. > {code} > // Origin > private double _lat = 42.350153; > private double _lng = -71.061667; > private const string LatField = "lat"; > private const string LngField = "lng"; > //Locations > AddPoint(writer, "Location 1", 42.0, -71.0); //24 miles away from > origin > AddPoint(writer, "Location 2", 42.35, -71.06); //less than a mile > [TestMethod] > public void TestAntiM() > { > _directory = new RAMDirectory(); > var writer = new IndexWriter(_directory, new > WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED); > SetUpPlotter(2, 15); > AddData(writer); > _searcher = new IndexSearcher(_directory, true); > //const double miles = 53.8; // Correct. 
Returns 2 Locations. > const double miles = 52; // Incorrect. Returns 1 Location. > Console.WriteLine("testAntiM"); > // create a distance query > var dq = new DistanceQueryBuilder(_lat, _lng, miles, LatField, > LngField, CartesianTierPlotter.DefaltFieldPrefix, true); > Console.WriteLine(dq); > //create a term query to search against all documents > Query tq = new TermQuery(new Term("metafile", "doc")); > var dsort = new DistanceFieldComparatorSource(dq.DistanceFilter); > Sort sort = new Sort(new SortField("foo", dsort, false)); > // Perform the search, using the term query, the distance filter, > and the > // distance sort > TopDocs hits = _searcher.Search(tq, dq.Filter, 1000, sort); > int results = hits.TotalHits; > ScoreDoc[] scoreDocs = hits.ScoreDocs; > // Get a list of distances > Dictionary<int, double> distances = dq.DistanceFilter.Distances; > Console.WriteLine("Distance Filter filtered: " + distances.Count); > Console.WriteLine("Results: " + results); > Console.WriteLine("="); > Console.WriteLine("Distances should be 2 " + distances.Count); > Console.WriteLine("Results should be 2 " + results); > Assert.AreEqual(2, distances.Count); // fixed a store of only > needed distances > Assert.AreEqual(2, results); > } > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (SOLR-3304) Add Solr support for the new Lucene spatial module
[ https://issues.apache.org/jira/browse/SOLR-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279609#comment-13279609 ] Itamar Syn-Hershko commented on SOLR-3304: -- In continuation to the discussion on the spatial4j list, +1 for having all the tests with actual spatial logic reside in the Lucene spatial module, and have the Solr tests rely on that > Add Solr support for the new Lucene spatial module > -- > > Key: SOLR-3304 > URL: https://issues.apache.org/jira/browse/SOLR-3304 > Project: Solr > Issue Type: New Feature >Affects Versions: 4.0 >Reporter: Bill Bell >Assignee: David Smiley > Labels: spatial > Attachments: SOLR-3304_Solr_fields_for_Lucene_spatial_module.patch > > > Get the Solr spatial module integrated with the lucene spatial module. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[Lucene.Net] [jira] [Commented] (LUCENENET-407) Signing the assembly
[ https://issues.apache.org/jira/browse/LUCENENET-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090201#comment-13090201 ] Itamar Syn-Hershko commented on LUCENENET-407: -- Hmm... I just looked around the branches and couldn't see this committed anywhere. Ideas? > Signing the assembly > > > Key: LUCENENET-407 > URL: https://issues.apache.org/jira/browse/LUCENENET-407 > Project: Lucene.Net > Issue Type: Improvement > Components: Lucene.Net Core >Affects Versions: Lucene.Net 2.9.2, Lucene.Net 2.9.4, Lucene.Net 3.x >Reporter: Itamar Syn-Hershko > Fix For: Lucene.Net 2.9.4, Lucene.Net 3.x > > Attachments: Lucene.NET.snk, signing.patch > > > For our usage of Lucene.NET we need the assembly to be signed.
[Lucene.Net] [jira] [Commented] (LUCENENET-426) Mark BaseFragmentsBuilder methods as virtual
[ https://issues.apache.org/jira/browse/LUCENENET-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053511#comment-13053511 ] Itamar Syn-Hershko commented on LUCENENET-426: -- Apparently that was not enough. I hit a need to override this one too: protected Field[] GetFields(IndexReader reader, int docId, String fieldName) Perhaps it'd make sense to mark all protected methods virtual? In Java you can override anything that is not final, so that would be compatible with the original version. > Mark BaseFragmentsBuilder methods as virtual > > > Key: LUCENENET-426 > URL: https://issues.apache.org/jira/browse/LUCENENET-426 > Project: Lucene.Net > Issue Type: Improvement > Components: Lucene.Net Contrib >Affects Versions: Lucene.Net 2.9.2, Lucene.Net 2.9.4, Lucene.Net 3.x, > Lucene.Net 2.9.4g >Reporter: Itamar Syn-Hershko >Priority: Minor > Fix For: Lucene.Net 2.9.4, Lucene.Net 2.9.4g > > Attachments: fvh.patch > > > Without marking methods in BaseFragmentsBuilder as virtual, it is meaningless > to have FragmentsBuilder deriving from a class named "Base", since most of > its functionality cannot be overridden. Attached is a patch for marking the > important methods virtual.
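The Java-vs-C# point above can be illustrated with a tiny standalone sketch (class and method names here are invented for illustration; this is not the actual BaseFragmentsBuilder code):

```java
// In Java, methods are overridable ("virtual") by default unless
// declared final; C# requires an explicit `virtual` keyword, which is
// what LUCENENET-426 asks for on the port's protected methods.
public class VirtualByDefault {
    static class BaseFragmentsBuilderSketch {
        String getFields() { return "base"; }     // overridable by default
        final String name() { return "Base"; }    // final: cannot be overridden
    }

    static class CustomBuilder extends BaseFragmentsBuilderSketch {
        @Override
        String getFields() { return "custom"; }   // no keyword needed in Java
    }

    public static void main(String[] args) {
        // the override is picked up through the base-typed reference
        BaseFragmentsBuilderSketch b = new CustomBuilder();
        System.out.println(b.getFields()); // prints "custom"
    }
}
```

Marking the C# methods `virtual` would restore this Java behavior for the port.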
[jira] [Commented] (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032359#comment-13032359 ] Itamar Syn-Hershko commented on LUCENE-2215: Thanks. I ended up using the standard Lucene paging code. Hopefully this will get into Lucene soon... > paging collector > > > Key: LUCENE-2215 > URL: https://issues.apache.org/jira/browse/LUCENE-2215 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.4, 3.0 >Reporter: Adam Heinz >Assignee: Grant Ingersoll >Priority: Minor > Attachments: IterablePaging.java, LUCENE-2215.patch, > PagingCollector.java, TestingPagingCollector.java > > > http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898 > Somebody assign this to Aaron McCurry and we'll see if we can get enough > votes on this issue to convince him to upload his patch. :)
[jira] [Commented] (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022372#comment-13022372 ] Itamar Syn-Hershko commented on LUCENE-2215: Hi guys, any update on this? I'm interested in using this for production code. Can anyone comment on how safe / mature this code is? Thanks! > paging collector > > > Key: LUCENE-2215 > URL: https://issues.apache.org/jira/browse/LUCENE-2215 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.4, 3.0 >Reporter: Adam Heinz >Assignee: Grant Ingersoll >Priority: Minor > Attachments: IterablePaging.java, LUCENE-2215.patch, > PagingCollector.java, TestingPagingCollector.java > > > http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898 > Somebody assign this to Aaron McCurry and we'll see if we can get enough > votes on this issue to convince him to upload his patch. :)
[jira] Created: (LUCENE-2518) Make check of BooleanClause.Occur[] in MultiFieldQueryParser.parse less stubborn
Make check of BooleanClause.Occur[] in MultiFieldQueryParser.parse less stubborn Key: LUCENE-2518 URL: https://issues.apache.org/jira/browse/LUCENE-2518 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 3.0.2, 3.0.1, 3.0, 2.9.3, 2.9.2, 2.9.1, 2.9 Reporter: Itamar Syn-Hershko Priority: Minor Update the check in: public static Query parse(Version matchVersion, String query, String[] fields, BooleanClause.Occur[] flags, Analyzer analyzer) throws ParseException { if (fields.length != flags.length) throw new IllegalArgumentException("fields.length != flags.length"); to be: if (fields.length > flags.length) so the consumer can use one Occur array and apply fields selectively. The only danger here is hitting a non-existent cell in flags, and the relaxed check guards against that just as well, without limiting usability for such cases.
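The proposed relaxation can be sketched as a standalone check (class and enum names here are invented; this is not the real MultiFieldQueryParser source):

```java
// Sketch of the LUCENE-2518 proposal: only reject when the flags
// array is too short to cover the fields array, so one shared
// Occur[] can be reused across calls that pass fewer fields.
public class RelaxedFlagsCheck {
    enum Occur { MUST, SHOULD, MUST_NOT }  // stand-in for BooleanClause.Occur

    static void validate(String[] fields, Occur[] flags) {
        // proposed `>` instead of the current `!=`:
        if (fields.length > flags.length)
            throw new IllegalArgumentException("fields.length > flags.length");
    }

    public static void main(String[] args) {
        Occur[] shared = { Occur.MUST, Occur.SHOULD, Occur.MUST_NOT };
        // The strict `!=` check would reject this call; the relaxed
        // `>` check accepts it, since flags covers both fields:
        validate(new String[] { "title", "body" }, shared);
        System.out.println("accepted");
    }
}
```

Out-of-bounds access stays impossible: a flags array shorter than fields still throws, exactly as before.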
[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word
[ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868183#action_12868183 ] Itamar Syn-Hershko commented on LUCENE-2465: bq. This is why i say, the only solution is to follow unicode. Adding hacks like this will only break other languages. Problem is, Hebrew parsing has been broken for a long time now, and this still needs fixing. I don't think you should be forcing extra pre-handling for Hebrew or Bengali (or other) queries, just to keep CJK parsing working out of the box. Escaping those cases by the caller is a much more complex operation than the normal escaping you'd do on your queries. For languages where a colon is used as a character, if indeed the use case is the same as mid-word gershayim (i.e. there's no key for that letter and it is more of a letter than a punctuation char), the issue with the QP is the same. If the solution I proposed initially hadn't caused other issues with CJK phrases, I'd insist on it. However, you are obviously right that this change would break functionality for those languages, but you are wrong in claiming it is not up to the query parser to resolve. As Shai has already pointed out, the QP should parse based on syntax with the smallest hassle to the consumer. Obviously, a solution has to be provided, and it is agreed it should not affect the variety of supported languages. How about creating this functionality and leaving it optional? For CJK you'd leave it off, while for all other languages (English and European) you could turn it on and feel no difference even in the worst-case scenario. Or, you could have this setting accessible from your Analyzer. Analyzers define the per-language behavior, and as such it would make sense to have the QP check with the analyzer which cases are a syntax error and which aren't.
> QueryParser should ignore double-quotes if mid-word > --- > > Key: LUCENE-2465 > URL: https://issues.apache.org/jira/browse/LUCENE-2465 > Project: Lucene - Java > Issue Type: Bug > Components: QueryParser >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, > 4.0 >Reporter: Itamar Syn-Hershko > > Current implementation of Lucene's QueryParser identifies a phrase in the > query when hitting a double-quotes char, even if it is mid-word. For example, > the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term > and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase > is a group of words surrounded by double quotes as defined by > http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but no-where does > it say double-quotes will also tokenize the input. Arguably, a phrase should > only be identified as such when it is also surrounded by whitespaces. > Other than a logically incorrect behavior, this makes parsing of Hebrew > acronyms impossible. Hebrew acronyms contain one double-quotes char in the > middle of a word (for example, MNK"L), hence causing the QP to throw a syntax > exception, since it is expecting another double-quotes to create a phrase > query, essentially splitting the acronym into two. > The solution to this is pretty simple - changing the JavaCC syntax to check > if a whitespace precedes the double-quote when a phrase opening is expected, > or peek to see if a whitespace follows the double-quotes if a phrase closing > is expected. > This will both eliminate a logically incorrect behavior which shouldn't be > relied on anyway, and allow Hebrew queries to be correctly parsed also when > containing acronyms.
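The whitespace rule proposed in the issue can be modeled with a small standalone sketch (invented class and method names; the real fix would live in the JavaCC grammar, not in code like this):

```java
// Toy model of the proposed rule: a double-quote opens a phrase only
// when preceded by whitespace or the start of the string, and closes
// one only when followed by whitespace or the end. A mid-word quote,
// as in the Hebrew-style acronym MNK"L, is then left inside the token.
public class WhitespaceQuoteRule {
    static boolean opensPhrase(String q, int i) {
        return q.charAt(i) == '"'
            && (i == 0 || Character.isWhitespace(q.charAt(i - 1)));
    }

    static boolean closesPhrase(String q, int i) {
        return q.charAt(i) == '"'
            && (i + 1 == q.length() || Character.isWhitespace(q.charAt(i + 1)));
    }

    public static void main(String[] args) {
        String q = "MNK\"L \"bar test\"";
        System.out.println(opensPhrase(q, 3));   // mid-word quote: false
        System.out.println(opensPhrase(q, 6));   // after a space: true
        System.out.println(closesPhrase(q, 15)); // at end of string: true
    }
}
```

Under this rule ' Foo"bar test" ' would keep Foo"bar as one term instead of starting a phrase mid-word, which is exactly the behavior change the issue asks for.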
[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word
[ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867993#action_12867993 ] Itamar Syn-Hershko commented on LUCENE-2465: My point exactly - no one uses that character, and it will require a double pass on the string *always*. I have pretty much rested my case already, and it would have been clearer to you if you could read the language. Isn't Google treating those chars the same, and Wikipedia using just double-quotes, proof enough for my argument that double-quotes are allowed mid-word, that 99.9% of the time they are used that way, and that this isn't incorrect behavior? For Hebrew or other multi-lingual systems this will require always preparing the string before calling parse(), and this is definitely unwanted behavior. Since the solution is *that* simple and non-breaking, I don't see why not just fix it - bug or not. Any other opinions on the matter? > QueryParser should ignore double-quotes if mid-word > --- > > Key: LUCENE-2465 > URL: https://issues.apache.org/jira/browse/LUCENE-2465 > Project: Lucene - Java > Issue Type: Bug > Components: QueryParser >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, > 4.0 >Reporter: Itamar Syn-Hershko > > Current implementation of Lucene's QueryParser identifies a phrase in the > query when hitting a double-quotes char, even if it is mid-word. For example, > the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term > and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase > is a group of words surrounded by double quotes as defined by > http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but no-where does > it say double-quotes will also tokenize the input. Arguably, a phrase should > only be identified as such when it is also surrounded by whitespaces.
> Other than a logically incorrect behavior, this makes parsing of Hebrew > acronyms impossible. Hebrew acronyms contain one double-quotes char in the > middle of a word (for example, MNK"L), hence causing the QP to throw a syntax > exception, since it is expecting another double-quotes to create a phrase > query, essentially splitting the acronym into two. > The solution to this is pretty simple - changing the JavaCC syntax to check > if a whitespace precedes the double-quote when a phrase opening is expected, > or peek to see if a whitespace follows the double-quotes if a phrase closing > is expected. > This will both eliminate a logically incorrect behavior which shouldn't be > relied on anyway, and allow Hebrew queries to be correctly parsed also when > containing acronyms.
[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word
[ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867982#action_12867982 ] Itamar Syn-Hershko commented on LUCENE-2465: Using QueryParser.escape() is not an option, since by doing that I practically prevent the QP from ever returning PhraseQuery instances on user queries (it just replaces all occurrences of a QP syntax char). Your other suggestion of using the "correct" Unicode char GERSHAYIM is not doable, because we are talking about user-typed queries here, and no user has such a character on their keyboard. In 99.9% of Hebrew text files, old and new, double-quotes are used as GERSHAYIM. The only exceptions are when an automated program has converted the mid-word instance of double-quotes into U+05F4. This is pretty much like asking the Lucene community to type U+201C and U+201D (left / right double quotation marks) around phrases or they won't be recognized as such. Because no one has those characters easily accessible from their keyboard (to the best of my knowledge), and it doesn't really matter anyway what you type, this thought never crossed anyone's mind. Exactly the same goes for Hebrew. The only doable workaround is to go through the query string before sending it to the QP, and resolve this by either escaping mid-word double-quotes or replacing them with U+05F4. Since most Hebrew dictionaries work with double-quotes for acronyms anyway, escaping seems much better, but then I ask again - why bother with a double pass on the query string if a simple change to the QP can resolve this? The effect the behavior has on non-Hebrew scripts is flawed anyway, and there's no reason to require such a pass for Hebrew consumers only (imagine what it'd be like to write a multi-lingual search interface with this issue in mind).
As a reference, see how Google and Wikipedia treat Hebrew acronyms: http://www.google.com/#hl=en&source=hp&q=%D7%9E%D7%A0%D7%9B%22%D7%9C&aq=f&aqi=&aql=&oq=&gs_rfai=&fp=d059ab474882bfe2 http://he.wikipedia.org/wiki/%D7%9E%D7%A0%D7%9B%22%D7%9C Google recognizes both double-quotes and GERSHAYIM as correct forms of Hebrew acronyms, while Wikipedia only uses the former in all acronyms. Robert, I hear what you are saying, but this just ain't right when it comes to usability, when the resolution is so simple and doesn't break anything. > QueryParser should ignore double-quotes if mid-word > --- > > Key: LUCENE-2465 > URL: https://issues.apache.org/jira/browse/LUCENE-2465 > Project: Lucene - Java > Issue Type: Bug > Components: QueryParser >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, > 4.0 >Reporter: Itamar Syn-Hershko > > Current implementation of Lucene's QueryParser identifies a phrase in the > query when hitting a double-quotes char, even if it is mid-word. For example, > the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term > and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase > is a group of words surrounded by double quotes as defined by > http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but no-where does > it say double-quotes will also tokenize the input. Arguably, a phrase should > only be identified as such when it is also surrounded by whitespaces. > Other than a logically incorrect behavior, this makes parsing of Hebrew > acronyms impossible. Hebrew acronyms contain one double-quotes char in the > middle of a word (for example, MNK"L), hence causing the QP to throw a syntax > exception, since it is expecting another double-quotes to create a phrase > query, essentially splitting the acronym into two. 
> The solution to this is pretty simple - changing the JavaCC syntax to check > if a whitespace precedes the double-quote when a phrase opening is expected, > or peek to see if a whitespace follows the double-quotes if a phrase closing > is expected. > This will both eliminate a logically incorrect behavior which shouldn't be > relied on anyway, and allow Hebrew queries to be correctly parsed also when > containing acronyms.
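The caller-side workaround described in the comments above — a pre-pass over the raw query that rewrites mid-word double-quotes to U+05F4 before handing the string to the parser — could look roughly like this (invented class name; the letter-adjacency heuristic is illustrative, not a complete solution):

```java
// Sketch of the preprocessing workaround: rewrite a double-quote to
// U+05F4 (HEBREW PUNCTUATION GERSHAYIM) only when it sits between two
// letters, leaving genuine phrase quotes untouched. This is the double
// pass the comment argues a QueryParser fix would make unnecessary.
public class GershayimRewrite {
    static String rewrite(String q) {
        StringBuilder sb = new StringBuilder(q.length());
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            boolean midWord = c == '"'
                && i > 0 && Character.isLetter(q.charAt(i - 1))
                && i + 1 < q.length() && Character.isLetter(q.charAt(i + 1));
            sb.append(midWord ? '\u05F4' : c); // phrase quotes pass through
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // the acronym's quote is rewritten; the phrase quotes survive
        System.out.println(rewrite("MNK\"L \"a phrase\""));
    }
}
```

Escaping the mid-word quote instead of replacing it would work the same way, with `sb.append("\\\"")` in the mid-word branch.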
[jira] Created: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word
QueryParser should ignore double-quotes if mid-word --- Key: LUCENE-2465 URL: https://issues.apache.org/jira/browse/LUCENE-2465 Project: Lucene - Java Issue Type: Bug Components: QueryParser Affects Versions: 3.0.1, 3.0, 2.9.2, 2.9.1, 2.9, 2.4.1, 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.1, 2.0.0, 1.9, 2.3.3, 2.4.2, 2.9.3, Flex Branch, 3.0.2, 3.1, 4.0 Reporter: Itamar Syn-Hershko Current implementation of Lucene's QueryParser identifies a phrase in the query when hitting a double-quotes char, even if it is mid-word. For example, the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase is a group of words surrounded by double quotes as defined by http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but no-where does it say double-quotes will also tokenize the input. Arguably, a phrase should only be identified as such when it is also surrounded by whitespaces. Other than a logically incorrect behavior, this makes parsing of Hebrew acronyms impossible. Hebrew acronyms contain one double-quotes char in the middle of a word (for example, MNK"L), hence causing the QP to throw a syntax exception, since it is expecting another double-quotes to create a phrase query, essentially splitting the acronym into two. The solution to this is pretty simple - changing the JavaCC syntax to check if a whitespace precedes the double-quote when a phrase opening is expected, or peek to see if a whitespace follows the double-quotes if a phrase closing is expected. This will both eliminate a logically incorrect behavior which shouldn't be relied on anyway, and allow Hebrew queries to be correctly parsed also when containing acronyms.