[jira] [Commented] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)

2019-08-16 Thread Itamar Syn-Hershko (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909145#comment-16909145
 ] 

Itamar Syn-Hershko commented on LUCENE-8565:


Heya - is this waiting on anything in particular that I can help finalize? I 
would really like to see this merged in. Thanks

> SimpleQueryParser to support field filtering (aka Add field:text operator)
> --
>
> Key: LUCENE-8565
> URL: https://issues.apache.org/jira/browse/LUCENE-8565
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/queryparser
>Reporter: Itamar Syn-Hershko
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> SimpleQueryParser lacks support for the `field:` operator for creating 
> queries which operate on fields other than the default field. Seems like one 
> can either get the parsed query to operate on a single field, or on ALL 
> defined fields (+ weights). No support for specifying `field:value` in the 
> query.
> It probably wasn't forgotten, but rather left out for simplicity, but since 
> we are using this QP implementation more and more (mostly through 
> Elasticsearch) we thought it would be useful to have it in.
> Seems like this is not too hard to pull off and I'll be happy to contribute a 
> patch for it.






[jira] [Commented] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)

2019-02-19 Thread Itamar Syn-Hershko (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771783#comment-16771783
 ] 

Itamar Syn-Hershko commented on LUCENE-8565:


I'm not sure what the Lucene versioning policy on that would be, but we can 
always change the default flag to turn field filtering support off.
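For reference, a minimal sketch of how SimpleQueryParser's existing flags 
bitmask already works for enabling and disabling operators; the field operator 
from the PR would presumably be gated the same way (no such flag exists yet, 
and the class and field names below are illustrative only):

{code:java}
import java.util.Collections;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.simple.SimpleQueryParser;

public class OperatorFlagsSketch {
  public static SimpleQueryParser newParser() {
    // Only the listed operators are interpreted; anything else is parsed as plain text.
    int flags = SimpleQueryParser.AND_OPERATOR
        | SimpleQueryParser.OR_OPERATOR
        | SimpleQueryParser.PHRASE_OPERATOR;
    return new SimpleQueryParser(new StandardAnalyzer(),
        Collections.singletonMap("body", 1.0f), flags);
  }
}
{code}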

> SimpleQueryParser to support field filtering (aka Add field:text operator)
> --
>
> Key: LUCENE-8565
> URL: https://issues.apache.org/jira/browse/LUCENE-8565
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/queryparser
>Reporter: Itamar Syn-Hershko
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> SimpleQueryParser lacks support for the `field:` operator for creating 
> queries which operate on fields other than the default field. Seems like one 
> can either get the parsed query to operate on a single field, or on ALL 
> defined fields (+ weights). No support for specifying `field:value` in the 
> query.
> It probably wasn't forgotten, but rather left out for simplicity, but since 
> we are using this QP implementation more and more (mostly through 
> Elasticsearch) we thought it would be useful to have it in.
> Seems like this is not too hard to pull off and I'll be happy to contribute a 
> patch for it.






[jira] [Updated] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)

2018-11-14 Thread Itamar Syn-Hershko (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Itamar Syn-Hershko updated LUCENE-8565:
---
Summary: SimpleQueryParser to support field filtering (aka Add field:text 
operator)  (was: SimpleQueryString to support field filtering (aka Add 
field:text operator))

> SimpleQueryParser to support field filtering (aka Add field:text operator)
> --
>
> Key: LUCENE-8565
> URL: https://issues.apache.org/jira/browse/LUCENE-8565
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/queryparser
>Reporter: Itamar Syn-Hershko
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> SimpleQueryString lacks support for the `field:` operator for creating 
> queries which operate on fields other than the default field. Seems like one 
> can either get the parsed query to operate on a single field, or on ALL 
> defined fields (+ weights). No support for specifying `field:value` in the 
> query.
> It probably wasn't forgotten, but rather left out for simplicity, but since 
> we are using this QP implementation more and more (mostly through 
> Elasticsearch) we thought it would be useful to have it in.
> Seems like this is not too hard to pull off and I'll be happy to contribute a 
> patch for it.






[jira] [Updated] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)

2018-11-14 Thread Itamar Syn-Hershko (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Itamar Syn-Hershko updated LUCENE-8565:
---
Description: 
SimpleQueryParser lacks support for the `field:` operator for creating queries 
which operate on fields other than the default field. It seems one can either 
get the parsed query to operate on a single field, or on ALL defined fields 
(+ weights); there is no support for specifying `field:value` in the query.

It probably wasn't forgotten but rather left out for simplicity; however, since 
we are using this QP implementation more and more (mostly through Elasticsearch), 
we thought it would be useful to have it in.

Seems like this is not too hard to pull off and I'll be happy to contribute a 
patch for it.

  was:
SimpleQueryString lacks support for the `field:` operator for creating queries 
which operate on fields other than the default field. Seems like one can either 
get the parsed query to operate on a single field, or on ALL defined fields (+ 
weights). No support for specifying `field:value` in the query.

It probably wasn't forgotten, but rather left out for simplicity, but since we 
are using this QP implementation more and more (mostly through Elasticsearch) 
we thought it would be useful to have it in.

Seems like this is not too hard to pull off and I'll be happy to contribute a 
patch for it.


> SimpleQueryParser to support field filtering (aka Add field:text operator)
> --
>
> Key: LUCENE-8565
> URL: https://issues.apache.org/jira/browse/LUCENE-8565
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/queryparser
>Reporter: Itamar Syn-Hershko
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> SimpleQueryParser lacks support for the `field:` operator for creating 
> queries which operate on fields other than the default field. Seems like one 
> can either get the parsed query to operate on a single field, or on ALL 
> defined fields (+ weights). No support for specifying `field:value` in the 
> query.
> It probably wasn't forgotten, but rather left out for simplicity, but since 
> we are using this QP implementation more and more (mostly through 
> Elasticsearch) we thought it would be useful to have it in.
> Seems like this is not too hard to pull off and I'll be happy to contribute a 
> patch for it.






[jira] [Commented] (LUCENE-8565) SimpleQueryString to support field filtering (aka Add field:text operator)

2018-11-14 Thread Itamar Syn-Hershko (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686301#comment-16686301
 ] 

Itamar Syn-Hershko commented on LUCENE-8565:


PR submitted on GitHub: [https://github.com/apache/lucene-solr/pull/498]. 
Reviews appreciated.

> SimpleQueryString to support field filtering (aka Add field:text operator)
> --
>
> Key: LUCENE-8565
> URL: https://issues.apache.org/jira/browse/LUCENE-8565
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/queryparser
>Reporter: Itamar Syn-Hershko
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> SimpleQueryString lacks support for the `field:` operator for creating 
> queries which operate on fields other than the default field. Seems like one 
> can either get the parsed query to operate on a single field, or on ALL 
> defined fields (+ weights). No support for specifying `field:value` in the 
> query.
> It probably wasn't forgotten, but rather left out for simplicity, but since 
> we are using this QP implementation more and more (mostly through 
> Elasticsearch) we thought it would be useful to have it in.
> Seems like this is not too hard to pull off and I'll be happy to contribute a 
> patch for it.






[jira] [Updated] (LUCENE-8565) SimpleQueryString to support field filtering (aka Add field:text operator)

2018-11-14 Thread Itamar Syn-Hershko (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Itamar Syn-Hershko updated LUCENE-8565:
---
Description: 
SimpleQueryString lacks support for the `field:` operator for creating queries 
which operate on fields other than the default field. Seems like one can either 
get the parsed query to operate on a single field, or on ALL defined fields (+ 
weights). No support for specifying `field:value` in the query.

It probably wasn't forgotten, but rather left out for simplicity, but since we 
are using this QP implementation more and more (mostly through Elasticsearch) 
we thought it would be useful to have it in.

Seems like this is not too hard to pull off and I'll be happy to contribute a 
patch for it.

  was:
SimpleQueryString lacks support for the `field:` operator for creating queries 
which operate on fields other than the default field. Seems like one can either 
get the parsed query to operate on a single field, or on ALL defined fields (+ 
weights). No support for specifying `field:value` in the query.

It probably wasn't forgotten, but rather left out for simplicity, but since we 
are using this QP implementation more and more (mostly through Elasticsearch) 
we thought it would be 

Seems like this is not too hard to pull off and I'll be happy to contribute a 
patch for it.


> SimpleQueryString to support field filtering (aka Add field:text operator)
> --
>
> Key: LUCENE-8565
> URL: https://issues.apache.org/jira/browse/LUCENE-8565
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/queryparser
>Reporter: Itamar Syn-Hershko
>Priority: Minor
>
> SimpleQueryString lacks support for the `field:` operator for creating 
> queries which operate on fields other than the default field. Seems like one 
> can either get the parsed query to operate on a single field, or on ALL 
> defined fields (+ weights). No support for specifying `field:value` in the 
> query.
> It probably wasn't forgotten, but rather left out for simplicity, but since 
> we are using this QP implementation more and more (mostly through 
> Elasticsearch) we thought it would be useful to have it in.
> Seems like this is not too hard to pull off and I'll be happy to contribute a 
> patch for it.






[jira] [Created] (LUCENE-8565) SimpleQueryString to support field filtering (aka Add field:text operator)

2018-11-13 Thread Itamar Syn-Hershko (JIRA)
Itamar Syn-Hershko created LUCENE-8565:
--

 Summary: SimpleQueryString to support field filtering (aka Add 
field:text operator)
 Key: LUCENE-8565
 URL: https://issues.apache.org/jira/browse/LUCENE-8565
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser
Reporter: Itamar Syn-Hershko


SimpleQueryString lacks support for the `field:` operator for creating queries 
which operate on fields other than the default field. Seems like one can either 
get the parsed query to operate on a single field, or on ALL defined fields (+ 
weights). No support for specifying `field:value` in the query.

It probably wasn't forgotten, but rather left out for simplicity, but since we 
are using this QP implementation more and more (mostly through Elasticsearch) 
we thought it would be 

Seems like this is not too hard to pull off and I'll be happy to contribute a 
patch for it.
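To illustrate the limitation, a minimal sketch (not from the ticket; the field 
names "title" and "body" are just examples) of the two modes SimpleQueryParser 
offers today - one default field, or a fixed weighted set of fields - with no 
way to address a specific field from inside the query string:

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.simple.SimpleQueryParser;
import org.apache.lucene.search.Query;

public class SimpleQueryParserModes {
  public static void main(String[] args) {
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // Mode 1: every clause in the query string targets one default field.
    SimpleQueryParser single = new SimpleQueryParser(analyzer, "body");
    Query q1 = single.parse("foo +bar");

    // Mode 2: every clause targets the same fixed set of fields, with per-field boosts.
    Map<String, Float> weights = new LinkedHashMap<>();
    weights.put("title", 2.0f);
    weights.put("body", 1.0f);
    SimpleQueryParser weighted = new SimpleQueryParser(analyzer, weights);
    Query q2 = weighted.parse("foo | bar");

    System.out.println(q1);
    System.out.println(q2);
    // There is no syntax like "title:foo +body:bar" in either mode -- that is
    // what this issue proposes to add.
  }
}
{code}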






[jira] [Commented] (LUCENE-6302) Adding Date Math support to Lucene Expressions module

2015-02-26 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338677#comment-14338677
 ] 

Itamar Syn-Hershko commented on LUCENE-6302:


Sent a PR for the latter: https://github.com/apache/lucene-solr/pull/129

> Adding Date Math support to Lucene Expressions module
> -
>
> Key: LUCENE-6302
> URL: https://issues.apache.org/jira/browse/LUCENE-6302
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/expressions
>Affects Versions: 4.10.3
>Reporter: Itamar Syn-Hershko
>
> Lucene Expressions are great, but they don't allow for date math. More 
> specifically, they don't allow to infer date parts from a numeric 
> representation of a date stamp, nor they allow to parse strings 
> representations to dates.
> Some of the features requested here easy to implement via ValueSource 
> implementation (and potentially minor changes to the lexer definition) , some 
> are more involved. I'll be happy if we could get half of those in, and will 
> be happy to work on a PR for the parts we can agree on.
> The items we will be happy to have:
> - A now() function (with or without TZ support) to return a current long 
> date/time value as numeric, that we could use against indexed datetime fields 
> (which are infact numerics)
> - Parsing methods - to allow to express datetime as strings, and / or read it 
> from stored fields and parse it from there. Parse errors would render a value 
> of zero.
> - Given a numeric value, allow to specify it is a date value and then infer 
> date parts - e.g. Date(1424963520).Year == 2015, or Date(now()) - 
> Date(1424963520).Year. Basically methods which return numerics but internally 
> create and use Date objects.






[jira] [Commented] (LUCENE-6302) Adding Date Math support to Lucene Expressions module

2015-02-26 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338563#comment-14338563
 ] 

Itamar Syn-Hershko commented on LUCENE-6302:


I actually expected the main objection would be to adding date parsing methods 
:)

Maybe it would make sense to explain the use cases this is trying to solve.

We are using Elasticsearch & Kibana, and since the latest version switched to 
using Lucene Expressions (from Groovy) we find ourselves limited in what we can 
do with Kibana's scripted fields.

For example, given a user's DOB, how can we do aggregations on their age? Or 
compute how many years (or days) have passed between two given dates?

Yes, we can subtract the epochs (except that it doesn't seem to work: 
https://github.com/elasticsearch/elasticsearch/issues/9884), but translating the 
result into days, hours or years is even uglier with an expression.

I think introducing ValueSources to do this should be enough, but if changing 
the lexer is the preferred way I can go and do that as well. With regards 
to syntax - I'm not locked into any particular syntax.

Either way, it seems like adding a now() function is the easiest change, and I 
can send a PR with this change alone to start with.
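To make the workaround concrete, a rough sketch (not from the ticket) of what 
the epoch-subtraction approach looks like against the expressions module as it 
stands; the "dob" field name is an assumption, and the current time has to be 
baked in as a constant sub-expression because there is no now():

{code:java}
import java.text.ParseException;

import org.apache.lucene.expressions.Expression;
import org.apache.lucene.expressions.SimpleBindings;
import org.apache.lucene.expressions.js.JavascriptCompiler;
import org.apache.lucene.search.SortField;

public class AgeByEpochSubtraction {
  public static SortField ageInDaysSort() throws ParseException {
    SimpleBindings bindings = new SimpleBindings();
    // "dob" is assumed to be indexed as a long holding epoch milliseconds.
    bindings.add(new SortField("dob", SortField.Type.LONG));
    // No now() function exists, so the current time is spliced in as a constant.
    bindings.add("now", JavascriptCompiler.compile(Long.toString(System.currentTimeMillis())));
    // Raw millisecond arithmetic; converting to days/years by hand is the ugly part.
    Expression ageInDays = JavascriptCompiler.compile("(now - dob) / (24 * 60 * 60 * 1000)");
    return ageInDays.getSortField(bindings, true); // reverse=true: largest age (oldest) first
  }
}
{code}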

> Adding Date Math support to Lucene Expressions module
> -
>
> Key: LUCENE-6302
> URL: https://issues.apache.org/jira/browse/LUCENE-6302
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/expressions
>Affects Versions: 4.10.3
>Reporter: Itamar Syn-Hershko
>
> Lucene Expressions are great, but they don't allow for date math. More 
> specifically, they don't allow to infer date parts from a numeric 
> representation of a date stamp, nor they allow to parse strings 
> representations to dates.
> Some of the features requested here easy to implement via ValueSource 
> implementation (and potentially minor changes to the lexer definition) , some 
> are more involved. I'll be happy if we could get half of those in, and will 
> be happy to work on a PR for the parts we can agree on.
> The items we will be happy to have:
> - A now() function (with or without TZ support) to return a current long 
> date/time value as numeric, that we could use against indexed datetime fields 
> (which are infact numerics)
> - Parsing methods - to allow to express datetime as strings, and / or read it 
> from stored fields and parse it from there. Parse errors would render a value 
> of zero.
> - Given a numeric value, allow to specify it is a date value and then infer 
> date parts - e.g. Date(1424963520).Year == 2015, or Date(now()) - 
> Date(1424963520).Year. Basically methods which return numerics but internally 
> create and use Date objects.






[jira] [Created] (LUCENE-6302) Adding Date Math support to Lucene Expressions module

2015-02-26 Thread Itamar Syn-Hershko (JIRA)
Itamar Syn-Hershko created LUCENE-6302:
--

 Summary: Adding Date Math support to Lucene Expressions module
 Key: LUCENE-6302
 URL: https://issues.apache.org/jira/browse/LUCENE-6302
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/expressions
Affects Versions: 4.10.3
Reporter: Itamar Syn-Hershko


Lucene Expressions are great, but they don't allow for date math. More 
specifically, they don't allow inferring date parts from a numeric 
representation of a date stamp, nor do they allow parsing string representations 
into dates.

Some of the features requested here are easy to implement via a ValueSource 
implementation (and potentially minor changes to the lexer definition); some 
are more involved. I'll be happy if we could get half of those in, and will be 
happy to work on a PR for the parts we can agree on.

The items we would be happy to have:

- A now() function (with or without TZ support) to return the current date/time 
as a numeric long value that we could use against indexed datetime fields 
(which are in fact numerics).
- Parsing methods - to allow expressing datetimes as strings, and/or reading 
them from stored fields and parsing them from there. Parse errors would render 
a value of zero.
- Given a numeric value, allow specifying that it is a date value and then infer 
date parts - e.g. Date(1424963520).Year == 2015, or Date(now()) - 
Date(1424963520).Year. Basically, methods which return numerics but internally 
create and use Date objects.
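As a starting point for the first item, a rough sketch (not part of the issue) 
of what a now() ValueSource could look like against the 4.x function-query API; 
the class name is hypothetical:

{code:java}
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.LongDocValues;

public class NowValueSource extends ValueSource {
  // Fixed when the ValueSource is created, so every document sees the same "now".
  private final long now = System.currentTimeMillis();

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext) throws IOException {
    return new LongDocValues(this) {
      @Override
      public long longVal(int doc) {
        return now;
      }
    };
  }

  @Override
  public String description() {
    return "now()";
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof NowValueSource && ((NowValueSource) o).now == now;
  }

  @Override
  public int hashCode() {
    return Long.valueOf(now).hashCode();
  }
}
{code}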






[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word

2014-12-10 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241306#comment-14241306
 ] 

Itamar Syn-Hershko commented on LUCENE-6103:


Sent them a request. I'll buy Robert beers if that helps push this 
forward!

> StandardTokenizer doesn't tokenize word:word
> 
>
> Key: LUCENE-6103
> URL: https://issues.apache.org/jira/browse/LUCENE-6103
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.9
>Reporter: Itamar Syn-Hershko
>Assignee: Steve Rowe
>
> StandardTokenizer (and by result most default analyzers) will not tokenize 
> word:word and will preserve it as one token. This can be easily seen using 
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic 
> behind it.
> If not, I'll be happy to join in the effort of fixing this.






[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word

2014-12-10 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241214#comment-14241214
 ] 

Itamar Syn-Hershko commented on LUCENE-6103:


Maybe out of scope for this ticket, but how do we go about #2? I'll be happy to 
take this discussion offline as well.

> StandardTokenizer doesn't tokenize word:word
> 
>
> Key: LUCENE-6103
> URL: https://issues.apache.org/jira/browse/LUCENE-6103
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.9
>Reporter: Itamar Syn-Hershko
>Assignee: Steve Rowe
>
> StandardTokenizer (and by result most default analyzers) will not tokenize 
> word:word and will preserve it as one token. This can be easily seen using 
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic 
> behind it.
> If not, I'll be happy to join in the effort of fixing this.






[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word

2014-12-09 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240392#comment-14240392
 ] 

Itamar Syn-Hershko commented on LUCENE-6103:


0. You mean it implements UAX#29 version 6.3 :)

1. I'll likely be sending a PR for #1 sometime soon. Would you recommend using 
UAX#29 minus specific non-English tweaks, falling back to 
ClassicStandardTokenizer, which is English-specific, or something else?

2. Here's the thing: the standard is wrong, or buggy. Ask any Swede and they 
will tell you, and any non-Swedish corpus wouldn't care. Basically, this is 
a bug in every Lucene-based system today because of the word:word scenario; it's 
a bit of an edge case, but I bet I can find multiple occurrences in every big 
enough system. What can we do about that?

We already solved this using char filters, converting colons to commas. It 
feels a bit hacky though, and again - this _is_ a flaw in Lucene's analysis 
even though it conforms to a Unicode standard.
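For reference, a minimal sketch (not from the thread, written against the 4.x 
analysis API) of that char-filter workaround: mapping colons to commas before 
StandardTokenizer sees the text, so "word:word" tokenizes as two words:

{code:java}
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ColonCharFilterDemo {
  public static void main(String[] args) throws Exception {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add(":", ","); // colon -> comma before tokenization
    Reader filtered = new MappingCharFilter(builder.build(), new StringReader("word word:word"));

    StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_CURRENT, filtered);
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // word, word, word -- three tokens
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}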

> StandardTokenizer doesn't tokenize word:word
> 
>
> Key: LUCENE-6103
> URL: https://issues.apache.org/jira/browse/LUCENE-6103
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.9
>Reporter: Itamar Syn-Hershko
>Assignee: Steve Rowe
>
> StandardTokenizer (and by result most default analyzers) will not tokenize 
> word:word and will preserve it as one token. This can be easily seen using 
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic 
> behind it.
> If not, I'll be happy to join in the effort of fixing this.






[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word

2014-12-09 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240133#comment-14240133
 ] 

Itamar Syn-Hershko commented on LUCENE-6103:


Ok, so I did some homework. In Swedish, "connect" is a way to shorten words in 
writing. So "C:a" is in fact "cirka", which means "approximately". I guess it 
can be thought of like English acronyms, only apparently it's far less commonly 
used in Swedish (my source says "very very seldomly used; old style and not 
used in modern Swedish at all").

Not only is it hardly used, apparently it's only legal in 3-letter 
combinations (c:a but not c:ka).

Also, the effects it has are quite severe at the moment - two words with a 
colon in between and no space will be output as one token even though it's 100% 
not applicable to Swedish, since each word has > 2 characters.

I'm not aiming at changing the Unicode standards, that's way beyond my limited 
powers, but:

1. Given the above, does it really make sense to use this tokenizer in all 
language-specific analyzers as well? e.g. 
https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L105

I'd think that for language-specific analyzers we'd want tokenizers aimed at 
that language with limited support for others. So, in this case, the colon 
would always be considered a tokenizing char.

2. Can we change the JFlex definition to at least limit the effects of this, 
e.g. only support colon as MidLetter if the overall token length == 3, so that 
c:a is a valid token and word:word is not?

> StandardTokenizer doesn't tokenize word:word
> 
>
> Key: LUCENE-6103
> URL: https://issues.apache.org/jira/browse/LUCENE-6103
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.9
>Reporter: Itamar Syn-Hershko
>Assignee: Steve Rowe
>
> StandardTokenizer (and by result most default analyzers) will not tokenize 
> word:word and will preserve it as one token. This can be easily seen using 
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic 
> behind it.
> If not, I'll be happy to join in the effort of fixing this.






[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word

2014-12-09 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240090#comment-14240090
 ] 

Itamar Syn-Hershko commented on LUCENE-6103:


Good stuff, thanks Steve. I'm going to see if the rest of the UAX is good for 
us, and if so, see if I can comply with the 6.2.5 version of the spec.

It's a good thing StandardTokenizer is no longer English-centric, but I cannot 
imagine what use the colon has, especially since in most cases it is not 
"something reasonable" :)

> StandardTokenizer doesn't tokenize word:word
> 
>
> Key: LUCENE-6103
> URL: https://issues.apache.org/jira/browse/LUCENE-6103
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.9
>Reporter: Itamar Syn-Hershko
>Assignee: Steve Rowe
>
> StandardTokenizer (and by result most default analyzers) will not tokenize 
> word:word and will preserve it as one token. This can be easily seen using 
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic 
> behind it.
> If not, I'll be happy to join in the effort of fixing this.






[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word

2014-12-09 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239784#comment-14239784
 ] 

Itamar Syn-Hershko commented on LUCENE-6103:


Yes, I figured it would come down to some Unicode rules. Can you give a rationale 
for this, mainly out of curiosity?

I'm not a Unicode expert, but I'd assume that, just as you wouldn't want English 
words to not break on the Hebrew Punctuation Gershayim (e.g. Test"Word is 
actually 2 tokens while מנכ"לים is one), this rule is meant for specific 
scenarios and not for the general use case?

On another note, any type of Gershayim should be preserved within Hebrew words, 
not only U+05F4. This is mainly because the keyboards and editors in use produce 
the standard " character in most cases. I had a chat with Robert a while back 
where he said that's the case; I'm just making sure you didn't follow the spec 
to the letter in that regard...

> StandardTokenizer doesn't tokenize word:word
> 
>
> Key: LUCENE-6103
> URL: https://issues.apache.org/jira/browse/LUCENE-6103
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.9
>Reporter: Itamar Syn-Hershko
>Assignee: Steve Rowe
>
> StandardTokenizer (and by result most default analyzers) will not tokenize 
> word:word and will preserve it as one token. This can be easily seen using 
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic 
> behind it.
> If not, I'll be happy to join in the effort of fixing this.






[jira] [Commented] (LUCENE-5723) Performance improvements for FastCharStream

2014-12-09 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239728#comment-14239728
 ] 

Itamar Syn-Hershko commented on LUCENE-5723:


Reported as https://java.net/jira/browse/JAVACC-285

> Performance improvements for FastCharStream
> ---
>
> Key: LUCENE-5723
> URL: https://issues.apache.org/jira/browse/LUCENE-5723
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/queryparser
>Reporter: Itamar Syn-Hershko
>Priority: Minor
>
> Hello from the .NET land,
> A user of ours has identified an optimization opportunity, although minor I 
> think it points to a valid point - we should avoid using exceptions from 
> controlling flow when possible.
> Here's the original ticket + commits to our codebase. If this looks valid to 
> you too I can go ahead and prepare a PR.
> https://issues.apache.org/jira/browse/LUCENENET-541
> https://github.com/apache/lucene.net/commit/ac8c9fa809110ddb180bf7b2ce93e86270b39ff6
> https://git-wip-us.apache.org/repos/asf?p=lucenenet.git;a=blobdiff;f=src/core/QueryParser/QueryParserTokenManager.cs;h=ec09c8e451f7a7d1572fbdce4c7598e362526a7c;hp=17583d20f660fdb6e4aa86105c7574383f965ebe;hb=41ebbc2d;hpb=ac8c9fa809110ddb180bf7b2ce93e86270b39ff6






[jira] [Commented] (LUCENE-5997) StandardFilter redundant

2014-12-09 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239697#comment-14239697
 ] 

Itamar Syn-Hershko commented on LUCENE-5997:


Sounds good!

> StandardFilter redundant
> 
>
> Key: LUCENE-5997
> URL: https://issues.apache.org/jira/browse/LUCENE-5997
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 4.10.1
>Reporter: Itamar Syn-Hershko
>Priority: Trivial
>
> Any reason why StandardFilter is still around? its just a no-op class now:
>   @Override
>   public final boolean incrementToken() throws IOException {
> return input.incrementToken(); // TODO: add some niceties for the new 
> grammar
>   }
> https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardFilter.java






[jira] [Updated] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word

2014-12-09 Thread Itamar Syn-Hershko (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Itamar Syn-Hershko updated LUCENE-6103:
---
Summary: StandardTokenizer doesn't tokenize word:word  (was: 
StandardTokenizer doesn't tokenizer word:word)

> StandardTokenizer doesn't tokenize word:word
> 
>
> Key: LUCENE-6103
> URL: https://issues.apache.org/jira/browse/LUCENE-6103
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.9
>Reporter: Itamar Syn-Hershko
>
> StandardTokenizer (and by result most default analyzers) will not tokenize 
> word:word and will preserve it as one token. This can be easily seen using 
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic 
> behind it.
> If not, I'll be happy to join in the effort of fixing this.






[jira] [Created] (LUCENE-6103) StandardTokenizer doesn't tokenizer word:word

2014-12-09 Thread Itamar Syn-Hershko (JIRA)
Itamar Syn-Hershko created LUCENE-6103:
--

 Summary: StandardTokenizer doesn't tokenizer word:word
 Key: LUCENE-6103
 URL: https://issues.apache.org/jira/browse/LUCENE-6103
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 4.9
Reporter: Itamar Syn-Hershko


StandardTokenizer (and by result most default analyzers) will not tokenize 
word:word and will preserve it as one token. This can be easily seen using 
Elasticsearch's analyze API:

localhost:9200/_analyze?tokenizer=standard&text=word%20word:word

If this is the intended behavior, then why? I can't really see the logic behind 
it.

If not, I'll be happy to join in the effort of fixing this.






[jira] [Created] (LUCENE-5997) StandardFilter redundant

2014-10-07 Thread Itamar Syn-Hershko (JIRA)
Itamar Syn-Hershko created LUCENE-5997:
--

 Summary: StandardFilter redundant
 Key: LUCENE-5997
 URL: https://issues.apache.org/jira/browse/LUCENE-5997
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 4.10.1
Reporter: Itamar Syn-Hershko
Priority: Trivial


Any reason why StandardFilter is still around? It's just a no-op class now:

  @Override
  public final boolean incrementToken() throws IOException {
    return input.incrementToken(); // TODO: add some niceties for the new grammar
  }

https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardFilter.java






[jira] [Commented] (LUCENE-2841) CommonGramsFilter improvements

2014-06-18 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035978#comment-14035978
 ] 

Itamar Syn-Hershko commented on LUCENE-2841:


Can anyone review and comment?

> CommonGramsFilter improvements
> --
>
> Key: LUCENE-2841
> URL: https://issues.apache.org/jira/browse/LUCENE-2841
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 3.1, 4.0-ALPHA
>Reporter: Steve Rowe
>Priority: Minor
> Fix For: 4.9, 5.0
>
> Attachments: commit-6402a55.patch
>
>
> Currently CommonGramsFilter expects users to remove the common words around 
> which output token ngrams are formed, by appending a StopFilter to the 
> analysis pipeline.  This is inefficient in two ways: captureState() is called 
> on (trailing) stopwords, and then the whole stream has to be re-examined by 
> the following StopFilter.
> The current ctor should be deprecated, and another ctor added with a boolean 
> option controlling whether the common words should be output as unigrams.
> If common words *are* configured to be output as unigrams, captureState() 
> will still need to be called, as it is now.
> If the common words are *not* configured to be output as unigrams, rather 
> than calling captureState() for the trailing token in each output token 
> ngram, the term text, position and offset can be maintained in the same way 
> as they are now for the leading token: using a System.arrayCopy()'d term 
> buffer and a few ints for positionIncrement and offsetd.  The user then no 
> longer would need to append a StopFilter to the analysis chain.
> An example illustrating both possibilities should also be added.






[jira] [Commented] (LUCENE-4601) ivy availability check isn't quite right

2014-06-18 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035885#comment-14035885
 ] 

Itamar Syn-Hershko commented on LUCENE-4601:


May not be directly related, but I just tried running this: 
http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ on OSX Mavericks, 
with ant and ivy both installed via homebrew. Ivy was not found by ant and IDEA 
even when I placed a manually downloaded jar locally myself.

I had to run ivy-bootstrap to get off the ground - maybe it's worth adding that 
to the docs.

> ivy availability check isn't quite right
> 
>
> Key: LUCENE-4601
> URL: https://issues.apache.org/jira/browse/LUCENE-4601
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: general/build
>Reporter: Robert Muir
> Fix For: 4.1, 5.0
>
> Attachments: LUCENE-4601.patch
>
>
> remove ivy from your .ant/lib but load it up on a build file like so:
> You have to lie to lucene's build, overriding ivy.available, because for some 
> reason the detection is wrong and will tell you ivy is not available, when it 
> actually is.
> I tried changing the detector to use available classname=some.ivy.class and 
> this didnt work either... so I don't actually know what the correct fix is.
> {noformat}
>uri="antlib:org.apache.ivy.ant" classpathref="ivy.lib.path" />
>  failonerror="true">
> {noformat}






[jira] [Created] (LUCENE-5723) Performance improvements for FastCharStream

2014-05-31 Thread Itamar Syn-Hershko (JIRA)
Itamar Syn-Hershko created LUCENE-5723:
--

 Summary: Performance improvements for FastCharStream
 Key: LUCENE-5723
 URL: https://issues.apache.org/jira/browse/LUCENE-5723
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser
Reporter: Itamar Syn-Hershko
Priority: Minor


Hello from .NET land,

A user of ours has identified an optimization opportunity; although minor, I 
think it points to a valid principle - we should avoid using exceptions for 
control flow when possible.

Here's the original ticket + commits to our codebase. If this looks valid to 
you too, I can go ahead and prepare a PR.

https://issues.apache.org/jira/browse/LUCENENET-541
https://github.com/apache/lucene.net/commit/ac8c9fa809110ddb180bf7b2ce93e86270b39ff6
https://git-wip-us.apache.org/repos/asf?p=lucenenet.git;a=blobdiff;f=src/core/QueryParser/QueryParserTokenManager.cs;h=ec09c8e451f7a7d1572fbdce4c7598e362526a7c;hp=17583d20f660fdb6e4aa86105c7574383f965ebe;hb=41ebbc2d;hpb=ac8c9fa809110ddb180bf7b2ce93e86270b39ff6
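For illustration only (this is not the FastCharStream code), the general idea 
behind the change - signal end-of-input with a sentinel value instead of 
throwing and catching an exception on the hot path:

{code:java}
// Sketch of the principle: a reader that reports EOF via a return value.
final class BufferedCharSource {
  private final char[] buf;
  private int pos;

  BufferedCharSource(char[] buf) {
    this.buf = buf;
  }

  /** Returns the next char, or -1 at end of input -- no exception on the hot path. */
  int readChar() {
    return pos < buf.length ? buf[pos++] : -1;
  }
}
{code}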






[jira] [Created] (LUCENE-5358) Code cleanup on KStemmer

2013-12-03 Thread Itamar Syn-Hershko (JIRA)
Itamar Syn-Hershko created LUCENE-5358:
--

 Summary: Code cleanup on KStemmer
 Key: LUCENE-5358
 URL: https://issues.apache.org/jira/browse/LUCENE-5358
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.6, 4.5.1, 4.5, 3.0
Reporter: Itamar Syn-Hershko
Priority: Minor


This affects all versions with KStemmer in them.

The code of KStemmer needs some intensive cleanup. Just to give you an idea, 
here is something that immediately popped up:

https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/KStemmer.java#L283-286

I'll be happy to do this myself; I just wanted to check in advance whether this 
is something you'd consider accepting.






[jira] [Commented] (LUCENE-5011) MemoryIndex and FVH don't play along with multi-value fields

2013-05-21 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662950#comment-13662950
 ] 

Itamar Syn-Hershko commented on LUCENE-5011:


The actual test case we have now is very tightly coupled with ElasticSearch and 
our custom analysis chain, so it may take me some time to decouple it into a 
stand-alone Lucene test. Alternatively, I'll be happy to work this out with you 
via Skype using our existing test case.

> MemoryIndex and FVH don't play along with multi-value fields
> 
>
> Key: LUCENE-5011
> URL: https://issues.apache.org/jira/browse/LUCENE-5011
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 4.3
>Reporter: Itamar Syn-Hershko
>
> When multi-value fields are indexed to a MemoryIndex, positions are computed 
> correctly on search but the start and end offsets and the values array index 
> aren't correct.
> Comparing the same execution path for IndexReader on a Directory impl  and 
> MemoryIndex (same document, same query, same analyzer, different Index impl), 
> the difference first shows in FieldTermStack.java line 125:
> termList.add( new TermInfo( term, dpEnum.startOffset(), dpEnum.endOffset(), 
> pos, weight ) );
> dpEnum.startOffset() and dpEnum.endOffset don't match between implementations.
> This looks like a bug in MemoryIndex, which doesn't seem to handle tokenized 
> multi-value fields all too well when positions and offsets are required.
> I should also mention we are using an Analyzer which outputs several tokens 
> at a position (a la SynonymFilter), but I don't believe this is related.




[jira] [Created] (LUCENE-5011) MemoryIndex and FVH don't play along with multi-value fields

2013-05-21 Thread Itamar Syn-Hershko (JIRA)
Itamar Syn-Hershko created LUCENE-5011:
--

 Summary: MemoryIndex and FVH don't play along with multi-value 
fields
 Key: LUCENE-5011
 URL: https://issues.apache.org/jira/browse/LUCENE-5011
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Itamar Syn-Hershko


When multi-value fields are indexed to a MemoryIndex, positions are computed 
correctly on search but the start and end offsets and the values array index 
aren't correct.

Comparing the same execution path for an IndexReader on a Directory impl and 
MemoryIndex (same document, same query, same analyzer, different index impl), 
the difference first shows in FieldTermStack.java line 125:

termList.add( new TermInfo( term, dpEnum.startOffset(), dpEnum.endOffset(), pos, weight ) );

dpEnum.startOffset() and dpEnum.endOffset() don't match between implementations.

This looks like a bug in MemoryIndex, which doesn't seem to handle tokenized 
multi-value fields all too well when positions and offsets are required.

I should also mention we are using an Analyzer which outputs several tokens at 
a position (a la SynonymFilter), but I don't believe this is related.




[jira] [Commented] (LUCENE-4673) TermQuery.toString() doesn't play nicely with whitespace

2013-01-09 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13548874#comment-13548874
 ] 

Itamar Syn-Hershko commented on LUCENE-4673:


I figured as much, yet we would definitely like to have this behavior 
built in. Are there any plans for an interface that performs a proper 
Query -> String conversion?

> TermQuery.toString() doesn't play nicely with whitespace
> 
>
> Key: LUCENE-4673
> URL: https://issues.apache.org/jira/browse/LUCENE-4673
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 4.0-BETA, 4.1, 3.6.2
>Reporter: Itamar Syn-Hershko
>
> A TermQuery where term.text() contains whitespace outputs incorrect string 
> representation: field:foo bar instead of field:"foo bar". A "correct" 
> representation is such that could be parsed again to the correct Query object 
> (using the correct analyzer, yes, but still).
> This may not be so critical, but in our system we use Lucene's QP to parse 
> and then pre-process and optimize user queries. To do that we use 
> Query.toString on some clauses to rebuild the query string.
> This can be easily resolved by always adding quote marks before and after the 
> term text in TermQuery.toString. Testing to see if they are required or not  
> is too much work and TermQuery is ignorant of quote marks anyway.
> Some other scenarios which could benefit from this change is places where 
> escaped characters are used, such as URLs.




[jira] [Created] (LUCENE-4673) TermQuery.toString() doesn't play nicely with whitespace

2013-01-09 Thread Itamar Syn-Hershko (JIRA)
Itamar Syn-Hershko created LUCENE-4673:
--

 Summary: TermQuery.toString() doesn't play nicely with whitespace
 Key: LUCENE-4673
 URL: https://issues.apache.org/jira/browse/LUCENE-4673
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.6.2, 4.0-BETA, 4.1
Reporter: Itamar Syn-Hershko


A TermQuery where term.text() contains whitespace outputs an incorrect string 
representation: field:foo bar instead of field:"foo bar". A "correct" 
representation is one that could be parsed back into the correct Query object 
(using the correct analyzer, yes, but still).

This may not be so critical, but in our system we use Lucene's QP to parse and 
then pre-process and optimize user queries. To do that we use Query.toString on 
some clauses to rebuild the query string.

This could easily be resolved by always adding quote marks before and after the 
term text in TermQuery.toString. Testing whether they are required is too much 
work, and TermQuery is ignorant of quote marks anyway.

Some other scenarios which could benefit from this change are places where 
escaped characters are used, such as URLs.
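A minimal sketch of the round-trip problem described above (the field and term 
text are just examples):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class TermQueryToStringDemo {
  public static void main(String[] args) {
    TermQuery q = new TermQuery(new Term("field", "foo bar"));
    // Prints: field:foo bar
    // A query parser would read that back as two clauses rather than one term;
    // quoting the text (field:"foo bar") would round-trip correctly.
    System.out.println(q.toString());
  }
}
{code}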




[jira] [Commented] (LUCENE-2841) CommonGramsFilter improvements

2012-12-24 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539310#comment-13539310
 ] 

Itamar Syn-Hershko commented on LUCENE-2841:


Attached is a patch to fix this, including tests. There is no regression, and 
the new behavior when keepOrig is set to true is as described in the comments 
here.

The only thing I wasn't sure about was CommonGramsQueryFilter - should it be 
deprecated? Or how should it be made to work with this change?
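For context, a rough sketch (not from the patch, written against the 4.x 
analysis API) of the current chain the quoted description below refers to - 
CommonGramsFilter followed by a trailing StopFilter; the tokenizer and word 
list are illustrative only:

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CommonGramsChain extends Analyzer {
  private static final CharArraySet COMMON =
      StopFilter.makeStopSet(Version.LUCENE_40, "the", "of", "and");

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
    // CommonGramsFilter emits the original tokens plus bigrams around common words...
    TokenStream chain = new CommonGramsFilter(Version.LUCENE_40, source, COMMON);
    // ...so a trailing StopFilter is appended just to drop the common-word unigrams again.
    chain = new StopFilter(Version.LUCENE_40, chain, COMMON);
    return new TokenStreamComponents(source, chain);
  }
}
{code}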

> CommonGramsFilter improvements
> --
>
> Key: LUCENE-2841
> URL: https://issues.apache.org/jira/browse/LUCENE-2841
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 3.1, 4.0-ALPHA
>Reporter: Steven Rowe
>Priority: Minor
> Fix For: 4.1
>
> Attachments: commit-6402a55.patch
>
>
> Currently CommonGramsFilter expects users to remove the common words around 
> which output token ngrams are formed, by appending a StopFilter to the 
> analysis pipeline.  This is inefficient in two ways: captureState() is called 
> on (trailing) stopwords, and then the whole stream has to be re-examined by 
> the following StopFilter.
> The current ctor should be deprecated, and another ctor added with a boolean 
> option controlling whether the common words should be output as unigrams.
> If common words *are* configured to be output as unigrams, captureState() 
> will still need to be called, as it is now.
> If the common words are *not* configured to be output as unigrams, rather 
> than calling captureState() for the trailing token in each output token 
> ngram, the term text, position and offset can be maintained in the same way 
> as they are now for the leading token: using a System.arrayCopy()'d term 
> buffer and a few ints for positionIncrement and offsetd.  The user then no 
> longer would need to append a StopFilter to the analysis chain.
> An example illustrating both possibilities should also be added.




[jira] [Updated] (LUCENE-2841) CommonGramsFilter improvements

2012-12-24 Thread Itamar Syn-Hershko (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Itamar Syn-Hershko updated LUCENE-2841:
---

Attachment: commit-6402a55.patch

Adding option to CommonGramsFilter to not unigram common words

> CommonGramsFilter improvements
> --
>
> Key: LUCENE-2841
> URL: https://issues.apache.org/jira/browse/LUCENE-2841
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 3.1, 4.0-ALPHA
>Reporter: Steven Rowe
>Priority: Minor
> Fix For: 4.1
>
> Attachments: commit-6402a55.patch
>
>
> Currently CommonGramsFilter expects users to remove the common words around 
> which output token ngrams are formed, by appending a StopFilter to the 
> analysis pipeline.  This is inefficient in two ways: captureState() is called 
> on (trailing) stopwords, and then the whole stream has to be re-examined by 
> the following StopFilter.
> The current ctor should be deprecated, and another ctor added with a boolean 
> option controlling whether the common words should be output as unigrams.
> If common words *are* configured to be output as unigrams, captureState() 
> will still need to be called, as it is now.
> If the common words are *not* configured to be output as unigrams, rather 
> than calling captureState() for the trailing token in each output token 
> ngram, the term text, position and offset can be maintained in the same way 
> as they are now for the leading token: using a System.arrayCopy()'d term 
> buffer and a few ints for positionIncrement and offsetd.  The user then no 
> longer would need to append a StopFilter to the analysis chain.
> An example illustrating both possibilities should also be added.




[jira] [Commented] (LUCENE-4208) Spatial distance relevancy should use score of 1/distance

2012-09-08 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451430#comment-13451430
 ] 

Itamar Syn-Hershko commented on LUCENE-4208:


What's the status of this? Are query results being properly sorted based on 
distance?

> Spatial distance relevancy should use score of 1/distance
> -
>
> Key: LUCENE-4208
> URL: https://issues.apache.org/jira/browse/LUCENE-4208
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/spatial
>Reporter: David Smiley
> Fix For: 4.0
>
>
> The SpatialStrategy.makeQuery() at the moment uses the distance as the score 
> (although some strategies -- TwoDoubles if I recall might not do anything 
> which would be a bug).  The distance is a poor value to use as the score 
> because the score should be related to relevancy, and the distance itself is 
> inversely related to that.  A score of 1/distance would be nice.  Another 
> alternative is earthCircumference/2 - distance, although I like 1/distance 
> better.  Maybe use a different constant than 1.
> Credit: this is Chris Male's idea.




[jira] [Commented] (LUCENE-4186) Lucene spatial's "distErrPct" is treated as a fraction, not a percent.

2012-09-02 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447037#comment-13447037
 ] 

Itamar Syn-Hershko commented on LUCENE-4186:


distErrPct makes sense to me - it makes more sense to talk about the expected 
error rate than the actual given precision. Hence the name "Distance Error 
Percentage" makes perfect sense, although it is tough to make an acronym of...

And while at it, throw in a bug fix: SpatialArgs.toString should multiply 
distPrecision by 100, not divide it.

> Lucene spatial's "distErrPct" is treated as a fraction, not a percent.
> --
>
> Key: LUCENE-4186
> URL: https://issues.apache.org/jira/browse/LUCENE-4186
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Critical
> Fix For: 4.0
>
>
> The distance-error-percent of a query shape in Lucene spatial is, in a 
> nutshell, the percent of the shape's area that is an error epsilon when 
> considering search detail at its edges.  The default is 2.5%, for reference.  
> However, as configured, it is read in as a fraction:
> {code:xml}
> <fieldType class="solr.SpatialRecursivePrefixTreeFieldType"
>    distErrPct="0.025" maxDetailDist="0.001" />
> {code}
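
In other words, the configured value is read as a fraction even though the name 
says percent. A hypothetical conversion helper makes the mismatch (and the 
toString fix mentioned above) explicit; the class and method names are 
illustrative only:

{code}
// Illustrative only: distErrPct="0.025" is interpreted as a fraction
// (0.025 == 2.5%), not as "2.5".
public final class DistErrPctConversions {

  private DistErrPctConversions() {}

  public static double fractionToPercent(double fraction) {
    return fraction * 100.0;  // 0.025 -> 2.5 (what toString should print)
  }

  public static double percentToFraction(double percent) {
    return percent / 100.0;   // 2.5 -> 0.025 (what the parser actually reads)
  }
}
{code}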

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4342) Issues with prefix tree's Distance Error Percentage

2012-08-31 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445807#comment-13445807
 ] 

Itamar Syn-Hershko commented on LUCENE-4342:


I can confirm this is fixed now. Thanks for the fast turnaround!

> Issues with prefix tree's Distance Error Percentage 
> 
>
> Key: LUCENE-4342
> URL: https://issues.apache.org/jira/browse/LUCENE-4342
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial
>Affects Versions: 4.0-ALPHA, 4.0-BETA
>Reporter: Itamar Syn-Hershko
>Assignee: David Smiley
> Fix For: 4.0
>
> Attachments: 
> LUCENE-4342_fix_distance_precision_lookup_for_prefix_trees,_and_modify_the_default_algorit.patch,
>  unnamed.patch
>
>
> See attached patch for a failing test
> Basically, it's a simple point and radius scenario that works great as long 
> as args.setDistPrecision(0.0); is called. Once the default precision is used 
> (2.5%), it doesn't work as expected.
> The distance between the 2 points in the patch is 35.75 KM. Taking into 
> account the 2.5% error the effective radius without false negatives/positives 
> should be around 34.8 KM. This test fails with a radius of 33 KM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4342) Issues with prefix tree's Distance Error Percentage

2012-08-29 Thread Itamar Syn-Hershko (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Itamar Syn-Hershko updated LUCENE-4342:
---

Attachment: unnamed.patch

A failing test

> Issues with prefix tree's Distance Error Percentage 
> 
>
> Key: LUCENE-4342
> URL: https://issues.apache.org/jira/browse/LUCENE-4342
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial
>Affects Versions: 4.0-ALPHA, 4.0-BETA
>Reporter: Itamar Syn-Hershko
> Attachments: unnamed.patch
>
>
> See attached patch for a failing test
> Basically, it's a simple point and radius scenario that works great as long 
> as args.setDistPrecision(0.0); is called. Once the default precision is used 
> (2.5%), it doesn't work as expected.
> The distance between the 2 points in the patch is 35.75 KM. Taking into 
> account the 2.5% error the effective radius without false negatives/positives 
> should be around 34.8 KM. This test fails with a radius of 33 KM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4342) Issues with prefix tree's Distance Error Percentage

2012-08-29 Thread Itamar Syn-Hershko (JIRA)
Itamar Syn-Hershko created LUCENE-4342:
--

 Summary: Issues with prefix tree's Distance Error Percentage 
 Key: LUCENE-4342
 URL: https://issues.apache.org/jira/browse/LUCENE-4342
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/spatial
Affects Versions: 4.0-BETA, 4.0-ALPHA
Reporter: Itamar Syn-Hershko
 Attachments: unnamed.patch

See attached patch for a failing test.

Basically, it's a simple point and radius scenario that works great as long as 
args.setDistPrecision(0.0); is called. Once the default precision is used 
(2.5%), it doesn't work as expected.

The distance between the 2 points in the patch is 35.75 KM. Taking into account 
the 2.5% error, the effective radius without false negatives/positives should be 
around 34.8 KM. This test fails with a radius of 33 KM.
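
As a sanity check of the numbers above, assuming the 2.5% error budget is 
applied as a simple shrink factor on the radius (an assumption made purely for 
illustration, not a claim about how the prefix tree applies it internally):

{code}
// Illustrative arithmetic only.
public class EffectiveRadiusCheck {
  public static void main(String[] args) {
    double distanceKm = 35.75;  // distance between the two test points
    double distErrPct = 0.025;  // default 2.5% error, expressed as a fraction
    double safeRadiusKm = distanceKm * (1 - distErrPct);
    System.out.println(safeRadiusKm);  // ~34.86 KM
    // The attached test already fails at a radius of 33 KM, well inside
    // that bound, which is what this issue reports.
  }
}
{code}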

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENENET-483) Spatial Search skipping records when one location is close to origin, another one is away and radius is wider

2012-05-21 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENENET-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280179#comment-13280179
 ] 

Itamar Syn-Hershko commented on LUCENENET-483:
--

Here is a passing test: 
https://github.com/synhershko/lucene.net/commit/41e745a2aff596f3f7b0e2842a7b5fa7b45d88d3

You can grab a compiled version of Spatial4n.core and 
Lucene.Net.Contrib.Spatial.dll from 
https://github.com/synhershko/ravendb/tree/spatial/SharedLibs

> Spatial Search skipping records when one location is close to origin, another 
> one is away and radius is wider
> -
>
> Key: LUCENENET-483
> URL: https://issues.apache.org/jira/browse/LUCENENET-483
> Project: Lucene.Net
>  Issue Type: Bug
>  Components: Lucene.Net Contrib
>Affects Versions: Lucene.Net 2.9.4, Lucene.Net 2.9.4g
> Environment: .Net framework 4.0
>Reporter: Aleksandar Panov
>  Labels: lucene, spatialsearch
> Fix For: Lucene.Net 3.0.3
>
>
> Running a spatial query against two locations, where one location is close to 
> the origin (less than a mile), the other is farther away (24 miles), and the 
> radius is wide (52 miles), returns only one result. Running the query with a 
> slightly wider radius (53.8 miles) returns 2 results.
> IMPORTANT UPDATE: The problem can't be reproduced in Java using the original 
> Lucene.Spatial (2.9.4 version) library.
> {code}
> // Origin
> private double _lat = 42.350153;
> private double _lng = -71.061667;
> private const string LatField = "lat";
> private const string LngField = "lng";
> //Locations
> AddPoint(writer, "Location 1", 42.0, -71.0); //24 miles away from 
> origin
> AddPoint(writer, "Location 2", 42.35, -71.06); //less than a mile
> [TestMethod]
> public void TestAntiM()
> {
> _directory = new RAMDirectory();
> var writer = new IndexWriter(_directory, new 
> WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
> SetUpPlotter(2, 15);
> AddData(writer);
> _searcher = new IndexSearcher(_directory, true);
> //const double miles = 53.8; // Correct. Returns 2 Locations.
> const double miles = 52; // Incorrect. Returns 1 Location.
> Console.WriteLine("testAntiM");
> // create a distance query
> var dq = new DistanceQueryBuilder(_lat, _lng, miles, LatField, 
> LngField, CartesianTierPlotter.DefaltFieldPrefix, true);
> Console.WriteLine(dq);
> //create a term query to search against all documents
> Query tq = new TermQuery(new Term("metafile", "doc"));
> var dsort = new DistanceFieldComparatorSource(dq.DistanceFilter);
> Sort sort = new Sort(new SortField("foo", dsort, false));
> // Perform the search, using the term query, the distance filter, 
> and the
> // distance sort
> TopDocs hits = _searcher.Search(tq, dq.Filter, 1000, sort);
> int results = hits.TotalHits;
> ScoreDoc[] scoreDocs = hits.ScoreDocs;
> // Get a list of distances
> Dictionary<int, double> distances = dq.DistanceFilter.Distances;
> Console.WriteLine("Distance Filter filtered: " + distances.Count);
> Console.WriteLine("Results: " + results);
> Console.WriteLine("=");
> Console.WriteLine("Distances should be 2 " + distances.Count);
> Console.WriteLine("Results should be 2 " + results);
> Assert.AreEqual(2, distances.Count); // fixed a store of only 
> needed distances
> Assert.AreEqual(2, results);
> }
> {code} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (SOLR-3304) Add Solr support for the new Lucene spatial module

2012-05-19 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279609#comment-13279609
 ] 

Itamar Syn-Hershko commented on SOLR-3304:
--

Continuing the discussion on the spatial4j list: +1 for having all the tests 
with actual spatial logic reside in the Lucene spatial module, and having the 
Solr tests rely on that.

> Add Solr support for the new Lucene spatial module
> --
>
> Key: SOLR-3304
> URL: https://issues.apache.org/jira/browse/SOLR-3304
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 4.0
>Reporter: Bill Bell
>Assignee: David Smiley
>  Labels: spatial
> Attachments: SOLR-3304_Solr_fields_for_Lucene_spatial_module.patch
>
>
> Get the Solr spatial module integrated with the lucene spatial module.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[Lucene.Net] [jira] [Commented] (LUCENENET-407) Signing the assembly

2011-08-24 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENENET-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090201#comment-13090201
 ] 

Itamar Syn-Hershko commented on LUCENENET-407:
--

Hmm... I just looked around the branches and couldn't see this committed 
anywhere. Ideas?

> Signing the assembly
> 
>
> Key: LUCENENET-407
> URL: https://issues.apache.org/jira/browse/LUCENENET-407
> Project: Lucene.Net
>  Issue Type: Improvement
>  Components: Lucene.Net Core
>Affects Versions: Lucene.Net 2.9.2, Lucene.Net 2.9.4, Lucene.Net 3.x
>Reporter: Itamar Syn-Hershko
> Fix For: Lucene.Net 2.9.4, Lucene.Net 3.x
>
> Attachments: Lucene.NET.snk, signing.patch
>
>
> For our usage of Lucene.NET we need the assembly to be signed.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[Lucene.Net] [jira] [Commented] (LUCENENET-426) Mark BaseFragmentsBuilder methods as virtual

2011-06-22 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENENET-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053511#comment-13053511
 ] 

Itamar Syn-Hershko commented on LUCENENET-426:
--

Apparently that was not enough. I hit a need to override this one too:

protected Field[] GetFields(IndexReader reader, int docId, String fieldName)

Perhaps it'd make sense to make all protected methods virtual? In Java you can 
override anything that is not final, so that would be compatible with the 
original version.

> Mark BaseFragmentsBuilder methods as virtual
> 
>
> Key: LUCENENET-426
> URL: https://issues.apache.org/jira/browse/LUCENENET-426
> Project: Lucene.Net
>  Issue Type: Improvement
>  Components: Lucene.Net Contrib
>Affects Versions: Lucene.Net 2.9.2, Lucene.Net 2.9.4, Lucene.Net 3.x, 
> Lucene.Net 2.9.4g
>Reporter: Itamar Syn-Hershko
>Priority: Minor
> Fix For: Lucene.Net 2.9.4, Lucene.Net 2.9.4g
>
> Attachments: fvh.patch
>
>
> Without marking methods in BaseFragmentsBuilder as virtual, it is meaningless 
> to have FragmentsBuilder deriving from a class named "Base", since most of 
> its functionality cannot be overridden. Attached is a patch for marking the 
> important methods virtual.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (LUCENE-2215) paging collector

2011-05-12 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032359#comment-13032359
 ] 

Itamar Syn-Hershko commented on LUCENE-2215:


Thanks. I ended up using the standard Lucene paging code.

Hopefully this will get into Lucene soon...

> paging collector
> 
>
> Key: LUCENE-2215
> URL: https://issues.apache.org/jira/browse/LUCENE-2215
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.4, 3.0
>Reporter: Adam Heinz
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: IterablePaging.java, LUCENE-2215.patch, 
> PagingCollector.java, TestingPagingCollector.java
>
>
> http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898
> Somebody assign this to Aaron McCurry and we'll see if we can get enough 
> votes on this issue to convince him to upload his patch.  :)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2215) paging collector

2011-04-20 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022372#comment-13022372
 ] 

Itamar Syn-Hershko commented on LUCENE-2215:


Hi guys, any update on this?

I'm interested in using this for production code. Can anyone comment on how 
safe / mature this code is?

Thanks!

> paging collector
> 
>
> Key: LUCENE-2215
> URL: https://issues.apache.org/jira/browse/LUCENE-2215
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.4, 3.0
>Reporter: Adam Heinz
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: IterablePaging.java, LUCENE-2215.patch, 
> PagingCollector.java, TestingPagingCollector.java
>
>
> http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898
> Somebody assign this to Aaron McCurry and we'll see if we can get enough 
> votes on this issue to convince him to upload his patch.  :)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2518) Make check of BooleanClause.Occur[] in MultiFieldQueryParser.parse less stubborn

2010-06-28 Thread Itamar Syn-Hershko (JIRA)
Make check of BooleanClause.Occur[] in MultiFieldQueryParser.parse less stubborn


 Key: LUCENE-2518
 URL: https://issues.apache.org/jira/browse/LUCENE-2518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 3.0.2, 3.0.1, 3.0, 2.9.3, 2.9.2, 2.9.1, 2.9
Reporter: Itamar Syn-Hershko
Priority: Minor


Update the check in:

  public static Query parse(Version matchVersion, String query, String[] fields,
      BooleanClause.Occur[] flags, Analyzer analyzer) throws ParseException {
    if (fields.length != flags.length)
      throw new IllegalArgumentException("fields.length != flags.length");

To be:

    if (fields.length > flags.length)

So the consumer can use one Occur array and apply fields selectively, as in the 
sketch below. The only danger here is hitting a non-existent cell in flags, and 
the relaxed check guards against that just as well, without limiting usability 
for such cases.
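
A hypothetical caller illustrating that use case (the field names and analyzer 
choice are made up; only the parse() signature quoted above is taken from the 
existing code):

{code}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class SelectiveFieldsExample {
  public static Query build() throws ParseException {
    String[] selectedFields = {"title", "body"};  // subset of the schema
    BooleanClause.Occur[] flags = {               // one shared, longer array
        BooleanClause.Occur.SHOULD,
        BooleanClause.Occur.SHOULD,
        BooleanClause.Occur.MUST                  // cell unused for this call
    };
    // Today this throws IllegalArgumentException because 2 != 3; with the
    // proposed "fields.length > flags.length" check it would be accepted.
    return MultiFieldQueryParser.parse(Version.LUCENE_30, "lucene query parser",
        selectedFields, flags, new StandardAnalyzer(Version.LUCENE_30));
  }
}
{code}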

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

2010-05-17 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868183#action_12868183
 ] 

Itamar Syn-Hershko commented on LUCENE-2465:


bq. This is why i say, the only solution is to follow unicode. Adding hacks 
like this will only break other languages.

Problem is, Hebrew parsing has been broken for a long time now, and this still 
needs fixing. I don't think you should be forcing extra pre-handling for Hebrew 
or Bengali (or other) queries, just to keep CJK parsing working out of the box. 
Having the caller escape those cases is a much more complex operation than the 
normal escaping you'd do on your queries.

For languages where a colon is being used as a character, if indeed the use 
case is the same as mid-word gershayim (i.e. there's no key for that letter and 
it is more of a letter than a punctuation char), the issue with the QP is the 
same.

If the solution I initially proposed didn't cause other issues with CJK 
phrases, I'd insist on it. However, you are obviously right that this change 
would break functionality for those languages, but you are wrong in claiming it 
is not up to the query parser to resolve. As Shai has already pointed out, the 
QP should parse based on syntax with the smallest hassle to the consumer.

Obviously, a solution has to be provided, and it is agreed it should not affect 
the variety of supported languages. How about creating this functionality and 
leaving it optional? For CJK you'd leave it off, while for all other languages 
(English and European) you could turn it on and notice no difference even in 
the worst-case scenario.

Or, you could have this setting accessible from your Analyzer. Analyzers define 
the core's behavior per language, and as such it would make sense to have the 
QP consult the analyzer about which cases are a syntax error and which aren't.

> QueryParser should ignore double-quotes if mid-word
> ---
>
> Key: LUCENE-2465
> URL: https://issues.apache.org/jira/browse/LUCENE-2465
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 
> 4.0
>Reporter: Itamar Syn-Hershko
>
> Current implementation of Lucene's QueryParser identifies a phrase in the 
> query when hitting a double-quotes char, even if it is mid-word. For example, 
> the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term 
> and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase 
> is a group of words surrounded by double quotes as defined by 
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but no-where does 
> it say double-quotes will also tokenize the input. Arguably, a phrase should 
> only be identified as such when it is also surrounded by whitespaces.
> Other than a logically incorrect behavior, this makes parsing of Hebrew 
> acronyms impossible. Hebrew acronyms contain one double-quotes char in the 
> middle of a word (for example, MNK"L), hence causing the QP to throw a syntax 
> exception, since it is expecting another double-quotes to create a phrase 
> query, essentially splitting the acronym into two.
> The solution to this is pretty simple - changing the JavaCC syntax to check 
> if a whitespace precedes the double-quote when a phrase opening is expected, 
> or peek to see if a whitespace follows the double-quotes if a phrase closing 
> is expected.
> This will both eliminate a logically incorrect behavior which shouldn't be 
> relied on anyway, and allow Hebrew queries to be correctly parsed also when 
> containing acronyms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

2010-05-16 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867993#action_12867993
 ] 

Itamar Syn-Hershko commented on LUCENE-2465:


My point exactly - no one uses that character, and it will require a double 
pass on the string *always*. I have pretty much rested my case already, and it 
would have been clearer to you if you could read the language. Isn't the fact 
that Google treats those chars the same, and that Wikipedia uses just 
double-quotes, proof enough for my argument that double-quotes are allowed to 
be mid-word, that 99.9% of the time they are used that way, and that this isn't 
incorrect behavior?

For Hebrew or other multi-lingual systems this will require always preparing 
the string before calling parse(), and that is definitely unwanted behavior. 
Since the solution is *that* simple and non-breaking, I don't see why not just 
fix it - bug or not.

Any other opinions on the matter?

> QueryParser should ignore double-quotes if mid-word
> ---
>
> Key: LUCENE-2465
> URL: https://issues.apache.org/jira/browse/LUCENE-2465
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 
> 4.0
>Reporter: Itamar Syn-Hershko
>
> Current implementation of Lucene's QueryParser identifies a phrase in the 
> query when hitting a double-quotes char, even if it is mid-word. For example, 
> the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term 
> and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase 
> is a group of words surrounded by double quotes as defined by 
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but no-where does 
> it say double-quotes will also tokenize the input. Arguably, a phrase should 
> only be identified as such when it is also surrounded by whitespaces.
> Other than a logically incorrect behavior, this makes parsing of Hebrew 
> acronyms impossible. Hebrew acronyms contain one double-quotes char in the 
> middle of a word (for example, MNK"L), hence causing the QP to throw a syntax 
> exception, since it is expecting another double-quotes to create a phrase 
> query, essentially splitting the acronym into two.
> The solution to this is pretty simple - changing the JavaCC syntax to check 
> if a whitespace precedes the double-quote when a phrase opening is expected, 
> or peek to see if a whitespace follows the double-quotes if a phrase closing 
> is expected.
> This will both eliminate a logically incorrect behavior which shouldn't be 
> relied on anyway, and allow Hebrew queries to be correctly parsed also when 
> containing acronyms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

2010-05-16 Thread Itamar Syn-Hershko (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867982#action_12867982
 ] 

Itamar Syn-Hershko commented on LUCENE-2465:


Using QueryParser.escape() is not an option, since that practically prevents 
the QP from ever returning PhraseQuerys for user queries (it simply escapes 
every occurrence of every QP syntax char).

Your other suggestion of using the "correct" Unicode char GERSHAYIM is not 
doable, because we are talking about user-typed queries here, and no user has 
such a character on his keyboard. In 99.9% of Hebrew text files, old and new, a 
plain double-quote is used as GERSHAYIM. The only exceptions are when an 
automated program has converted mid-word double-quotes into U+05F4. This is 
pretty much like asking the Lucene community to type U+201C and U+201D (left / 
right double quotation marks) around phrases or they won't be recognized as 
such. Because no one has those characters easily accessible from their keyboard 
(to the best of my knowledge), and it doesn't really matter anyway what you 
type, this thought never crossed anyone's mind. Exactly the same goes for 
Hebrew.

The only doable workaround is to go through the query string before sending it 
to the QP, and resolve this by either escaping mid-word double-quotes or 
replacing them with U+05F4. Since most Hebrew dictionaries work with 
double-quotes for acronyms anyway, escaping it seems much better, but then I 
ask again - why bother with a double-pass on the query string if a simple 
change to the QP can resolve that? The effect the behavior has on non-Hebrew 
scripts is flawed anyway, and there's no reason to require such a pass for 
Hebrew consumers only (imagine what it'd be like to write a multi-lingual 
search interface with this issue in mind).
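
For reference, a rough sketch of the kind of pre-pass being described 
(a hypothetical helper, not part of Lucene; a real implementation would need 
proper handling of existing escapes and Unicode letter categories):

{code}
// Illustrative workaround only: escape a double-quote that sits between two
// letter characters (e.g. a Hebrew acronym like MNK"L) before handing the
// query string to the QueryParser, so it is not taken as a phrase delimiter.
public final class MidWordQuoteEscaper {

  private MidWordQuoteEscaper() {}

  public static String escapeMidWordQuotes(String query) {
    StringBuilder out = new StringBuilder(query.length() + 8);
    for (int i = 0; i < query.length(); i++) {
      char c = query.charAt(i);
      boolean midWord = c == '"'
          && i > 0 && Character.isLetter(query.charAt(i - 1))
          && i + 1 < query.length() && Character.isLetter(query.charAt(i + 1));
      if (midWord) {
        out.append('\\');  // backslash-escape so the QP does not open a phrase
      }
      out.append(c);
    }
    return out.toString();
  }
}
{code}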

As a reference, see how Google and Wikipedia treat Hebrew acronyms:
http://www.google.com/#hl=en&source=hp&q=%D7%9E%D7%A0%D7%9B%22%D7%9C&aq=f&aqi=&aql=&oq=&gs_rfai=&fp=d059ab474882bfe2
http://he.wikipedia.org/wiki/%D7%9E%D7%A0%D7%9B%22%D7%9C

Google recognizes both double-quotes and GERSHAYIM as correct forms of Hebrew 
acronyms, while Wikipedia only uses the former in all acronyms.

Robert, I hear what you are saying, but this just ain't right when it comes to 
usability, when the resolution is so simple and doesn't break anything.

> QueryParser should ignore double-quotes if mid-word
> ---
>
> Key: LUCENE-2465
> URL: https://issues.apache.org/jira/browse/LUCENE-2465
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 
> 4.0
>Reporter: Itamar Syn-Hershko
>
> Current implementation of Lucene's QueryParser identifies a phrase in the 
> query when hitting a double-quotes char, even if it is mid-word. For example, 
> the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term 
> and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase 
> is a group of words surrounded by double quotes as defined by 
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but no-where does 
> it say double-quotes will also tokenize the input. Arguably, a phrase should 
> only be identified as such when it is also surrounded by whitespaces.
> Other than a logically incorrect behavior, this makes parsing of Hebrew 
> acronyms impossible. Hebrew acronyms contain one double-quotes char in the 
> middle of a word (for example, MNK"L), hence causing the QP to throw a syntax 
> exception, since it is expecting another double-quotes to create a phrase 
> query, essentially splitting the acronym into two.
> The solution to this is pretty simple - changing the JavaCC syntax to check 
> if a whitespace precedes the double-quote when a phrase opening is expected, 
> or peek to see if a whitespace follows the double-quotes if a phrase closing 
> is expected.
> This will both eliminate a logically incorrect behavior which shouldn't be 
> relied on anyway, and allow Hebrew queries to be correctly parsed also when 
> containing acronyms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

2010-05-15 Thread Itamar Syn-Hershko (JIRA)
QueryParser should ignore double-quotes if mid-word
---

 Key: LUCENE-2465
 URL: https://issues.apache.org/jira/browse/LUCENE-2465
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Affects Versions: 3.0.1, 3.0, 2.9.2, 2.9.1, 2.9, 2.4.1, 2.4, 2.3.2, 2.3.1, 
2.3, 2.2, 2.1, 2.0.0, 1.9, 2.3.3, 2.4.2, 2.9.3, Flex Branch, 3.0.2, 3.1, 4.0
Reporter: Itamar Syn-Hershko


Current implementation of Lucene's QueryParser identifies a phrase in the query 
when hitting a double-quotes char, even if it is mid-word. For example, the 
string ' Foo"bar test" ' will produce a BooleanQuery, holding one term and one 
PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase is a group 
of words surrounded by double quotes as defined by 
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does 
it say double-quotes will also tokenize the input. Arguably, a phrase should 
only be identified as such when it is also surrounded by whitespaces.

Other than a logically incorrect behavior, this makes parsing of Hebrew 
acronyms impossible. Hebrew acronyms contain a double-quote char in the 
middle of a word (for example, MNK"L), hence causing the QP to throw a syntax 
exception, since it expects another double-quote to create a phrase 
query, essentially splitting the acronym into two.

The solution to this is pretty simple - changing the JavaCC grammar to check 
whether a whitespace precedes the double-quote when a phrase opening is 
expected, or to peek and see whether a whitespace follows the double-quote when 
a phrase closing is expected.

This will both eliminate a logically incorrect behavior which shouldn't be 
relied on anyway, and allow Hebrew queries to be parsed correctly even when 
they contain acronyms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org