[jira] Commented: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute

2010-04-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855699#action_12855699
 ] 

Uwe Schindler commented on SOLR-1876:
-

OK, all is fine!

> Convert all tokenstreams and tests to use CharTermAttribute
> ---
>
> Key: SOLR-1876
> URL: https://issues.apache.org/jira/browse/SOLR-1876
> Project: Solr
>  Issue Type: Task
>  Components: Schema and Analysis
>Affects Versions: 3.1
>Reporter: Robert Muir
> Fix For: 3.1
>
> Attachments: SOLR-1876.patch
>
>
> See the improvements in LUCENE-2302.
> TermAttribute has been deprecated for flexible indexing, as terms can really 
> be anything, as long as they can
> be serialized to byte[]. 
> For character-terms, a CharTermAttribute has been created, with a more 
> friendly API. Additionally this attribute
> implements the CharSequence and Appendable interfaces.
> We should convert all Solr tokenstreams to use this new attribute.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute

2010-04-10 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855640#action_12855640
 ] 

Uwe Schindler commented on SOLR-1876:
-

Looks good, I will check this in more detail later.






[jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doest have expected behaviour

2010-04-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854700#action_12854700
 ] 

Uwe Schindler commented on SOLR-1869:
-

bq. the RemoveDuplicatesTokenFilter seems broken as it initializes its map and 
attributes at the class level and not within its constructor

The filter is correct. Those are final instance fields, and javac compiles 
their initializers into the constructor anyway, so there is no need to move 
them there explicitly. In Lucene/Solr all TokenStreams are written this way; 
that is our code style for TokenStreams.

The CharArrayMap is faster for lookups, but you are right that we may need to 
handle position increments. In general the map should really just be a 
CharArraySet or HashSet, and the check should use contains().

But I don't understand the rest of the patch.
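The "use a Set and check contains()" idea could be sketched as follows. This is a hedged, simplified stand-in: the real Lucene TokenFilter API (incrementToken(), attributes) is not reproduced here, and the Token record and key scheme (term plus start offset) are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for deduplicating tokens produced by an analysis chain.
public class DedupSketch {
    // Illustrative token: term text plus character offsets.
    record Token(String term, int startOffset, int endOffset) {}

    // Drop tokens whose term already appeared at the same start offset,
    // mirroring the "use a Set and contains()" suggestion from the comment.
    static List<Token> removeDuplicates(List<Token> in) {
        Set<String> seen = new HashSet<>();
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            String key = t.term() + "@" + t.startOffset();
            if (seen.add(key)) {   // add() returns false if the key was already present
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Edge n-gram style output: several tokens sharing the same start offset.
        List<Token> tokens = List.of(
            new Token("f", 0, 3), new Token("fo", 0, 3),
            new Token("fo", 0, 3), new Token("foo", 0, 3));
        System.out.println(removeDuplicates(tokens).size()); // prints 3
    }
}
```

Keying on (term, start offset) rather than on position increment is exactly what makes this behave sensibly after an edge n-gram filter, where many distinct tokens share one position.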

> RemoveDuplicatesTokenFilter doest have expected behaviour
> -
>
> Key: SOLR-1869
> URL: https://issues.apache.org/jira/browse/SOLR-1869
> Project: Solr
>  Issue Type: New Feature
>  Components: Schema and Analysis
>Reporter: Joe Calderon
>Priority: Minor
> Attachments: SOLR-1869.patch
>
>
> the RemoveDuplicatesTokenFilter seems broken, as it initializes its map and 
> attributes at the class level and not within its constructor.
> In addition, I would think the expected behaviour would be to remove identical 
> terms with the same offset positions; instead it looks like it removes 
> duplicates based on position increment, which won't work when using it after 
> something like the edge n-gram filter. When I posted this to the mailing list, 
> even Erik Hatcher seemed to think that's what this filter was supposed to do...
> Attaching a patch that has the expected behaviour and initializes variables 
> in the constructor.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1824) partial field types created on error

2010-03-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845419#action_12845419
 ] 

Uwe Schindler commented on SOLR-1824:
-

It should be easy to fix. The init() method in the AbstractPluginLoader 
anonymous class checks for plugin != null. In the null case it should throw an 
exception so that the whole loadAnalyzer() call fails, which makes the field 
type disappear.
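The fix described above could be sketched as follows. This is a hedged illustration, not the actual Solr code: the class and method names (PluginInitException, init) are stand-ins for the anonymous AbstractPluginLoader logic.

```java
// Sketch: instead of silently skipping a null plugin, init() throws, so the
// surrounding loadAnalyzer() call fails and no partial field type is registered.
public class PluginInitSketch {
    static class PluginInitException extends RuntimeException {
        PluginInitException(String msg) { super(msg); }
    }

    // Illustrative stand-in for the init() in the anonymous plugin loader.
    static void init(Object plugin, String name) {
        if (plugin == null) {
            // Previously: silently ignored, leaving a partial field type behind.
            throw new PluginInitException("Cannot load analysis plugin: " + name);
        }
        // ... normal initialization of the plugin would follow here ...
    }

    public static void main(String[] args) {
        try {
            init(null, "solr.MisspelledFilterFactory");
        } catch (PluginInitException e) {
            System.out.println("field type rejected: " + e.getMessage());
        }
    }
}
```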

> partial field types created on error
> 
>
> Key: SOLR-1824
> URL: https://issues.apache.org/jira/browse/SOLR-1824
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Yonik Seeley
>Priority: Minor
>
> When abortOnConfigurationError=false, and there is a typo in one of the 
> filters in a chain, the field type is still created by omitting that 
> particular filter.  This is particularly dangerous since it will result in 
> incorrect indexing.




[jira] Updated: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-03-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1677:


Attachment: SOLR-1677-lucenetrunk-branch-3.patch
SOLR-1677-lucenetrunk-branch-2.patch

Just for documentation: here are the patches with the improvements to the 
version support for the Lucene-trunk upgrade branch:

- More lenient matchVersion support ("V.V")
- A default matchVersion for tests
- Removed code duplication, and added some checks for analysis plugins that 
need version support to enforce the version
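The lenient "V.V" support could look roughly like this. This is a hedged sketch: the Version enum here is a minimal stand-in for org.apache.lucene.util.Version, and the accepted spellings (constant name, bare suffix, "V.V") are an assumption based on the bullet above, not the actual patch.

```java
// Sketch of lenient matchVersion parsing: accept the enum-style constant name
// ("LUCENE_29"), a bare suffix ("29"), or a "V.V" string ("2.9").
public class VersionParseSketch {
    // Minimal stand-in for org.apache.lucene.util.Version.
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

    static Version parse(String s) {
        String name = s.toUpperCase(java.util.Locale.ROOT);
        if (name.matches("\\d\\.\\d")) {             // "V.V" form, e.g. "2.9"
            name = "LUCENE_" + name.charAt(0) + name.charAt(2);
        } else if (!name.startsWith("LUCENE_")) {    // bare suffix, e.g. "29"
            name = "LUCENE_" + name;
        }
        try {
            return Version.valueOf(name);            // Version is a plain Java 5 enum
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Invalid luceneMatchVersion: " + s, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2.9"));        // prints LUCENE_29
        System.out.println(parse("LUCENE_30"));  // prints LUCENE_30
    }
}
```

Because Version is an enum, valueOf(String) handles the canonical spelling for free; the helper only normalizes the lenient forms first.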

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> ---
>
> Key: SOLR-1677
> URL: https://issues.apache.org/jira/browse/SOLR-1677
> Project: Solr
>  Issue Type: Sub-task
>  Components: Schema and Analysis
>Reporter: Uwe Schindler
> Attachments: SOLR-1677-lucenetrunk-branch-2.patch, 
> SOLR-1677-lucenetrunk-branch-3.patch, SOLR-1677-lucenetrunk-branch.patch, 
> SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-03-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845214#action_12845214
 ] 

Uwe Schindler commented on SOLR-1677:
-

I also added support for instantiating Lucene Analyzers directly, which broke 
with the 3.0 upgrade. The new code now prefers a one-arg Version ctor and falls 
back to the no-arg one. The only thing not working at the moment is the -Aware 
stuff, as SolrResourceLoader.newInstance() was not usable.





[jira] Updated: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-03-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1677:


Attachment: SOLR-1677-lucenetrunk-branch.patch

This patch was committed to the Lucene-trunk upgrade branch. It was changed so 
that the factories are not made CoreAware.





[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798937#action_12798937
 ] 

Uwe Schindler commented on SOLR-1677:
-

{quote}
My suggestion for how to implement this would be...

# Add a new "luceneMatchVersion" attribute to the existing  tag.
# Add a new getLuceneMatchVersion() to the IndexSchema class ... SolrCore can 
use this to get the default.
# When init()ing new objects, include the key=>value pair of 
{{"luceneMatchVersion"=>schema.getLuceneMatchVersion()}} to the init method of 
the object if it's not already an init param for that particular instance.

This would eliminate the need to make any of the Analysis Factories 
SolrCoreAware (or even ResourceLoaderAware) just to know what the 
luceneMatchVersion should be -- the Base*Factories could still contain a 
{{protected Version luceneMatchVersion}} set by the base init() method that 
subclasses could use as needed.

NOTE: This still doesn't solve the "Analyzers must have no-arg 
constructors" part of the issue -- but it doesn't make it worse.  We can make 
IndexSchema pass this.getLuceneMatchVersion() to any Analyzer with a single-arg 
"Version" constructor fairly easily.  If/When we provide a more general 
mechanism for passing constructor args to Analyzers, any Version params could 
be defaulted just like with the factory init() methods.
{quote}

That was my proposal a few comments above. But I still do not want it in 
schema.xml, as Version is a global Lucene thing! The behaviour would be the 
same, though: the schema code can get the version from somewhere and pass it 
down to all schema components, as you propose.

The "Analyzers must have a no-arg ctor" problem is easy: use reflection and 
look first for a ctor taking Version; if it exists, use it and pass the version 
from the init/schema/config args, and if not, use the no-arg ctor. We have had 
this in Lucene's benchmark contrib since 3.0.
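The reflection lookup described above could be sketched like this. It is a hedged illustration, not the benchmark-contrib code: Version and the two analyzer classes are minimal stand-ins for the real Lucene types.

```java
import java.lang.reflect.Constructor;

// Sketch: prefer a constructor taking Version; fall back to the no-arg one.
public class CtorLookupSketch {
    // Minimal stand-in for org.apache.lucene.util.Version.
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Stand-in for an analyzer with the modern one-arg Version ctor.
    public static class VersionedAnalyzer {
        final Version matchVersion;
        public VersionedAnalyzer(Version v) { this.matchVersion = v; }
    }

    // Stand-in for a legacy analyzer with only a no-arg ctor.
    public static class LegacyAnalyzer {
        public LegacyAnalyzer() {}
    }

    static Object newAnalyzer(Class<?> clazz, Version v) {
        try {
            // Prefer the one-arg Version constructor if the class declares it.
            Constructor<?> c = clazz.getConstructor(Version.class);
            return c.newInstance(v);
        } catch (NoSuchMethodException e) {
            try {
                return clazz.getConstructor().newInstance();  // fall back to no-arg
            } catch (ReflectiveOperationException e2) {
                throw new RuntimeException(e2);
            }
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Object a = newAnalyzer(VersionedAnalyzer.class, Version.LUCENE_30);
        Object b = newAnalyzer(LegacyAnalyzer.class, Version.LUCENE_30);
        System.out.println(a.getClass().getSimpleName() + " " + b.getClass().getSimpleName());
    }
}
```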





[jira] Issue Comment Edited: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796872#action_12796872
 ] 

Uwe Schindler edited comment on SOLR-1677 at 1/5/10 10:29 PM:
--

In my opinion, it should be possible to set the default in solrconfig.xml, 
because there is currently no requirement to set a version for all TokenStream 
components. In the shipped solrconfig.xml this default is the version of the 
bundled Lucene, so new users can use the default config and extend it as taught 
in all the courses and books about Solr. They do not need to care about the 
version.

If they upgrade their Lucene version, their config stays on the previous 
setting and they are fine. If they want to change some of the components (like 
query parser, index writer, index reader -- flex!!!), they can do it locally. 
So Bob could change it as Ernest proposed.

If we do not have a default, all users will stay stuck with Lucene 2.4, 
because they do not care about the version (it is not required, because it 
defaults to 2.4 for backwards compatibility). So lots of configs will never use 
the new Unicode features of Lucene 3.1. And when Lucene 4.0 comes out and all 
support for Lucene < 3 is removed, all those users will cry. With a default 
version set to 2.4, they will then get a runtime error in Lucene 4.0 saying 
that Version.LUCENE_24 is no longer available as an enum constant.

If you really do not want a default version in the config (not the schema, 
because it applies to *all* Lucene components), then you should go the way of 
Lucene 3.0: require a matchVersion for all components. As there may be 
tokenstream components that do not come from Lucene, make this attribute 
mandatory in the schema only for Lucene streams. This can be done with my 
initial patch, too: if the matchVersion property is missing, matchVersion 
becomes null and the factory should throw an IllegalArgumentException if it is 
required. In my original patch, only the parsing code would need to move out of 
the factory into a util class in Solr; it could maybe also parse "x.y"-style 
versions.

The problem here: users upgrading from Solr 1.4 will suddenly get errors, 
because their configs become invalid. And because they do not know better, they 
add LUCENE_29 (how should they know that Solr 1.4 used Lucene 2.4 
compatibility?). Then the mailing list gets flooded with questions, because the 
configs suddenly fail to produce results with old indexes.





[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-01 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795746#action_12795746
 ] 

Uwe Schindler commented on SOLR-1677:
-

The problem is the default value. If you leave out the version parameter 
instance-wise, you get 2.4. Because of that, all Solr users will get stuck with 
that version and never upgrade (they leave the default and do not specify a 
different value). For backwards compatibility we are limited to this version 
number as the default value.

The schema/config-global version is the default used by all instances that do 
not specify a different value. That way we can ship the default 
solrconfig/schema.xml with the latest possible Lucene version, while upgrading 
users keep their previous default.

I repeat: with instance-wise config only, nobody will ever use it for new 
analyzers. With a global default, there is only *one* place that sets the 
version, and it also applies to user-added tokenizer chains.

For the SolrCore problem: for analyzers, the idea is that the default Version 
constant is automatically passed to all tokenizers in the param map; local 
values overwrite the key in the map. But this only applies to analyzers. Other 
usages of Version in other places (QueryParser, IndexWriter) still need 
SolrCore. But we can move the SolrCoreAware handling to the schema classes 
instead of making every TokenFilter/Tokenizer SolrCoreAware.
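The "default in the param map, local values win" mechanism could be sketched as below. This is a hedged illustration: the key name "luceneMatchVersion" follows the discussion, but the helper and its merging strategy are assumptions, not the actual Solr code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: the global default version is injected into every factory's
// init-param map, and a locally configured value takes precedence.
public class DefaultVersionSketch {
    static Map<String, String> withDefault(Map<String, String> args, String globalDefault) {
        Map<String, String> merged = new HashMap<>(args);
        // Only fill in the default when the instance did not set its own value.
        merged.putIfAbsent("luceneMatchVersion", globalDefault);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> noLocal = withDefault(new HashMap<>(), "LUCENE_29");
        Map<String, String> local =
            withDefault(new HashMap<>(Map.of("luceneMatchVersion", "LUCENE_24")), "LUCENE_29");
        System.out.println(noLocal.get("luceneMatchVersion")); // prints LUCENE_29
        System.out.println(local.get("luceneMatchVersion"));   // prints LUCENE_24
    }
}
```

With this shape, the global default lives in exactly one place, while any per-instance setting in the schema still overrides it, matching the argument above.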





[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794447#action_12794447
 ] 

Uwe Schindler commented on SOLR-1677:
-

Thanks for the hint. This means Solr instantiates an analyzer via reflection 
using the zero-arg ctor, which is no longer available, so with Lucene 3.0 it 
will not work at all. As I do not have much experience hacking on Solr, I did 
not notice this.

In my own project I have the same mechanism: I do a reflection analysis of the 
loaded class and use the ctor taking Version, or, if that is not available, the 
empty ctor.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-21 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1677:


Attachment: SOLR-1677.patch

New patch with some schema and config hacking. Also new test:

- As a first hack the solrConfig schema has a new element that contains a 
solr-wide default luceneMatchVersion value, used as the default for QueryParser 
and Analyzers if not specified differently
- On the analyzer side, BaseTokenizerFactory and BaseTokenFilterFactory now 
implement SolrCoreAware (and I also allowed these classes to be SolrCoreAware) 
and get the SolrConfig.
- Both classes now use the default if it is not set locally as a param (like in 
the last patch), but the default is the one obtained from SolrConfig
- The parser for config strings was moved to Config
- Other components like QueryParserFactories can get the default matchVersion 
in the same way
- The default is LUCENE_24 as before.

This is a first idea of how it would work. Open points:
- should the default be in SolrConfig or in IndexConfig?
- I did not change the config.xsd file to reflect my change, as this is still 
open for discussion
- all other example config files and schemas should use the default Lucene 
version shipped with the Solr release (currently 2.9). So users that upgrade 
keep the last Lucene version their index is compatible with, and new users get 
the latest config.
- If users upgrade the default luceneMatchVersion, they may have to reindex 
(esp. when upgrading to LUCENE_31 soon, because of the new Unicode features in 
all Tokenizers/Filters)
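A minimal sketch of how such a solr-wide default could look in solrconfig.xml. The element name `luceneMatchVersion` is assumed here from the property name used in the comment; the patch itself is not shown, so the exact placement and spelling may differ:

```xml
<!-- Hypothetical solrconfig.xml fragment: a solr-wide default that
     QueryParser and the analysis factories fall back to whenever no
     local luceneMatchVersion param is given on a factory. -->
<config>
  <luceneMatchVersion>2.9</luceneMatchVersion>
  <!-- ... remaining solrconfig.xml elements ... -->
</config>
```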

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> ---
>
> Key: SOLR-1677
> URL: https://issues.apache.org/jira/browse/SOLR-1677
> Project: Solr
>  Issue Type: Sub-task
>  Components: Schema and Analysis
>Reporter: Uwe Schindler
> Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-20 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1677:


Attachment: SOLR-1677.patch

Fixes a problem in one test: the English stop word set is unmodifiable, so it 
is copied.

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> ---
>
> Key: SOLR-1677
> URL: https://issues.apache.org/jira/browse/SOLR-1677
> Project: Solr
>  Issue Type: Sub-task
>  Components: Schema and Analysis
>Reporter: Uwe Schindler
> Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793047#action_12793047
 ] 

Uwe Schindler commented on SOLR-1677:
-

bq. for example, if i was setting up a new solr configuration, i would want to 
say 'give me 3.1 support for all tokenstreams by default' ?

I have no idea how to define global properties in schema.xml that apply to all 
factories. If this is possible, the LUCENE_24 else clause and the default value 
can be changed to the global default (which itself defaults to 
Version.LUCENE_24). In this case the parser map (for Lucene 2.9/Java 1.4) for 
the version enum should also move to a more central place.
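The helper map described in the issue (decode a config string, fall back to the LUCENE_24 default, and eventually delegate to `Version.valueOf` once the enum is available) can be sketched as follows. The `Version` enum and the accepted string forms here are illustrative stand-ins, not the actual Solr parser:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class VersionDecodeDemo {
    // Stand-in for org.apache.lucene.util.Version (hypothetical, for illustration)
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Helper map for Lucene 2.9/Java 1.4 style string decoding
    private static final Map<String, Version> VERSION_MAP = new HashMap<String, Version>();
    static {
        VERSION_MAP.put("2.4", Version.LUCENE_24);
        VERSION_MAP.put("2.9", Version.LUCENE_29);
        VERSION_MAP.put("3.0", Version.LUCENE_30);
    }

    /** Decode a config string; default to LUCENE_24 like the no-version ctors. */
    static Version parseVersion(String s) {
        if (s == null) {
            return Version.LUCENE_24;          // mimic the deprecated no-version ctors
        }
        Version v = VERSION_MAP.get(s);
        if (v != null) {
            return v;
        }
        // On Lucene 3.0 the whole map can be replaced by Version.valueOf(String)
        return Version.valueOf(s.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(parseVersion("2.9"));  // LUCENE_29
        System.out.println(parseVersion(null));   // LUCENE_24
    }
}
```

Moving this parser to a central place (as the comment suggests) would let QueryParser factories and analysis factories share one decode path and one global default.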

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> ---
>
> Key: SOLR-1677
> URL: https://issues.apache.org/jira/browse/SOLR-1677
> Project: Solr
>  Issue Type: Sub-task
>  Components: Schema and Analysis
>Reporter: Uwe Schindler
> Attachments: SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-20 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1677:


Attachment: SOLR-1677.patch

Better patch:
- more dynamic Version map creation
- improved warning message copied from Lucene's Javadocs on 
Version.LUCENE_CURRENT.

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> ---
>
> Key: SOLR-1677
> URL: https://issues.apache.org/jira/browse/SOLR-1677
> Project: Solr
>  Issue Type: Sub-task
>  Components: Schema and Analysis
>Reporter: Uwe Schindler
> Attachments: SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-20 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1677:


Attachment: SOLR-1677.patch

Patch.

I did not go through all factories, so maybe more need to be upgraded for 
matchVersion when switching to Lucene 3.0.

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
> BaseTokenFilterFactory
> ---
>
> Key: SOLR-1677
> URL: https://issues.apache.org/jira/browse/SOLR-1677
> Project: Solr
>  Issue Type: Sub-task
>  Components: Schema and Analysis
>Reporter: Uwe Schindler
> Attachments: SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
> compatibility with old indexes created using older versions of Lucene. The 
> most important example is StandardTokenizer, which changed its behaviour with 
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
> much more Unicode support, almost every Tokenizer/TokenFilter needs this 
> Version parameter. In 2.9, the deprecated old ctors without Version take 
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base 
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
> contains a helper map to decode the version strings, but in 3.0 it can be 
> replaced by Version.valueOf(String), as the Version is a subclass of Java5 
> enums. The default value is Version.LUCENE_24 (as this is the default for the 
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from 
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
> to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-20 Thread Uwe Schindler (JIRA)
Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
BaseTokenFilterFactory
---

 Key: SOLR-1677
 URL: https://issues.apache.org/jira/browse/SOLR-1677
 Project: Solr
  Issue Type: Sub-task
  Components: Schema and Analysis
Reporter: Uwe Schindler
 Attachments: SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
compatibility with old indexes created using older versions of Lucene. The most 
important example is StandardTokenizer, which changed its behaviour with 
posIncr and incorrect host token types in 2.4 and also in 2.9.

In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
much more Unicode support, almost every Tokenizer/TokenFilter needs this 
Version parameter. In 2.9, the deprecated old ctors without Version take 
LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.

This patch adds basic support for the Lucene Version property to the base 
factories. Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) 
/ Parameter (in 2.9) for constructing Tokenstreams. The code currently contains 
a helper map to decode the version strings, but in 3.0 it can be replaced by 
Version.valueOf(String), as the Version is a subclass of Java5 enums. The 
default value is Version.LUCENE_24 (as this is the default for the no-version 
ctors in Lucene).

This patch also removes unneeded conversions to CharArraySet from 
StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2009-12-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792998#action_12792998
 ] 

Uwe Schindler commented on SOLR-1657:
-

I would help working on this, but SOLR-1674 should be committed first. 
Otherwise I would get stuck with multiple patches.

> convert the rest of solr to use the new tokenstream API
> ---
>
> Key: SOLR-1657
> URL: https://issues.apache.org/jira/browse/SOLR-1657
> Project: Solr
>  Issue Type: Task
>Reporter: Robert Muir
>
> org.apache.solr.analysis:
> BufferedTokenStream
>  -> CommonGramsFilter
>  -> CommonGramsQueryFilter
>  -> RemoveDuplicatesTokenFilter
> CapitalizationFilterFactory
> HyphenatedWordsFilter
> LengthFilter (deprecated, remove)
> PatternTokenizerFactory (remove deprecated methods)
> SynonymFilter
> SynonymFilterFactory
> WordDelimiterFilter
> org.apache.solr.handler:
> AnalysisRequestHandler
> AnalysisRequestHandlerBase
> org.apache.solr.handler.component:
> QueryElevationComponent
> SpellCheckComponent
> org.apache.solr.highlight:
> DefaultSolrHighlighter
> org.apache.solr.search:
> FieldQParserPlugin
> org.apache.solr.spelling:
> SpellingQueryConverter

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1667) PatternTokenizer does not clearAttributes()

2009-12-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792378#action_12792378
 ] 

Uwe Schindler commented on SOLR-1667:
-

Too bad, this tokenizer was rewritten by me. How could I forget that? puh :(

> PatternTokenizer does not clearAttributes()
> ---
>
> Key: SOLR-1667
> URL: https://issues.apache.org/jira/browse/SOLR-1667
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 1.4
>Reporter: Robert Muir
> Attachments: SOLR-1667.patch
>
>
> PatternTokenizer creates tokens, but never calls clearAttributes()
> because of this things like positionIncrementGap are never reset to their 
> default value.
> trivial patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

2009-12-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791940#action_12791940
 ] 

Uwe Schindler commented on SOLR-1662:
-

+1 Looks good!

> BufferedTokenStream incorrect cloning
> -
>
> Key: SOLR-1662
> URL: https://issues.apache.org/jira/browse/SOLR-1662
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 1.4
>Reporter: Robert Muir
>Assignee: Shalin Shekhar Mangar
> Attachments: SOLR-1662.patch
>
>
> As part of writing tests for SOLR-1657, I rewrote one of the base classes 
> (BaseTokenTestCase) to use the new TokenStream API, but also with some 
> additional safety.
> {code}
>  public static String tsToString(TokenStream in) throws IOException {
> StringBuilder out = new StringBuilder();
> TermAttribute termAtt = (TermAttribute) 
> in.addAttribute(TermAttribute.class);
> // extra safety to enforce, that the state is not preserved and also
> // assign bogus values
> in.clearAttributes();
> termAtt.setTermBuffer("bogusTerm");
> while (in.incrementToken()) {
>   if (out.length() > 0)
> out.append(' ');
>   out.append(termAtt.term());
>   in.clearAttributes();
>   termAtt.setTermBuffer("bogusTerm");
> }
> in.close();
> return out.toString();
>   }
> {code}
> Setting the term text to bogus values helps find bugs in tokenstreams that do 
> not clear or clone properly. In this case there is a problem with a 
> tokenstream AB_AAB_Stream in TestBufferedTokenStream, it converts A B -> A A 
> B but does not clone, so the values get overwritten.
> This can be fixed in two ways: 
> * BufferedTokenStream does the cloning
> * subclasses are responsible for the cloning
> The question is which one should it be?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

2009-12-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791867#action_12791867
 ] 

Uwe Schindler commented on SOLR-1662:
-

bq. I think cloning should be done by sub-classes before writing. If 
BufferedTokenStream clones the token then every sub-class pays the price even 
though the use-case may just be to throw the token away.

+1, that is what I said in my first comment, too, because BufferedTokenStream 
itself never reuses the token. The problem is the test and RemoveDuplicates, 
which push the same instance twice into the queue.

> BufferedTokenStream incorrect cloning
> -
>
> Key: SOLR-1662
> URL: https://issues.apache.org/jira/browse/SOLR-1662
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 1.4
>Reporter: Robert Muir
>
> As part of writing tests for SOLR-1657, I rewrote one of the base classes 
> (BaseTokenTestCase) to use the new TokenStream API, but also with some 
> additional safety.
> {code}
>  public static String tsToString(TokenStream in) throws IOException {
> StringBuilder out = new StringBuilder();
> TermAttribute termAtt = (TermAttribute) 
> in.addAttribute(TermAttribute.class);
> // extra safety to enforce, that the state is not preserved and also
> // assign bogus values
> in.clearAttributes();
> termAtt.setTermBuffer("bogusTerm");
> while (in.incrementToken()) {
>   if (out.length() > 0)
> out.append(' ');
>   out.append(termAtt.term());
>   in.clearAttributes();
>   termAtt.setTermBuffer("bogusTerm");
> }
> in.close();
> return out.toString();
>   }
> {code}
> Setting the term text to bogus values helps find bugs in tokenstreams that do 
> not clear or clone properly. In this case there is a problem with a 
> tokenstream AB_AAB_Stream in TestBufferedTokenStream, it converts A B -> A A 
> B but does not clone, so the values get overwritten.
> This can be fixed in two ways: 
> * BufferedTokenStream does the cloning
> * subclasses are responsible for the cloning
> The question is which one should it be?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

2009-12-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791370#action_12791370
 ] 

Uwe Schindler commented on SOLR-1662:
-

Just a short description from the API side in Lucene:
Lucene's documentation of TokenStream.next() says: "The returned Token is a 
"full private copy" (not re-used across calls to next())". 
AB_AAB_Stream.process() duplicates the token by putting it uncloned into the 
outQueue. As the consumer of the BufferedTokenStream assumes that the Token is 
private, it is allowed to change it - and by that it also changes the token in 
the outQueue. If you e.g. put another TokenFilter in front of this 
AB_AAB_Stream and modify the token there, it would break.
In my opinion, the responsibility to clone is in AB_AAB_Stream; 
BufferedTokenStream will never return the same token twice by itself. So it is 
a bug in the test. But Robert told me that e.g. RemoveDuplicates has a similar 
problem.
The general contract for writing such streams is: whenever you return a Token 
from next(), never put it somewhere else uncloned, because the caller can 
change it.

The fix is to do: write((Token) t.clone());
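The aliasing bug and the `clone()` fix can be demonstrated with a minimal stand-in. The `Token` class here is a hypothetical simplification of Lucene's Token (just a mutable term string), not the real class:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CloneDemo {
    // Minimal stand-in for Lucene's Token (illustrative, not the real class)
    static class Token implements Cloneable {
        String term;
        Token(String term) { this.term = term; }
        @Override public Token clone() {
            try { return (Token) super.clone(); }
            catch (CloneNotSupportedException e) { throw new AssertionError(e); }
        }
    }

    public static void main(String[] args) {
        Deque<Token> outQueue = new ArrayDeque<Token>();

        // WRONG: buffering the same instance that is returned to the consumer.
        Token t = new Token("A");
        outQueue.add(t);                            // uncloned, like AB_AAB_Stream
        t.term = "B";                               // consumer mutates its "private copy"
        System.out.println(outQueue.peek().term);   // the buffered token was corrupted

        // RIGHT: buffer a clone, i.e. write((Token) t.clone());
        outQueue.clear();
        Token u = new Token("A");
        outQueue.add(u.clone());
        u.term = "B";
        System.out.println(outQueue.peek().term);   // buffered token is unaffected
    }
}
```

Because the consumer is entitled to mutate the returned token, the uncloned variant prints the mutated value while the cloned variant preserves the buffered one, which is exactly the contract violation described above.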

> BufferedTokenStream incorrect cloning
> -
>
> Key: SOLR-1662
> URL: https://issues.apache.org/jira/browse/SOLR-1662
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 1.4
>Reporter: Robert Muir
>
> As part of writing tests for SOLR-1657, I rewrote one of the base classes 
> (BaseTokenTestCase) to use the new TokenStream API, but also with some 
> additional safety.
> {code}
>  public static String tsToString(TokenStream in) throws IOException {
> StringBuilder out = new StringBuilder();
> TermAttribute termAtt = (TermAttribute) 
> in.addAttribute(TermAttribute.class);
> // extra safety to enforce, that the state is not preserved and also
> // assign bogus values
> in.clearAttributes();
> termAtt.setTermBuffer("bogusTerm");
> while (in.incrementToken()) {
>   if (out.length() > 0)
> out.append(' ');
>   out.append(termAtt.term());
>   in.clearAttributes();
>   termAtt.setTermBuffer("bogusTerm");
> }
> in.close();
> return out.toString();
>   }
> {code}
> Setting the term text to bogus values helps find bugs in tokenstreams that do 
> not clear or clone properly. In this case there is a problem with a 
> tokenstream AB_AAB_Stream in TestBufferedTokenStream, it converts A B -> A A 
> B but does not clone, so the values get overwritten.
> This can be fixed in two ways: 
> * BufferedTokenStream does the cloning
> * subclasses are responsible for the cloning
> The question is which one should it be?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1221) Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by default

2009-10-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761763#action_12761763
 ] 

Uwe Schindler commented on SOLR-1221:
-

I would also stay with 2.9 in Solr. Just mark the removal of the wrapper as a 
TODO item after the next lucene update.

> Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by 
> default
> --
>
> Key: SOLR-1221
> URL: https://issues.apache.org/jira/browse/SOLR-1221
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 1.4
>
> Attachments: SOLR-1221.patch, SOLR-1221.patch, SOLR-1221.patch, 
> SOLR-1221.patch, SOLR-1221.patch, SOLR-1221.patch
>
>
> To improve the out of the box experience of Solr 1.4, I really think we 
> should make this change. You will still be able to turn both off.
> Comments?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1221) Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by default

2009-10-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761759#action_12761759
 ] 

Uwe Schindler commented on SOLR-1221:
-

I have no preference...

But we fix the highlighter bug in lucene trunk, too?

> Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by 
> default
> --
>
> Key: SOLR-1221
> URL: https://issues.apache.org/jira/browse/SOLR-1221
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 1.4
>
> Attachments: SOLR-1221.patch, SOLR-1221.patch, SOLR-1221.patch, 
> SOLR-1221.patch, SOLR-1221.patch, SOLR-1221.patch
>
>
> To improve the out of the box experience of Solr 1.4, I really think we 
> should make this change. You will still be able to turn both off.
> Comments?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1221) Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by default

2009-10-01 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761366#action_12761366
 ] 

Uwe Schindler commented on SOLR-1221:
-

An even simpler workaround:
Instead of using an NRQ, wrap an NRF with ConstantScoreQuery (just change 
TrieField.getRangeQuery()). You will lose auto-rewrite if only few terms are 
affected, but for precSteps>4/6, the MTQ default would also use 
ConstantScoreQuery.

> Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by 
> default
> --
>
> Key: SOLR-1221
> URL: https://issues.apache.org/jira/browse/SOLR-1221
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 1.4
>
> Attachments: SOLR-1221.patch, SOLR-1221.patch, SOLR-1221.patch, 
> SOLR-1221.patch, SOLR-1221.patch
>
>
> To improve the out of the box experience of Solr 1.4, I really think we 
> should make this change. You will still be able to turn both off.
> Comments?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1221) Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by default

2009-10-01 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761350#action_12761350
 ] 

Uwe Schindler commented on SOLR-1221:
-

Does the highlighter rewrite before checking the query? If this is not the 
case, the simplest thing to do would be the following: just wrap it into a 
Query subclass and rewrite it to NRQ.

> Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by 
> default
> --
>
> Key: SOLR-1221
> URL: https://issues.apache.org/jira/browse/SOLR-1221
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 1.4
>
> Attachments: SOLR-1221.patch, SOLR-1221.patch, SOLR-1221.patch, 
> SOLR-1221.patch, SOLR-1221.patch
>
>
> To improve the out of the box experience of Solr 1.4, I really think we 
> should make this change. You will still be able to turn both off.
> Comments?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1221) Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by default

2009-09-27 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760114#action_12760114
 ] 

Uwe Schindler commented on SOLR-1221:
-

Does RangeQuery really need to be highlighted? It is not used anywhere in Solr 
(I removed all occurrences in the issue about TermRangeQuery), so why handle 
it?

In Lucene, the fix would only be needed for highlighting in 2.9.1, 3.0 will 
have no RangeQuery anymore.

> Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by 
> default
> --
>
> Key: SOLR-1221
> URL: https://issues.apache.org/jira/browse/SOLR-1221
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 1.4
>
> Attachments: SOLR-1221.patch, SOLR-1221.patch, SOLR-1221.patch, 
> SOLR-1221.patch
>
>
> To improve the out of the box experience of Solr 1.4, I really think we 
> should make this change. You will still be able to turn both off.
> Comments?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1221) Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by default

2009-09-27 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760111#action_12760111
 ] 

Uwe Schindler commented on SOLR-1221:
-

The deprecated ConstantScoreRangeQuery (if ever used in Solr) would also have 
the problem... (but it extends TermRangeQuery, so it should be caught before).

bq. 3. use an overridden version of NumericQuery in Solr that returns a 
placeholder term from getTerm

That would not work (the class is final and has no public ctors).

> Change Solr Highlighting to use the SpanScorer with MultiTerm expansion by 
> default
> --
>
> Key: SOLR-1221
> URL: https://issues.apache.org/jira/browse/SOLR-1221
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 1.4
>
> Attachments: SOLR-1221.patch, SOLR-1221.patch, SOLR-1221.patch, 
> SOLR-1221.patch
>
>
> To improve the out of the box experience of Solr 1.4, I really think we 
> should make this change. You will still be able to turn both off.
> Comments?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-908) Port of Nutch CommonGrams filter to Solr

2009-09-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757344#action_12757344
 ] 

Uwe Schindler commented on SOLR-908:


In my opinion, the problem is BufferedTokenStream (shouldn't it be named 
BufferedTokenFilter?). It has the linked list but does not implement reset(). 
So the problem is not this issue but the usage of reset(), because you reuse 
the token stream. As long as BufferedTokenStream is not fixed to support 
reset(), you have to create new instances.
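The fix being asked for can be sketched with a plain-Java stand-in (these are NOT the real Lucene/Solr classes; the names here are invented for illustration): a filter that buffers look-ahead tokens must clear that buffer when the stream is reset, otherwise stale tokens leak into the next document.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a token filter that buffers tokens in a list,
// the way BufferedTokenStream keeps its linked list of look-ahead tokens.
class BufferingFilter {
    private final Deque<String> buffered = new ArrayDeque<>();
    private Iterator<String> input;

    BufferingFilter(Iterator<String> input) { this.input = input; }

    String next() {
        if (!buffered.isEmpty()) return buffered.pollFirst();
        return input.hasNext() ? input.next() : null;
    }

    void pushBack(String token) { buffered.addLast(token); }

    // The point of the comment above: reset() must drop the buffered state,
    // otherwise a reused stream emits tokens from the previous document.
    void reset(Iterator<String> newInput) {
        buffered.clear();
        this.input = newInput;
    }

    public static void main(String[] args) {
        BufferingFilter f = new BufferingFilter(List.of("doc1").iterator());
        f.pushBack("stale");
        f.reset(List.of("doc2").iterator());
        System.out.println(f.next()); // "doc2", not the stale buffered token
    }
}
```

Without the `buffered.clear()` line, the reused filter would emit "stale" first, which is exactly the failure mode that forces creating new instances today.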

> Port of Nutch  CommonGrams filter to Solr
> -
>
> Key: SOLR-908
> URL: https://issues.apache.org/jira/browse/SOLR-908
> Project: Solr
>  Issue Type: Wish
>  Components: Analysis
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: CommonGramsPort.zip, SOLR-908.patch, SOLR-908.patch, 
> SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch
>
>
> Phrase queries containing common words are extremely slow.  We are reluctant 
> to just use stop words due to various problems with false hits and some 
> things becoming impossible to search with stop words turned on. (For example 
> "to be or not to be", "the who", "man in the moon" vs "man on the moon" etc.) 
>  
> Several postings regarding slow phrase queries have suggested using the 
> approach used by Nutch.  Perhaps someone with more Java/Solr experience might 
> take this on.
> It should be possible to port the Nutch CommonGrams code to Solr  and create 
> a suitable Solr FilterFactory so that it could be used in Solr by listing it 
> in the Solr schema.xml.
> "Construct n-grams for frequently occuring terms and phrases while indexing. 
> Optimize phrase queries to use the n-grams. Single terms are still indexed 
> too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html




[jira] Updated: (SOLR-1423) Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

2009-09-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1423:


Attachment: SOLR-1423-fix-empty-tokens.patch

Attached a new patch with the empty token fix.

It has an additional test for the offsets, if group != -1. It is also more 
optimized, as it uses setTermBuffer(string, offset, len) to copy the chars 
into the term buffer, which is faster than allocating a new string with 
substring().

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & 
> others
> 
>
> Key: SOLR-1423
> URL: https://issues.apache.org/jira/browse/SOLR-1423
> Project: Solr
>  Issue Type: Task
>  Components: Analysis
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Koji Sekiguchi
> Fix For: 1.4
>
> Attachments: SOLR-1423-FieldType.patch, 
> SOLR-1423-fix-empty-tokens.patch, SOLR-1423-fix-empty-tokens.patch, 
> SOLR-1423-with-empty-tokens.patch, SOLR-1423.patch, SOLR-1423.patch, 
> SOLR-1423.patch
>
>
> Because of some backwards compatibility problems (LUCENE-1906) we changed the 
> CharStream/CharFilter API a little bit. Tokenizer now only has an input field 
> of type java.io.Reader (as before the CharStream code). To correct offsets, 
> it is now needed to call the Tokenizer.correctOffset(int) method, which 
> delegates to the CharStream (if the input is a subclass of CharStream), else 
> returns an uncorrected offset. Normally it is enough to change all occurrences 
> of input.correctOffset() to this.correctOffset() in Tokenizers. It should 
> also be checked whether custom Tokenizers in Solr correct their offsets.




[jira] Commented: (SOLR-1423) Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

2009-09-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755564#action_12755564
 ] 

Uwe Schindler commented on SOLR-1423:
-

Then you could use SOLR-1423-fix-empty-tokens.patch; it should work. The 
comparison with String.split() in one of the tests was commented out, because 
it does not work with the tokenizer (as empty tokens are not returned).

I only wanted to check that the offsets are calculated correctly. The second 
test does this, but I want to be sure that they are correct for both group=-1 
and group>=0.

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & 
> others
> 
>
> Key: SOLR-1423
> URL: https://issues.apache.org/jira/browse/SOLR-1423
> Project: Solr
>  Issue Type: Task
>  Components: Analysis
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Koji Sekiguchi
> Fix For: 1.4
>
> Attachments: SOLR-1423-FieldType.patch, 
> SOLR-1423-fix-empty-tokens.patch, SOLR-1423-with-empty-tokens.patch, 
> SOLR-1423.patch, SOLR-1423.patch, SOLR-1423.patch
>
>




[jira] Updated: (SOLR-1423) Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

2009-09-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1423:


Attachment: SOLR-1423-fix-empty-tokens.patch

This is a patch that fixes the empty tokens: this Tokenizer is not backwards 
compatible, as it only returns non-zero-length tokens. Maybe we should have a 
switch somewhere to change this behaviour. It is currently for discussion only.

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & 
> others
> 
>
> Key: SOLR-1423
> URL: https://issues.apache.org/jira/browse/SOLR-1423
> Project: Solr
>  Issue Type: Task
>  Components: Analysis
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Koji Sekiguchi
> Fix For: 1.4
>
> Attachments: SOLR-1423-FieldType.patch, 
> SOLR-1423-fix-empty-tokens.patch, SOLR-1423-with-empty-tokens.patch, 
> SOLR-1423.patch, SOLR-1423.patch, SOLR-1423.patch
>
>




[jira] Updated: (SOLR-1423) Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

2009-09-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1423:


Attachment: SOLR-1423-with-empty-tokens.patch

Some refactoring (I moved the PatternTokenizer to its own class, like 
PatternReplaceFilter). This patch is functionally identical to current trunk, 
but more efficient; it uses the new TokenStream API and implements end() 
(which sets the offset to the end of the string).

I will soon post a patch which removes empty tokens.

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & 
> others
> 
>
> Key: SOLR-1423
> URL: https://issues.apache.org/jira/browse/SOLR-1423
> Project: Solr
>  Issue Type: Task
>  Components: Analysis
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Koji Sekiguchi
> Fix For: 1.4
>
> Attachments: SOLR-1423-FieldType.patch, 
> SOLR-1423-with-empty-tokens.patch, SOLR-1423.patch, SOLR-1423.patch, 
> SOLR-1423.patch
>
>




[jira] Commented: (SOLR-1423) Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

2009-09-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755410#action_12755410
 ] 

Uwe Schindler commented on SOLR-1423:
-

I forgot: I would deprecate the unneeded methods in the factory!

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & 
> others
> 
>
> Key: SOLR-1423
> URL: https://issues.apache.org/jira/browse/SOLR-1423
> Project: Solr
>  Issue Type: Task
>  Components: Analysis
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Koji Sekiguchi
> Fix For: 1.4
>
> Attachments: SOLR-1423-FieldType.patch, SOLR-1423.patch, 
> SOLR-1423.patch, SOLR-1423.patch
>
>




[jira] Commented: (SOLR-1423) Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

2009-09-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755409#action_12755409
 ] 

Uwe Schindler commented on SOLR-1423:
-

bq. I think the empty tokens is a bug and should be omitted in this patch.

The javadocs say that it works like String.split(), which returns empty tokens 
but strips empty tokens at the end of the string. This is the functionality 
Solr provided before and provides with this patch.
The code would get simpler if the Tokenizer would generally strip empty 
tokens, but that is a backwards break. I would tend to just commit and then 
open another issue.
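The String.split() behaviour referenced above is easy to demonstrate with the JDK alone: interior empty tokens survive, trailing ones are dropped.

```java
import java.util.Arrays;

// String.split() keeps interior empty tokens but strips empty tokens at the
// end of the string -- the exact contract the javadocs above refer to.
class SplitDemo {
    public static void main(String[] args) {
        String[] parts = "a,,b,,".split(",");
        // The interior "" between "a" and "b" is kept;
        // the two trailing "" are removed.
        System.out.println(Arrays.toString(parts)); // [a, , b]
    }
}
```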

bq. Very nice! Can you open a separate ticket?

Will open one about Lucene's BaseTokenStreamTestCase 

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & 
> others
> 
>
> Key: SOLR-1423
> URL: https://issues.apache.org/jira/browse/SOLR-1423
> Project: Solr
>  Issue Type: Task
>  Components: Analysis
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Koji Sekiguchi
> Fix For: 1.4
>
> Attachments: SOLR-1423-FieldType.patch, SOLR-1423.patch, 
> SOLR-1423.patch, SOLR-1423.patch
>
>




[jira] Updated: (SOLR-1423) Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

2009-09-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1423:


Attachment: SOLR-1423.patch

This is a complete, more efficient rewrite of the whole Tokenizer (I would 
like to put this into Lucene contrib, too!) using the new TokenStream API.

When going through the code, I realized the following: this Tokenizer can 
return empty tokens; it only filters empty tokens in split() mode. Is this 
expected? If empty tokens should be omitted, the if (matcher.find()) should be 
replaced by while (matcher.find()) with if (matcher.group().length() == 0) 
continue; - the logic behind the strange "omit empty tokens at the end" 
behaviour gets very simple after this change.

This patch removes the whole split()/group() methods from the factory, as 
they are no longer needed. If this is a backwards break, replace them by 
unused dummies (e.g. initialize a Tokenizer and return the tokens' term text).

In my opinion, one should never index empty tokens...

A second thing: Lucene has a new BaseTokenStreamTestCase class for checking 
tokens without Token instances (which would no longer work when Lucene 3.0 
switches to Attributes only). Maybe you should update these tests and use 
assertAnalyzesTo from the new base class instead.
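The suggested find-loop change can be sketched with plain java.util.regex (class and method names here are invented; this is not the Solr PatternTokenizer itself): iterate all matches with while (matcher.find()) and skip zero-length ones, so no empty tokens are produced in group mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the proposed rewrite: loop over all matches and skip empty ones.
class GroupTokenizerSketch {
    static List<String> tokens(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            // Skip zero-length matches instead of emitting empty tokens.
            if (matcher.end() - matcher.start() == 0) continue;
            out.add(matcher.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "a*" also matches the empty string between characters;
        // those hits are filtered out by the length check above.
        System.out.println(tokens(Pattern.compile("a*"), "baab"));
    }
}
```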

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & 
> others
> 
>
> Key: SOLR-1423
> URL: https://issues.apache.org/jira/browse/SOLR-1423
> Project: Solr
>  Issue Type: Task
>  Components: Analysis
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Koji Sekiguchi
> Fix For: 1.4
>
> Attachments: SOLR-1423-FieldType.patch, SOLR-1423.patch, 
> SOLR-1423.patch
>
>




[jira] Updated: (SOLR-1423) Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

2009-09-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1423:


Attachment: SOLR-1423-FieldType.patch

I searched for setOffset() in the Solr source code and found one additional 
occurrence of it without offset correction, in FieldType.java. This patch 
fixes that.

I will soon provide a completely streaming PatternTokenizer that does not use 
ArrayLists. It will move the while (matcher.find()) loop into the 
incrementToken() enumeration and will also use the new TokenStream API.

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & 
> others
> 
>
> Key: SOLR-1423
> URL: https://issues.apache.org/jira/browse/SOLR-1423
> Project: Solr
>  Issue Type: Task
>  Components: Analysis
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Koji Sekiguchi
> Fix For: 1.4
>
> Attachments: SOLR-1423-FieldType.patch, SOLR-1423.patch
>
>




[jira] Commented: (SOLR-1423) Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

2009-09-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754836#action_12754836
 ] 

Uwe Schindler commented on SOLR-1423:
-

I have seen this in PatternTokenizer, too. The method is protected, as 
correctOffset() should only be called from inside the Tokenizer (e.g. in 
incrementToken()), never from the outside. Why does PatternTokenizer not have 
methods like newToken() in its own class (by the way: this should be updated 
to the new TokenStream API)?

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & 
> others
> 
>
> Key: SOLR-1423
> URL: https://issues.apache.org/jira/browse/SOLR-1423
> Project: Solr
>  Issue Type: Task
>  Components: Analysis
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Koji Sekiguchi
> Fix For: 1.4
>
> Attachments: SOLR-1423.patch
>
>




[jira] Commented: (SOLR-1404) Random failures with highlighting

2009-09-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754162#action_12754162
 ] 

Uwe Schindler commented on SOLR-1404:
-

bq. Will LUCENE-1906 fix it (in an alternate way)?

It should fix it. Lucene Tokenizers no longer have separate methods for 
CharStream; they are simply handled as Readers. The trap of overriding the 
wrong method should be fixed now. The offset correction is now done 
conditionally, if the Reader is a CharStream subclass.

> Random failures with highlighting
> -
>
> Key: SOLR-1404
> URL: https://issues.apache.org/jira/browse/SOLR-1404
> Project: Solr
>  Issue Type: Bug
>  Components: Analysis, highlighter
>Affects Versions: 1.4
>Reporter: Anders Melchiorsen
> Fix For: 1.4
>
> Attachments: SOLR-1404.patch
>
>
> With a recent Solr nightly, we started getting errors when highlighting.
> I have not been able to reduce our real setup to a minimal one that is 
> failing, but the same error seems to pop up with the configuration below. 
> Note that the QUERY will mostly fail, but it will work sometimes. Notably, 
> after running "java -jar start.jar", the QUERY will work the first time, but 
> then start failing for a while. Seems that something is not being reset 
> properly.
> The example uses the deprecated HTMLStripWhitespaceTokenizerFactory but the 
> problem apparently also exists with other tokenizers; I was just unable to 
> create a minimal example with other configurations.
> SCHEMA
> 
> 
>   
> 
> 
>   
> 
>   
> 
>  
>  
>
>
>  
>  id
> 
> INDEX
> URL=http://localhost:8983/solr/update
> curl $URL --data-binary '<add><doc><field name="id">1</field><field 
> name="test">test</field></doc></add>' -H 'Content-type:text/xml; charset=utf-8'
> curl $URL --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
> QUERY
> curl 'http://localhost:8983/solr/select/?hl.fl=test&hl=true&q=id:1'
> ERROR
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token test 
> exceeds length of provided text sized 4
> org.apache.solr.common.SolrException: 
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token test 
> exceeds length of provided text sized 4
>   at 
> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:328)
>   at 
> org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
>   at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>   at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>   at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>   at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>   at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>   at 
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>   at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>   at org.mortbay.jetty.Server.handle(Server.java:285)
>   at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>   at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>   at 
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>   at 
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: 
> Token test exceeds length of provided text sized 4
>   at 
> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:254)
>   at 
> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:321)
>   ... 23 more


[jira] Created: (SOLR-1423) Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others

2009-09-11 Thread Uwe Schindler (JIRA)
Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others


 Key: SOLR-1423
 URL: https://issues.apache.org/jira/browse/SOLR-1423
 Project: Solr
  Issue Type: Task
  Components: Analysis
Reporter: Uwe Schindler


Because of some backwards compatibility problems (LUCENE-1906) we changed the 
CharStream/CharFilter API a little bit. Tokenizer now only has an input field 
of type java.io.Reader (as before the CharStream code). To correct offsets, it 
is now needed to call the Tokenizer.correctOffset(int) method, which delegates 
to the CharStream (if the input is a subclass of CharStream), else returns an 
uncorrected offset. Normally it is enough to change all occurrences of 
input.correctOffset() to this.correctOffset() in Tokenizers. It should also be 
checked whether custom Tokenizers in Solr correct their offsets.
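The delegation pattern described above can be sketched in plain Java (these are NOT the real Lucene classes; the "Sketch" names are invented): correctOffset() forwards to the CharStream only when the input Reader actually is one, and otherwise returns the offset unchanged.

```java
import java.io.Reader;
import java.io.StringReader;

// Stand-in for Lucene's CharStream: a Reader that knows how to map offsets
// in the filtered text back to offsets in the original input.
abstract class CharStreamSketch extends StringReader {
    CharStreamSketch(String s) { super(s); }
    abstract int correctOffset(int currentOff);
}

// Stand-in for a Tokenizer after the LUCENE-1906 change: the input field is
// a plain java.io.Reader, and offset correction is conditional.
class TokenizerSketch {
    private final Reader input;

    TokenizerSketch(Reader input) { this.input = input; }

    // Tokenizer code should call this.correctOffset(off),
    // never input.correctOffset(off) as before.
    final int correctOffset(int currentOff) {
        return (input instanceof CharStreamSketch)
                ? ((CharStreamSketch) input).correctOffset(currentOff)
                : currentOff;
    }

    public static void main(String[] args) {
        TokenizerSketch plain = new TokenizerSketch(new StringReader("abc"));
        System.out.println(plain.correctOffset(2)); // 2: plain Reader, no correction

        TokenizerSketch filtered = new TokenizerSketch(new CharStreamSketch("abc") {
            int correctOffset(int off) { return off + 4; } // e.g. stripped markup
        });
        System.out.println(filtered.correctOffset(2)); // 6: corrected by the CharStream
    }
}
```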




[jira] Commented: (SOLR-1334) SortableXXXField could use native FieldCache for sorting

2009-08-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739371#action_12739371
 ] 

Uwe Schindler commented on SOLR-1334:
-

Ok, so it should stay as it is.

The problem with NULL values in the FieldCache is a pain; I had this problem 
also in FieldCacheRangeFilter. Maybe in the complete-overhaul task there 
should be some OpenBitSet/DocIdSet in parallel to the native arrays that marks 
all valid values. E.g. it could be handled like a normal cache for a specific 
field and could be retrieved by FieldCache.getValidValues() or something like 
that. The bitset is built in parallel to the uninversion. If the field name is 
the same, the valid values are also the same (not related to the data type).

What do you think?
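The proposal can be sketched with java.util.BitSet standing in for OpenBitSet (all names here are invented; this is not FieldCache code): alongside the native values array, a bitset filled during uninversion marks which documents actually had a value, so callers can tell "0" apart from "no value".

```java
import java.util.BitSet;

// Hedged sketch of the proposal: a native array plus a parallel bitset of
// valid documents, built together during uninversion of a field.
class UninvertedIntField {
    final int[] values;
    final BitSet valid; // what a FieldCache.getValidValues() would hand out

    UninvertedIntField(int maxDoc) {
        values = new int[maxDoc];
        valid = new BitSet(maxDoc);
    }

    // Called once per document that has a value, while uninverting.
    void set(int doc, int value) {
        values[doc] = value;
        valid.set(doc);
    }

    boolean hasValue(int doc) { return valid.get(doc); }

    public static void main(String[] args) {
        UninvertedIntField field = new UninvertedIntField(3);
        field.set(1, 42);
        // Doc 0 has values[0] == 0 but no value; doc 1 really has 42.
        System.out.println(field.hasValue(0) + " " + field.hasValue(1)); // false true
    }
}
```

Because the bitset depends only on which documents have the field, it can be cached per field name and shared across data types, as the comment suggests.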

> SortableXXXField could use native FieldCache for sorting
> 
>
> Key: SOLR-1334
> URL: https://issues.apache.org/jira/browse/SOLR-1334
> Project: Solr
>  Issue Type: Improvement
>Reporter: Uwe Schindler
>
> When looking through the FieldTypes (esp. new Trie code), I found out that 
> field types using org.apache.solr.util.NumberUtils use String sorting. As 
> SortField can get a FieldCache Parser since LUCENE-1478, NumberUtils could 
> supply FieldCache.Parser singletons (serializable singletons!) for the 
> SortableXXXField types, and the SortField instances could use these parsers 
> instead of STRING only SortFields.
> The same parsers could be used to create ValueSources for these types.




[jira] Created: (SOLR-1334) SortableXXXField could use native FieldCache for sorting

2009-08-04 Thread Uwe Schindler (JIRA)
SortableXXXField could use native FieldCache for sorting


 Key: SOLR-1334
 URL: https://issues.apache.org/jira/browse/SOLR-1334
 Project: Solr
  Issue Type: Improvement
Reporter: Uwe Schindler


When looking through the FieldTypes (esp. new Trie code), I found out that 
field types using org.apache.solr.util.NumberUtils use String sorting. As 
SortField can get a FieldCache Parser since LUCENE-1478, NumberUtils could 
supply FieldCache.Parser singletons (serializable singletons!) for the 
SortableXXXField types, and the SortField instances could use these parsers 
instead of STRING only SortFields.

The same parsers could be used to create ValueSources for these types.
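The idea can be sketched in plain Java (the sign-bit flip below is an assumption about how a sortable string encoding works; Solr's NumberUtils may differ in detail, and the class name is invented): a FieldCache-style parser decodes each indexed term back to its native long, so sorting compares longs instead of the encoded strings.

```java
// Hedged sketch: a sortable string encoding for longs and the "parser" that
// a SortField/FieldCache could use to decode terms back to native values.
class SortableLongSketch {
    // XOR the sign bit so that unsigned byte order equals numeric order.
    static String encode(long v) {
        return String.format("%016x", v ^ 0x8000000000000000L);
    }

    // The parser side: term text -> native long, usable for native sorting.
    static long decode(String s) {
        return Long.parseUnsignedLong(s, 16) ^ 0x8000000000000000L;
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(-5L))); // -5: round-trips exactly
        // Encoded strings sort in the same order as the numbers themselves.
        System.out.println(encode(-1L).compareTo(encode(1L)) < 0); // true
    }
}
```

With a parser like `decode` plugged into the SortField, the cache holds longs rather than strings, which is both smaller and faster to compare.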




[jira] Commented: (SOLR-1322) range queries won't work for trie fields with precisionStep=0

2009-08-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739069#action_12739069
 ] 

Uwe Schindler commented on SOLR-1322:
-

I added a test to Lucene Core that verifies that multi-valued terms work 
correctly: Revision #800896


> range queries won't work for trie fields with precisionStep=0
> -
>
> Key: SOLR-1322
> URL: https://issues.apache.org/jira/browse/SOLR-1322
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 1.4
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Fix For: 1.4
>
>
> range queries won't work for trie fields with precisionStep=0... a normal 
> range query should be used in this case.




[jira] Commented: (SOLR-1322) range queries won't work for trie fields with precisionStep=0

2009-08-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738993#action_12738993
 ] 

Uwe Schindler commented on SOLR-1322:
-

Correct, but it is good to think about it multiple times. I always fall into 
the same trap when thinking about it, but as soon as I have a picture of the 
indexed terms it gets clear again. I think I should write a test for this 
special case inside TestNumericRangeQueryXX (index multiple values and run 
some ranges with precStep=inf and a real precStep on the same field and 
compare results).

> range queries won't work for trie fields with precisionStep=0
> -
>
> Key: SOLR-1322
> URL: https://issues.apache.org/jira/browse/SOLR-1322
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 1.4
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Fix For: 1.4
>
>




[jira] Commented: (SOLR-1322) range queries won't work for trie fields with precisionStep=0

2009-08-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738986#action_12738986
 ] 

Uwe Schindler commented on SOLR-1322:
-

bq. If trie fields are indexed in parts, NumericRangeQuery will produce invalid 
results for multiValued fields... that's a limitation of the trie encoding (not 
easily fixable at all). 

I do not really understand the problem with multiValued fields. I have had 
trie fields in my index in the past that had multiple trie values, and numeric 
range queries worked with them. What is the problem? You should be able to add 
more than one value using separate Field instances.

A NumericRangeQuery on a multiValued field should return all documents for 
which *one* of the indexed values falls into the range. Correct me if I am 
wrong!
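The expected multi-valued semantics stated above can be illustrated with stdlib code (this is not Lucene code; the class and method names are invented): a document matches the range as soon as one of its values does.

```java
import java.util.List;

// Illustration of multi-valued range semantics: a document with several
// indexed values matches if ANY of those values falls into [lower, upper].
class MultiValuedRangeDemo {
    static boolean matches(List<Long> docValues, long lower, long upper) {
        return docValues.stream().anyMatch(v -> v >= lower && v <= upper);
    }

    public static void main(String[] args) {
        List<Long> doc = List.of(5L, 100L); // two values indexed for one doc
        System.out.println(matches(doc, 90, 110)); // true: 100 is in range
        System.out.println(matches(doc, 10, 50));  // false: neither value is
    }
}
```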

> range queries won't work for trie fields with precisionStep=0
> -
>
> Key: SOLR-1322
> URL: https://issues.apache.org/jira/browse/SOLR-1322
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 1.4
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Fix For: 1.4
>
>




[jira] Commented: (SOLR-1261) Lucene trunk renamed RangeQuery & Co to TermRangeQuery

2009-07-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731614#action_12731614
 ] 

Uwe Schindler commented on SOLR-1261:
-

The only problem with the above code is date trie queries, as they would be 
printed as long values (effectively toExternal(Long.toString())). :(

> Lucene trunk renamed RangeQuery & Co to TermRangeQuery
> --
>
> Key: SOLR-1261
> URL: https://issues.apache.org/jira/browse/SOLR-1261
> Project: Solr
>  Issue Type: Task
>  Components: search
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-1261.patch
>
>
> I committed LUCENE-1713 a short while ago, which renamed RangeQuery to 
> TermRangeQuery (and RangeFilter to TermRangeFilter). The API of the old, 
> deprecated RangeQuery and RangeFilter classes was reverted to the state of 
> Lucene 2.4; only the new classes contain the improvements of 2.9. So Solr 
> no longer compiles, because the new RangeQuery ctors and 
> setConstantScoreRewrite are no longer available but were already used in 
> Solr.
> This can be solved by simply replacing RangeQuery with TermRangeQuery in 
> the source.
> There were some minor cleanups of the API, because the new class must not 
> carry strange method names for the sake of backwards compatibility. Also, 
> all ctors taking Term are only available in the deprecated classes.




[jira] Commented: (SOLR-1261) Lucene trunk renamed RangeQuery & Co to TermRangeQuery

2009-07-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731609#action_12731609
 ] 

Uwe Schindler commented on SOLR-1261:
-

{quote}
bq. In my opinion, QueryParsing.java should now also be able to create a string 
representation of NumericRangeQueries; I did this, too (related to SOLR-940).

The only problem is that Solr always prints out the value given by 
FieldType#toExternal which TermRangeQuery#toString wouldn't know about. So I 
guess we should leave it as is.
{quote}

You misunderstood me. I did not change anything in TermRangeQuery; the code is 
identical to before (only ConstantScoreRangeQuery and RangeQuery were replaced 
by TermRangeQuery).

What I have done as a new contribution in this patch is extend 
QueryParsing.java to print the correct representation of numeric range queries 
(also using toExternal and so on):

{code}
if (query instanceof NumericRangeQuery) {
  NumericRangeQuery q = (NumericRangeQuery)query;
  String fname = q.getField();
  FieldType ft = writeFieldName(fname, schema, out, flags);
  // '[' / ']' mean inclusive bounds, '{' / '}' exclusive ones
  out.append( q.includesMin() ? '[' : '{' );
  Number lt = q.getMin();
  Number ut = q.getMax();
  if (lt==null) {
    out.append('*');  // open-ended lower bound
  } else {
    writeFieldVal(lt.toString(), ft, out, flags);
  }

  out.append(" TO ");

  if (ut==null) {
    out.append('*');  // open-ended upper bound
  } else {
    writeFieldVal(ut.toString(), ft, out, flags);
  }

  out.append( q.includesMax() ? ']' : '}' );
}
{code}

> Lucene trunk renamed RangeQuery & Co to TermRangeQuery
> --
>
> Key: SOLR-1261
> URL: https://issues.apache.org/jira/browse/SOLR-1261
> Project: Solr
>  Issue Type: Task
>  Components: search
>Affects Versions: 1.4
>Reporter: Uwe Schindler
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-1261.patch
>
>
> I committed LUCENE-1713 a short while ago, which renamed RangeQuery to 
> TermRangeQuery (and RangeFilter to TermRangeFilter). The API of the old, 
> deprecated RangeQuery and RangeFilter classes was reverted to the state of 
> Lucene 2.4; only the new classes contain the improvements of 2.9. So Solr 
> no longer compiles, because the new RangeQuery ctors and 
> setConstantScoreRewrite are no longer available but were already used in 
> Solr.
> This can be solved by simply replacing RangeQuery with TermRangeQuery in 
> the source.
> There were some minor cleanups of the API, because the new class must not 
> carry strange method names for the sake of backwards compatibility. Also, 
> all ctors taking Term are only available in the deprecated classes.




[jira] Commented: (SOLR-940) TrieRange support

2009-07-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731567#action_12731567
 ] 

Uwe Schindler commented on SOLR-940:


Patch looks good!

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-1261-1241.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1701-addition.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-LUCENE-1701.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, SOLR-940-test.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-07-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731396#action_12731396
 ] 

Uwe Schindler commented on SOLR-940:


I think your problem is solved now (thanks Mike).

If you update to the latest trunk, you must also apply SOLR-1261 (the rename of 
RangeQuery to TermRangeQuery).

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701-addition.patch, SOLR-940-LUCENE-1701.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Updated: (SOLR-940) TrieRange support

2009-07-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-940:
---

Attachment: (was: SOLR-940-LUCENE-1701-addition.patch)

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701-addition.patch, SOLR-940-LUCENE-1701.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Updated: (SOLR-940) TrieRange support

2009-07-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-940:
---

Attachment: SOLR-940-LUCENE-1701-addition.patch

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701-addition.patch, SOLR-940-LUCENE-1701.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Updated: (SOLR-940) TrieRange support

2009-07-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-940:
---

Attachment: (was: SOLR-940-LUCENE-1701-addition.patch)

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701-addition.patch, SOLR-940-LUCENE-1701.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Updated: (SOLR-940) TrieRange support

2009-07-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-940:
---

Attachment: SOLR-940-LUCENE-1701-addition.patch

Same patch, updated to use the new TrieRange feature of allowing an arbitrarily 
large precStep to index only one token (it now uses Integer.MAX_VALUE as the 
precStep for the query tokenizer).
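
Why a huge precStep yields a single token follows from the trie splitting rule: one prefix-coded token is emitted per shift 0, p, 2p, ... strictly below the value's bit width. A rough sketch of that counting rule (hypothetical helper, not the actual NumericTokenStream code):

```java
public class TrieTokenCount {
    // One prefix-coded token per shift 0, p, 2p, ... strictly below the
    // value's bit width, so any precStep >= the width yields exactly one
    // full-precision token. Sketch only, not NumericTokenStream itself.
    static int tokenCount(int bits, long precisionStep) {
        int count = 0;
        for (long shift = 0; shift < bits; shift += precisionStep) {
            count++;
        }
        return count;
    }
}
```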

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701-addition.patch, SOLR-940-LUCENE-1701-addition.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-LUCENE-1701.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, SOLR-940-test.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Updated: (SOLR-940) TrieRange support

2009-07-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-940:
---

Attachment: SOLR-940-LUCENE-1701-addition.patch

Hi Shalin,

here is an additional patch (only for the trie parts) that is more intelligent 
and also uses NumericTokenStream for the query-time factory. Apply your 
previous patch first, then revert the changes in 
analysis.TrieXxxxTokenizerFactory and TrieField, and then apply this patch, 
which removes the old factories and creates a single new TrieTokenizerFactory. 
It should compile, but it is not really tested (it is hard to apply all your 
changes). If there are compile errors, they can be fixed easily :-)

The idea is to use the same token stream for query-time analysis. To produce 
only the highest-precision token needed for that, it simply uses a 
precisionStep of 32 for int/float and 64 for long/double/date in place of the 
former TrieIndexTokenizerFactory. No magic with KeywordTokenizer is needed, and 
NumericUtils, an expert-level Lucene class (not really public), is not needed 
anymore.
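
The query-time trick can be sketched as follows: with precisionStep equal to the full bit width, the splitting loop runs exactly once and only the shift-0 (highest precision) token remains. The class and method names below are illustrative assumptions, not the Solr factory code:

```java
import java.util.ArrayList;
import java.util.List;

public class QueryTimeSplit {
    // Emits the (shift, prefix) pairs that trie indexing would produce.
    // With precisionStep equal to the bit width, the loop runs once and
    // only the shift-0 (full precision) token remains -- which is all
    // that query-time analysis needs.
    static List<long[]> split(long value, int bits, int precisionStep) {
        List<long[]> tokens = new ArrayList<>();
        for (int shift = 0; shift < bits; shift += precisionStep) {
            tokens.add(new long[] { shift, value >>> shift });
        }
        return tokens;
    }
}
```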

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701-addition.patch, SOLR-940-LUCENE-1701.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Updated: (SOLR-1261) Lucene trunk renamed RangeQuery & Co to TermRangeQuery

2009-07-04 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1261:


Attachment: SOLR-1261.patch

Attached is a patch that does this (untested, but it should work).

In my opinion, QueryParsing.java should now also be able to create a string 
representation of NumericRangeQueries; I did this, too (related to SOLR-940).

> Lucene trunk renamed RangeQuery & Co to TermRangeQuery
> --
>
> Key: SOLR-1261
> URL: https://issues.apache.org/jira/browse/SOLR-1261
> Project: Solr
>  Issue Type: Task
>  Components: search
>Affects Versions: 1.4
>Reporter: Uwe Schindler
> Attachments: SOLR-1261.patch
>
>
> I committed LUCENE-1713 a short while ago, which renamed RangeQuery to 
> TermRangeQuery (and RangeFilter to TermRangeFilter). The API of the old, 
> deprecated RangeQuery and RangeFilter classes was reverted to the state of 
> Lucene 2.4; only the new classes contain the improvements of 2.9. So Solr 
> no longer compiles, because the new RangeQuery ctors and 
> setConstantScoreRewrite are no longer available but were already used in 
> Solr.
> This can be solved by simply replacing RangeQuery with TermRangeQuery in 
> the source.
> There were some minor cleanups of the API, because the new class must not 
> carry strange method names for the sake of backwards compatibility. Also, 
> all ctors taking Term are only available in the deprecated classes.




[jira] Created: (SOLR-1261) Lucene trunk renamed RangeQuery & Co to TermRangeQuery

2009-07-04 Thread Uwe Schindler (JIRA)
Lucene trunk renamed RangeQuery & Co to TermRangeQuery
--

 Key: SOLR-1261
 URL: https://issues.apache.org/jira/browse/SOLR-1261
 Project: Solr
  Issue Type: Task
  Components: search
Affects Versions: 1.4
Reporter: Uwe Schindler


I committed LUCENE-1713 a short while ago, which renamed RangeQuery to 
TermRangeQuery (and RangeFilter to TermRangeFilter). The API of the old, 
deprecated RangeQuery and RangeFilter classes was reverted to the state of 
Lucene 2.4; only the new classes contain the improvements of 2.9. So Solr no 
longer compiles, because the new RangeQuery ctors and setConstantScoreRewrite 
are no longer available but were already used in Solr.

This can be solved by simply replacing RangeQuery with TermRangeQuery in the 
source.

There were some minor cleanups of the API, because the new class must not carry 
strange method names for the sake of backwards compatibility. Also, all ctors 
taking Term are only available in the deprecated classes.




[jira] Commented: (SOLR-940) TrieRange support

2009-06-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725605#action_12725605
 ] 

Uwe Schindler commented on SOLR-940:


Yes, it now works with RangeQuery/Filter (as before), NumericRangeQuery/Filter, 
and FieldCacheRangeFilter.

I will fix the strange usage of Term instances when we deprecate the old 
RangeQuery in favour of TermRangeQuery & Co. (LUCENE-1713).
The current check in RangeQuery only prevents you from creating a RangeQuery 
from Term instances (instead of field, string, string) where both are null 
(because with both terms null, no field name is available).

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-06-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725600#action_12725600
 ] 

Uwe Schindler commented on SOLR-940:


bq. I think you are fixing it the wrong way. 

You misunderstood; I meant that I will fix it so that it is clear what it 
really does. I will not change RangeQuery's behaviour; I will remove the whole 
internal Term handling in LUCENE-1713 and only use String field, lower, upper. 
Then it is clear how it works. The current code has this strange behaviour (in 
how it handles Term instances) because of the retrofitting of RangeQuery to 
MultiTermQuery.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-06-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725594#action_12725594
 ] 

Uwe Schindler commented on SOLR-940:


Fixed in Lucene trunk rev 789692. The strange null handling in RangeQuery 
(which caused my change) will be fixed in LUCENE-1713, when RangeQuery is 
deprecated and renamed.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-06-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725590#action_12725590
 ] 

Uwe Schindler commented on SOLR-940:


You are right, but normally a new Term(field, null) should not be allowed. The 
init method should normally prevent this, but it only checks for the terms 
being null. The RangeTermEnum is then positioned on the null term (which 
should be "").

I will change this back (also in FieldCacheRangeFilter) and fix the wrong logic 
of RangeQuery to clearly support it.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Issue Comment Edited: (SOLR-940) TrieRange support

2009-06-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725568#action_12725568
 ] 

Uwe Schindler edited comment on SOLR-940 at 6/30/09 4:30 AM:
-

Oh, this was intended.

The reason was that all other range filters in Lucene core do not allow this. 
In general, one should use a MatchAllDocsQuery in this case, as it performs 
better.
I could enable it again, but then I would have to think about the other range 
queries and filters, too.

How do you handle that with other range queries?

  was (Author: thetaphi):
Oh, this was intended.

the reason was, that all other range filters in lucene core do not allow this. 
In general one should use a MatchAllDocsQuery in this case, as it is more 
performant.
I could enable it again, but I have to think about the other range queries and 
filters then.
  
> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-06-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725568#action_12725568
 ] 

Uwe Schindler commented on SOLR-940:


Oh, this was intended.

The reason was that all other range filters in Lucene core do not allow this. 
In general, one should use a MatchAllDocsQuery in this case, as it performs 
better.
I could enable it again, but then I would have to think about the other range 
queries and filters, too.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-06-25 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723938#action_12723938
 ] 

Uwe Schindler commented on SOLR-940:


{quote}
Regarding Collector#acceptsDocsOutOfOrder, I think we need to
# Return true when we do not need scores, otherwise false. 
# DocSetCollector and DocSetDelegateCollector collect in order so we return 
false 
It'd be great if someone who know more about this stuff can confirm.
{quote}

My explanation, without guarantee: whether you set it to true or false depends 
on your collector, not on the type of query, the sorting, or whether you need 
scores. It gives the query engine a hint as to whether it may deliver the doc 
ids out of order.

The simple case is the example in the Collector javadocs: if you just mark the 
doc ids in an OpenBitSet, the order is irrelevant (the bitset is no faster or 
slower when the docs arrive out of order). On the other hand, collectors like 
TopDocs can be optimized to be faster when the docs come in order. One example: 
if you read stored fields of documents using the IndexReader passed to 
setNextReader(), it may be good to receive the docs in order to avoid seeking 
back and forth all the time.
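
The OpenBitSet argument can be illustrated with a tiny order-insensitive collector. This is plain Java with hypothetical names, not Lucene's actual Collector API:

```java
import java.util.BitSet;

public class BitSetCollector {
    final BitSet bits = new BitSet();

    // Setting a bit is commutative, so the collected result is identical
    // no matter in which order the doc ids arrive -- such a collector can
    // safely accept out-of-order docs.
    void collect(int doc) {
        bits.set(doc);
    }

    boolean acceptsDocsOutOfOrder() {
        return true;
    }
}
```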

bq. I'm also seeing this exception in many tests (DisMaxRequestHandlerTest, 
TestTrie, TestDistributedSearch) which, I guess, are related to LUCENE-1630

I think this is because you have a custom query type that implements its own 
Weight. There may be ways to fix this using a wrapper; I am not sure.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Updated: (SOLR-940) TrieRange support

2009-06-23 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-940:
---

Attachment: SOLR-940-LUCENE-1701.patch

Patch with changes for the new trie API in Lucene core; the term "trie" no 
longer appears in Lucene (it is now NumericRangeQuery, NumericTokenStream, 
NumericField, NumericUtils). This patch only contains changes for trie and the 
FieldCache/ExtendedFieldCache merge (as this affects trie: ExtendedFieldCache 
was deprecated in Lucene and merged into FieldCache. LongParsers now extend 
FieldCache.LongParser; for backwards compatibility there is an 
ExtendedFieldCache.LongParser, too, but the new trie API cannot handle it, so 
all references to ExtendedFieldCache must be removed from Solr.)

The latest changes to Collector (the new abstract method 
acceptsDocsOutOfOrder()) are not handled! The patch is therefore untested, but 
it should work.

Solr's FSDirectory factory is also changed to use the new FSDirectory.open() 
call, which does the same as your factory (it chooses the directory 
implementation depending on the platform).

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-LUCENE-1701.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-06-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721947#action_12721947
 ] 

Uwe Schindler commented on SOLR-940:


The first part of the move to core is done; when the second part (LUCENE-1701) 
is done, I will post a patch!

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, SOLR-940-test.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-465) Add configurable DirectoryProvider so that alternate Directory implementations can be specified via solrconfig.xml

2009-06-01 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715282#action_12715282
 ] 

Uwe Schindler commented on SOLR-465:


I committed a changed version of FSDirectory to Lucene trunk today 
(LUCENE-1658). The ctor/class name of the default directory impl is now 
SimpleFSDirectory, alongside NIOFSDirectory and MMapDirectory. All three dirs can now be 
instantiated via a one-arg or two-arg ctor (File[, LockFactory]).

FSDirectory will become abstract in 3.0 and can then no longer be instantiated directly (only a 
protected ctor). It is now also a factory that automatically chooses the best 
variant for the platform: FSDirectory.open(File[, LockFactory])
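The platform-dependent choice can be pictured with a small sketch. This is my own illustration, not the actual Lucene code; the real heuristic inside FSDirectory.open has changed between Lucene versions, so treat the rules below as assumptions:

```java
public class DirectoryChooser {

    // Hypothetical re-creation of the idea behind FSDirectory.open(File):
    // pick the Directory implementation that tends to perform best on the
    // current platform. The real Lucene heuristic differs across versions.
    static String choose(String osName, String dataModel) {
        boolean windows = osName.toLowerCase().startsWith("windows");
        if ("64".equals(dataModel)) {
            return "MMapDirectory";       // large virtual address space: mmap the index
        } else if (!windows) {
            return "NIOFSDirectory";      // positional reads without a global lock
        } else {
            return "SimpleFSDirectory";   // NIO positional reads are slow on 32-bit Windows
        }
    }

    public static void main(String[] args) {
        // Inspect what this sketch would pick for the current JVM.
        System.out.println(choose(System.getProperty("os.name"),
                                  System.getProperty("sun.arch.data.model")));
    }
}
```

In application code you would of course just call FSDirectory.open and let Lucene decide.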

> Add configurable DirectoryProvider so that alternate Directory 
> implementations can be specified via solrconfig.xml
> --
>
> Key: SOLR-465
> URL: https://issues.apache.org/jira/browse/SOLR-465
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.3
>Reporter: TJ Laurenzo
>Assignee: Yonik Seeley
> Fix For: 1.4
>
> Attachments: SOLR-465-fixes.patch, SOLR-465.patch, SOLR-465.patch, 
> SOLR-465.patch, SOLR-465.patch, SOLR-465.patch, SOLR-465.patch, 
> solr-directory-provider.patch
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Solr is presently hard-coded to use the FSDirectory implementation in Lucene. 
>  Other Directory implementations are possible.  This patch creates a new 
> DirectoryProvider interface and extends SolrCore to load an implementation of 
> it from solrconfig.xml (if specified).  If not specified, then it will 
> fallback to the FSDirectory.
> A DirectoryProvider plugin can be configured in solrconfig.xml with the 
> following XML:
>
>   
>
> This patch was created against solr trunk checked out on 11/20/2007.  Most of 
> it is new code and should apply cleanly or with minor relocation.  If it does 
> not, let me know and I will update.




[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr

2009-05-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708825#action_12708825
 ] 

Uwe Schindler commented on SOLR-773:


{quote}
Agreed on the first, not 100% certain on the second.  On the second, this issue 
is the gate keeper.  If people reviewing the patch feel there are better ways 
to do things, then we should work through them before committing.  What you are 
effectively seeing is an increase in the developers working on from 1 to many, 
it's just not on committed code.
{quote}
I agree with iterating on the patch and also on LocalLucene (not only 
LocalSolr).

{quote}
On the first point, I don't follow.  Isn't LocalLucene and LocalSolr, just 
exactly a GIS search capability for Lucene/Solr?  I'm not sure if I would 
categorize it as shoe-horning.  There are many things that Lucene/Solr can 
power, GIS search with text is one of them.  By committing this patch (or some 
variation), we are saying Solr is going to support it.  Of course, there are 
other ways to do it, but that doesn't preclude it from L/S.  The combination of 
text search plus GIS search is very powerful, as you know. 
{quote}

Yes, and in the past we tried solutions that use unique doc ids to join between 
an RDBMS (used for the geo search) and Lucene (used for the full-text part). The 
biggest problem is that these join operations are very inefficient when many 
documents are affected. Lucene, as a full-text engine, has the great advantage of 
displaying results very fast without retrieving all hits (you normally 
display only the best-ranking ones). If you combine it with a database, you have 
to intersect the results in a HitCollector while filling the PriorityQueue. 
An RDBMS always wraps select statements in "transactions" 
and only delivers the results when the query is completely done, which adds 
a further time lag. Doing the geo query completely in Lucene makes our search 
in PANGAEA about a hundred times faster in most cases (with TrieRange).

{quote}
Still, I think Yonik's main point is why reinvent the wheel when it comes to 
things like distributed search and the need for custom code for indexing, etc. 
when they likely can be handled through function queries and field types and 
therefore all of Solr's current functionality would just work.  The other 
capabilities (like sorting by a FunctionQuery) is icing on the cake that helps 
solve other problems as well.
{quote}

I also agree that we should think about reimplementing the specific parts of the code that 
can easily be done with "standard" Lucene/Solr tools (I would count TrieRange among them, 
even though it is not "standard" today: it is generic, not bound to geo, and 
will hopefully move to Lucene core as NumericRangeQuery & utils).

In my opinion, LocalLucene should be as generic as possible and should not add 
too many custom datatypes, specific index structures, fixed field names, etc. A 
problem of most GIS solutions for relational databases is that you are tied to 
specific database schemas. E.g. for our search at PANGAEA, we want to display 
the results of the Lucene query also as a map, but for that you cannot use a 
common GIS solution, because it does not know how to extract the data from 
Lucene.

Soon I will start a small project to add a plugin to GeoServer's feature 
store that uses Lucene, instead of an RDBMS, shape files, or the like, for the 
features. With that it may also be possible to retrieve the geo objects (in our 
case, datasets with lat/lon) and display them in a WMS using OpenLayers, stream 
them to Google Earth using the GeoServer KML Streaming API (using TrieRange for 
the bounding box filter), and so on.

About your benchmarks:
I suspect that you have warmed up the readers, but I think you should get 
faster performance out of TrieRange. In my opinion, you should not use doubles 
for lat/lon; just use ints and scale the float lat/lon by multiplying with 1E7 to 
get 7 decimal digits (which is surely enough for geo; 180*1E7 still fits into 
the int range).

> Incorporate Local Lucene/Solr
> -
>
> Key: SOLR-773
> URL: https://issues.apache.org/jira/browse/SOLR-773
> Project: Solr
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, 
> SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
> SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, 
> SOLR-773.patch, spatial-solr.tar.gz
>
>
> Local Lucene has been donated to the Lucene project.  It has some Solr 
> components, but we should evaluate how best to incorporate it into Solr.
> See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene


[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr

2009-05-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708633#action_12708633
 ] 

Uwe Schindler commented on SOLR-773:


Hi Patrick,

thanks for doing the comparison!
{quote}
As for bench marking, I have performed some recently using tdouble precision 0, 
~1 Million docs covering the state of NY
Top density was ~300,000 between Manhattan & Brooklyn area.
{quote}
I wonder what you mean by precision 0: what was the precision step? 2, 4, 
or 8? precisionStep=0 should throw an IAE; 64 would do a classical RangeQuery 
(enumerating all terms).

bq. And maybe switching the _localTier fields from sdouble to tdouble might 
improve that, I haven't tried, 12ms is something I can live with.
I think much faster will not be possible. Even with TrieRange you always have 
to visit TermDocs. And another thing: as you only return 100 docs, the number 
of terms visited may not be that big. The speed improvement of TrieRange becomes more 
visible the more distinct values are in the range.
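As a back-of-the-envelope sketch (my own illustration, not Lucene code): a trie-encoded value is indexed at several precisions, shifting away precisionStep bits per level, so the number of indexed terms per value depends directly on precisionStep:

```java
public class TriePrecision {

    // Number of terms indexed per value for a given bit width:
    // one term per precision level, shifting precisionStep bits each level.
    static int termsPerValue(int valueBits, int precisionStep) {
        return (valueBits + precisionStep - 1) / precisionStep; // ceil division
    }

    public static void main(String[] args) {
        // For 64-bit values, the precision steps discussed above:
        System.out.println(termsPerValue(64, 2));  // 32
        System.out.println(termsPerValue(64, 4));  // 16
        System.out.println(termsPerValue(64, 8));  // 8
        // precisionStep 64 degrades to one term per value, i.e. a
        // classical RangeQuery that enumerates all distinct terms.
        System.out.println(termsPerValue(64, 64)); // 1
    }
}
```

Smaller steps index more terms per value but let a range query cover wide spans with few low-precision terms.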

> Incorporate Local Lucene/Solr
> -
>
> Key: SOLR-773
> URL: https://issues.apache.org/jira/browse/SOLR-773
> Project: Solr
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, 
> SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
> SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, 
> SOLR-773.patch, spatial-solr.tar.gz
>
>
> Local Lucene has been donated to the Lucene project.  It has some Solr 
> components, but we should evaluate how best to incorporate it into Solr.
> See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene




[jira] Issue Comment Edited: (SOLR-773) Incorporate Local Lucene/Solr

2009-05-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708560#action_12708560
 ] 

Uwe Schindler edited comment on SOLR-773 at 5/12/09 12:01 PM:
--

bq. Also, how does the TrieRange stuff factor into this?

LocalLucene does something similar to TrieRange, but in two dimensions. It 
stores the latitude and longitude in one field as the number of a small 
rectangle (Cartesian tier), and the lower precisions are simply bigger rectangles 
(I think they are squares). The effect is that you only need one field name 
for the search, but you have the problem of limited precision.

TrieRange, on the other side, is more universal for any numeric searches and is 
not limited to geo. The bounding box search in Solr as proposed in this issue 
can also simply be done with two int fields (e.g. by scaling the lat/lon by a factor 
like 1E6 for 6 digits after the decimal point) or float field TrieRangeQueries. 
A comparison of speed and index size between LocalLucene and TrieRange would be 
interesting. Both can simply be done with Solr, but I had no time for it.

For our case (PANGAEA) we have another problem that is only solvable with 
TrieRange, not LocalLucene: our datasets are themselves bounding boxes, and if the 
user enters a bounding box, a document is a hit if the two boxes intersect. This can easily be 
done with four half-open ranges. There is a small speed impact because the 
half-open ranges may hit very many TermDocs for the lower precisions, but maybe 
I will create a special combined filter that collects TermDocs into only one 
BitSet, so you can combine these ranges easily (but I have no idea how to make a 
sensible API for that).
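The four half-open ranges can be pictured with a small sketch (my own illustration of the intersection test, not PANGAEA or Solr code): each indexed box stores minLat/maxLat/minLon/maxLon, and it intersects the query box iff every "min" lies below the query's max and every "max" above the query's min; each condition maps to one half-open range query on one field:

```java
public class BBoxIntersect {

    // Two axis-aligned boxes intersect iff these four half-open
    // conditions all hold; each maps to one range query on one field.
    static boolean intersects(double minLat, double maxLat,
                              double minLon, double maxLon,
                              double qMinLat, double qMaxLat,
                              double qMinLon, double qMaxLon) {
        return minLat <= qMaxLat   // box starts below the query's top edge
            && maxLat >= qMinLat   // box ends above the query's bottom edge
            && minLon <= qMaxLon
            && maxLon >= qMinLon;
    }

    public static void main(String[] args) {
        // Overlapping boxes:
        System.out.println(intersects(0, 10, 0, 10, 5, 15, 5, 15));  // true
        // Disjoint in latitude:
        System.out.println(intersects(0, 10, 0, 10, 20, 30, 5, 15)); // false
    }
}
```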

Another idea for using TrieRange in geo search is to use a Hilbert curve on the 
earth and just do a range query around the position on this curve (look at the 
picture at http://en.wikipedia.org/wiki/Hilbert_curve, then the idea is clear). 
As far as I know, geohash works with this Hilbert curve (it is 
the position on this curve), so if you index the binary geohash as a long with 
TrieRange, you could do this range very simply (correct me if I am wrong!). The 
drawback is that you will only find quadratic areas (so the use case is: find 
all phone cells around (lat,lon)).

In my opinion, I would recommend the following:
If you need standard queries like "find all phone cells around a position", use 
LocalLucene. If you need full flexibility, just treat lat/lon or whatever CRS 
(Gauss-Krüger etc.) as two numeric values, on which you can do SQL-like "between", 
">", "<", ">=" and "<=" searches very fast.


  was (Author: thetaphi):
bq. Also, how does the TrieRange stuff factor into this?

LocalLucene does something similar to TrieRange, but in two dimensions. It 
stores the latitude and longitude in one field as the number of a small 
rectangle (Cartesian tier), and the lower precisions are simply bigger rectangles 
(I think they are squares). The effect is that you only need one field name 
for the search, but you have the problem of limited precision.

TrieRange, on the other side, is more universal for any numeric searches and is 
not limited to geo. The bounding box search in Solr as proposed in this issue 
can also simply be done with two long fields (e.g. by scaling the lat/lon by a factor 
<1) or float field TrieRangeQueries. A comparison of speed 
and index size between LocalLucene and TrieRange would be interesting. Both can simply be done with 
Solr, but I had no time for it.

For our case (PANGAEA) we have another problem that is only solvable with 
TrieRange, not LocalLucene: our datasets are themselves bounding boxes, and if the 
user enters a bounding box, a document is a hit if the two boxes intersect. This can easily be 
done with two half-open ranges. There is a small speed impact because the 
half-open ranges may hit very many TermDocs for the lower precisions, but maybe 
I will create a special combined filter that collects TermDocs into only one 
BitSet, so you can combine these ranges easily (but I have no idea how to make a 
sensible API for that).

Another idea for using TrieRange in geo search is to use a Hilbert curve on the 
earth and just do a range query around the position on this curve (look at the 
picture at http://en.wikipedia.org/wiki/Hilbert_curve, then the idea is clear). 
As far as I know, geohash works with this Hilbert curve (it is 
the position on this curve), so if you index the binary geohash as a long with 
TrieRange, you could do this range very simply (correct me if I am wrong!). The 
drawback is that you will only find quadratic areas (so the use case is: find 
all phone cells around (lat,lon)).

In my opinion, I would recommend the following:
If you need standard queries like "find all phone cells around a position", use 
LocalLucene. If you need full flexibility, just treat lat/lon or whatever CRS 
(Gauss-Krüger etc.) as two numeric values, on which you can do SQL-like "between", 
">", "<", ">=" and "<=" searches very fast.

[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr

2009-05-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708560#action_12708560
 ] 

Uwe Schindler commented on SOLR-773:


bq. Also, how does the TrieRange stuff factor into this?

LocalLucene does something similar to TrieRange, but in two dimensions. It 
stores the latitude and longitude in one field as the number of a small 
rectangle (Cartesian tier), and the lower precisions are simply bigger rectangles 
(I think they are squares). The effect is that you only need one field name 
for the search, but you have the problem of limited precision.

TrieRange, on the other side, is more universal for any numeric searches and is 
not limited to geo. The bounding box search in Solr as proposed in this issue 
can also simply be done with two long fields (e.g. by scaling the lat/lon by a factor 
<1) or float field TrieRangeQueries. A comparison of speed 
and index size between LocalLucene and TrieRange would be interesting. Both can simply be done with 
Solr, but I had no time for it.

For our case (PANGAEA) we have another problem that is only solvable with 
TrieRange, not LocalLucene: our datasets are themselves bounding boxes, and if the 
user enters a bounding box, a document is a hit if the two boxes intersect. This can easily be 
done with two half-open ranges. There is a small speed impact because the 
half-open ranges may hit very many TermDocs for the lower precisions, but maybe 
I will create a special combined filter that collects TermDocs into only one 
BitSet, so you can combine these ranges easily (but I have no idea how to make a 
sensible API for that).

Another idea for using TrieRange in geo search is to use a Hilbert curve on the 
earth and just do a range query around the position on this curve (look at the 
picture at http://en.wikipedia.org/wiki/Hilbert_curve, then the idea is clear). 
As far as I know, geohash works with this Hilbert curve (it is 
the position on this curve), so if you index the binary geohash as a long with 
TrieRange, you could do this range very simply (correct me if I am wrong!). The 
drawback is that you will only find quadratic areas (so the use case is: find 
all phone cells around (lat,lon)).

In my opinion, I would recommend the following:
If you need standard queries like "find all phone cells around a position", use 
LocalLucene. If you need full flexibility, just treat lat/lon or whatever CRS 
(Gauss-Krüger etc.) as two numeric values, on which you can do SQL-like "between", 
">", "<", ">=" and "<=" searches very fast.


> Incorporate Local Lucene/Solr
> -
>
> Key: SOLR-773
> URL: https://issues.apache.org/jira/browse/SOLR-773
> Project: Solr
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, 
> SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, 
> SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, 
> SOLR-773.patch, spatial-solr.tar.gz
>
>
> Local Lucene has been donated to the Lucene project.  It has some Solr 
> components, but we should evaluate how best to incorporate it into Solr.
> See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene




[jira] Updated: (SOLR-940) TrieRange support

2009-04-23 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-940:
---

Attachment: SOLR-940-LUCENE-1602.patch

I modified the patch a little bit to also include updated documentation 
about sorting and function queries.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-LUCENE-1602.patch, SOLR-940-LUCENE-1602.patch, 
> SOLR-940-newTrieAPI.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, SOLR-940-test.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699694#action_12699694
 ] 

Uwe Schindler commented on SOLR-940:


Again a change:
*TrieRangeQuery is now available as a separate class; *TrieRangeFilter is no longer 
needed for Solr range queries (LUCENE-1602). It now has the same semantics as 
RangeQuery and can also be switched between constant-score and boolean-query 
rewrite.
The next change will be the move to core, package renames, and possibly a new 
name, NumericRangeQuery, in Lucene core (see the java-...@lucene discussions). Stay 
tuned.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-newTrieAPI.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, SOLR-940-test.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-04-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698188#action_12698188
 ] 

Uwe Schindler commented on SOLR-940:


bq. I'm also not very familiar with that code in QueryComponent but I guess 
that is executed only when field-sort-values are requested (for distributed 
search). I wrote tests for sorting and it works fine! So I think the problem 
will only be during Distributed Search. I'll modify TestDistributedSearch to 
test sorting of trie fields to be sure. If it doesn't, I'll open another issue 
to replace the deprecated ScoreDocComparator with FieldComparator.

OK. If distributed search does not work, the problems are bigger: the problem 
is not the comparator alone, it is the FieldCache. Distributed 
search should fill the values into the FieldCache and then let the comparator do 
the work. Comparing Lucene's code with Solr's shows that some 
parts of LUCENE-1478 are missing: the comparators use the default parser instead of 
the one given in SortField.getParser() to parse the values (when retrieving 
FieldCache.getInts() & co.).

I am not really sure why Solr needs to duplicate the sorting code from Lucene. 
Maybe this is no longer needed? In that case, everything would be OK once it is 
removed.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-newTrieAPI.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, SOLR-940-test.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-04-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696681#action_12696681
 ] 

Uwe Schindler commented on SOLR-940:


One note on sorting:
I am not really sure whether sorting works with Solr. The SortField returned by 
TrieUtils.getSortField contains its own parser (a new feature in Lucene 2.9). When 
looking through the Solr code, searching for SortField in trunk, I noticed 
that QueryComponent has its own comparators and FieldCache code (duplicating the 
Lucene code) and ignores the parser given in the SortField (the parser is not 
passed to FieldCache.getInts() & co.).

If this is the case, it will simply not work. As I do not know anything about 
the internals of Solr and what QueryComponent does, can you create a 
test case that tests sorting of trie fields?

By the way: QueryComponent contains a package-private StringFieldable just to 
convert the strings. Why not simply use a conventional Field instance instead of 
implementing the whole interface? Everything you can do with this 
StringFieldable can be done with Field, too. This is the problem with the omitTf change: the 
interface changed again in Lucene 2.9, requiring a change in this class. 
Replacing it with a simple reusable Field instance solves the interface 
problem completely.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-newTrieAPI.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, SOLR-940-test.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-04-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696702#action_12696702
 ] 

Uwe Schindler commented on SOLR-940:


I attached a patch to SOLR-1079 to fix the QueryComponent problem (remove the 
StringFieldable).

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-newTrieAPI.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, SOLR-940-test.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Updated: (SOLR-1079) Rename omitTf to omitTermFreqAndPositions

2009-04-07 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1079:


Attachment: SOLR-1079-fixcompilation.patch

This is a first patch that removes the dummy implementation of Fieldable in 
QueryComponent (it is not needed). This private class prevents using 
Lucene 2.9 trunk without modifying the source (as the interface changed again 
in 2.9).
The patch removes the Fieldable implementation completely and uses a 
conventional out-of-the-box Field to do the same. It is initialized like the 
StringFieldable at the beginning and then reused. The field name and initial 
value are just dummies, and Field.Store.YES is used to make the Field ctor 
happy.

All tests pass.

> Rename omitTf to omitTermFreqAndPositions
> -
>
> Key: SOLR-1079
> URL: https://issues.apache.org/jira/browse/SOLR-1079
> Project: Solr
>  Issue Type: Improvement
>  Components: documentation, update
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-1079-fixcompilation.patch, SOLR-1079.patch
>
>
> LUCENE-1561 has renamed omitTf.
> See 
> http://www.lucidimagination.com/search/document/376c1c12dd464164/lucene_1561_and_omittf




[jira] Commented: (SOLR-940) TrieRange support

2009-04-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696492#action_12696492
 ] 

Uwe Schindler commented on SOLR-940:


The change is now committed in Lucene trunk!
Shalin: Can you reopen this issue (I cannot do this), to not forget about it?

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-newTrieAPI.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, SOLR-940-test.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Updated: (SOLR-940) TrieRange support

2009-04-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-940:
---

Attachment: SOLR-940-newTrieAPI.patch

Updated patch that supports ValueSource (currently not for date trie fields; I 
do not know how this should work. The original DateField uses a StringIndex as 
ValueSource, which is not possible for trie date fields, as no parser is available, 
and using the standard string index would fail because of more than one 
term per doc). Some tests for function queries are needed (especially as the Double and 
Float parsers are not tested by Lucene at the moment); maybe change a test for 
conventional XxxFields to run the same test with a trie field.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-newTrieAPI.patch, SOLR-940-newTrieAPI.patch, 
> SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, SOLR-940-test.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-04-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695766#action_12695766
 ] 

Uwe Schindler commented on SOLR-940:


bq. I'm having trouble applying the patch:
I created the patch from an SVN trunk checkout yesterday. Maybe it is in 
Windows format with CR-LF line endings. For me it applies cleanly using TortoiseSVN's merge 
function.

Did the LUCENE-1582 patch apply to Lucene correctly?

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-04-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695761#action_12695761
 ] 

Uwe Schindler commented on SOLR-940:


I forgot to mention: with LUCENE-1582 and this patch, sorting now works for 
trie fields. I changed the schema.xml in the patch to note this.

About function queries: if they use the "normal" field cache (long, int, 
double, float) with the supplied trie parser (as the trie SortField factory 
does), it will work. The parser for the numeric values is also separately 
available in TrieUtils. But I do not know how to enable this in Solr 
(SortField support is available through the schema); maybe you can do this, or 
change the comments.

By the way, the change needed for compilation with the new Lucene JARs is the 
omitTf renaming (SOLR-1079); I have applied it in my local checkout to be able to 
create this patch.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-04-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695581#action_12695581
 ] 

Uwe Schindler commented on SOLR-940:


This patch updates Solr's trie field support to the new Trie API (not yet 
committed). This simplifies the TokenizerFactories (no Solr-internal indexing 
Tokenizer is needed anymore, as the trie API supplies a TokenStream). The 
TrieQueryTokenizerFactory was simplified to use KeywordTokenizer instead of 
implementing its own (this change can be left out if you like your solution 
more).
For this to compile and work, the latest trunk builds of Lucene must be placed 
in lib, and another small change is needed because of a change in the Fieldable 
interface (not included in the patch).

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Updated: (SOLR-940) TrieRange support

2009-04-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-940:
---

Attachment: SOLR-940-newTrieAPI.patch

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-newTrieAPI.patch, SOLR-940-rangequery.patch, 
> SOLR-940-rangequery.patch, SOLR-940-test.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-04-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694975#action_12694975
 ] 

Uwe Schindler commented on SOLR-940:


I created a new issue, LUCENE-1582, to fix the sorting problem and also to 
supply a TokenStream directly from trieCodeLong/Int(). The API will change, but 
this would be a simplification for the Solr implementation (as the TokenStream 
can be used directly) and is more memory efficient.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-940-rangequery.patch, SOLR-940-rangequery.patch, 
> SOLR-940-test.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678485#action_12678485
 ] 

Uwe Schindler commented on SOLR-940:


The patch is the same as before; maybe you uploaded the wrong one.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
> Attachments: SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678367#action_12678367
 ] 

Uwe Schindler commented on SOLR-940:


Cool!
When looking through the code, I found that TrieQueryTokenizer is only missing 
Date support, nothing else! And I would always throw an 
IllegalArgumentException in the default case of all switch(type) statements; 
this helps find such errors faster.
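A defensive default case like the one suggested can be sketched stand-alone 
(the enum and method names below are illustrative, not Solr's actual code):

```java
public class TrieTypeDispatch {
    enum TrieType { INTEGER, LONG, FLOAT, DOUBLE, DATE }

    // Every switch over the field type gets a throwing default, so an
    // unhandled type fails fast instead of silently falling through.
    static String describe(TrieType type) {
        switch (type) {
            case INTEGER: return "32-bit int";
            case LONG:    return "64-bit long";
            case FLOAT:   return "32-bit float";
            case DOUBLE:  return "64-bit double";
            default:
                throw new IllegalArgumentException("Unhandled trie type: " + type);
        }
    }
}
```

Here DATE is deliberately left out of the switch, so a call with 
TrieType.DATE throws immediately and points at the missing case.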

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
> Attachments: SOLR-940.patch, SOLR-940.patch, SOLR-940.patch, 
> SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-03-01 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677801#action_12677801
 ] 

Uwe Schindler commented on SOLR-940:


About the sorting problem:

As already discussed in the original TrieRange issue, sorting is a problem for 
trie-encoded fields. The current FieldCache has two problems:
 - it stores the *last* term (the last term in the TermEnum!) in the cache
 - it throws an exception when the number of terms in one field > the number of 
docs (I think this was the case)

For trie fields it would be good to have something like "sorting on the first 
term of the document". This would be conformant with TrieRange, as the first 
term from trieCodeXxx() is always the highest-precision one (and also in your 
tokenizer). I think we should discuss more in LUCENE-1372, where this sorting 
problem is discussed. If it were fixed before 2.9, I could remove the whole 
multi-field parts from the TrieRange API and only support one field name (with 
which I would be really happy). Then you could index all trie terms in one 
field and sort on it (if the order of generated trie terms is preserved through 
the whole indexing and TermDocs array, which is not really simple for the field 
cache to handle).

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
> Attachments: SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-03-01 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677799#action_12677799
 ] 

Uwe Schindler commented on SOLR-940:


Looks cool, great!
I have no Solr installed here to test at large scale, but from what I see, it 
looks solid. I have only these points:
 - Missing support for half-open ranges with "\*" (just add the test for "\*" 
and pass null to TrieRangeFilter)
 - The example with a differently configured precisionStep should use a 
precisionStep < 8 [16 is a possible value, but useless because of the number of 
terms; the possible number of terms increases dramatically with higher 
precision steps (factor 2^precisionStep). The javadocs should note that 32/64 
should be used for no additional trie fields]
 - Date support should be trivial, too.
 - Does it work with the tokenizer for standard term queries? E.g., somebody 
asks for all documents containing the long value x, but without using a 
TrieRange for that (this works, but can Solr handle it?); is the value 
correctly tokenized? The problem here may be that during query parsing the 
analyzer is used and generates an "OR" BooleanQuery of all terms, including the 
lower precisions. Or is another tokenizer used for the query? (In that case it 
should just generate one term using XxxxToPrefixCoded, without shift.)
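The "\*" handling from the first point can be sketched as plain string parsing, 
mapping "\*" on either end to null before the bounds reach TrieRangeFilter (a 
hypothetical helper; the real Solr query parser works differently):

```java
public class RangeBounds {
    final Long lower;  // null = open lower end
    final Long upper;  // null = open upper end

    RangeBounds(Long lower, Long upper) {
        this.lower = lower;
        this.upper = upper;
    }

    // Parse "[10 TO *]"-style syntax; "*" becomes null, signalling a
    // half-open range to the downstream filter.
    static RangeBounds parse(String range) {
        String body = range.substring(1, range.length() - 1); // strip [ and ]
        String[] parts = body.split(" TO ");
        Long lo = parts[0].equals("*") ? null : Long.valueOf(parts[0]);
        Long hi = parts[1].equals("*") ? null : Long.valueOf(parts[1]);
        return new RangeBounds(lo, hi);
    }
}
```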

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
> Attachments: SOLR-940.patch, SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-02-28 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677740#action_12677740
 ] 

Uwe Schindler commented on SOLR-940:


Just one question:
In the query parser you use: FieldType ft = schema.getFieldType(field); so if 
you have the FieldType, why are you not able to extract the precisionStep from 
the schema? The user would only have a problem if he changes the precision step 
in the schema; with a fixed schema that contains the precisionStep as a 
parameter, you should be able to search indexed data. If you change the schema, 
you have to reindex (or use a precisionStep that is a multiple of the original 
one; see the trie javadoc: if you have indexed with step 2, you can search 
without problems using step 4).

By the way: for future usage, you could use TrieUtils.get[Int|Long]SortField 
for FieldType.getSortField instead of using SortField.STRING. If the problem 
with more than one field name is solved, sorting works via the Trie SortField 
using the correct parser.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
> Attachments: SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-02-28 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677726#action_12677726
 ] 

Uwe Schindler commented on SOLR-940:


{quote}
bq. Assuming TrieRange does all the number mojo needed in Lucene, should it 
eventually replace the existing number implementations?

Not until we can support sorting. Also, trie indexes many tokens per value, 
increasing the index size. Users who do not need range searches should not pay 
this penalty.
{quote}

If the precisionStep is configurable, you can simply use 32 (for ints) or 64 
(for longs) to not create additional precisions.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
> Attachments: SOLR-940.patch
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-02-25 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676726#action_12676726
 ] 

Uwe Schindler commented on SOLR-940:


bq. But, a tokenizer cannot add tokens in another field, which is required for 
the filter to work correctly.

You can tokenize it into one field and use TrieRangeFilter with the same field 
name for the field and the lower-precision field (second constructor). After 
that, search works, but you cannot sort anymore, because there is more than one 
token per document in this field.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-02-25 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676721#action_12676721
 ] 

Uwe Schindler commented on SOLR-940:


I would implement this tokenizer as follows (using the old Lucene Token API):

{code}
public class TrieTokenStream extends TokenStream {
  private final Iterator<String> trieVals;

  public TrieTokenStream(long value,...) {
    this.trieVals=Arrays.asList(TrieUtils.trieCodeLong(value,...)).iterator();
  }

  public Token next(Token token) {
    if (!trieVals.hasNext()) return null;
    token.reinit(trieVals.next(),0,0);
    token.setPositionIncrement(0);
    return token;
  }
}
{code}

Using this, you could index the field (without an additional helper field, and 
thus not sortable) using the standard Lucene Fieldable mechanism. No further 
changes to Solr on the indexing side should be needed.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-02-25 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676717#action_12676717
 ] 

Uwe Schindler commented on SOLR-940:


{quote}
bq. Just an idea (that came to me...): How about creating a TokenStream that 
returns the results of TrieUtils.trieCode[Long|Int]() with TokenIncrement 0. 
You should be able to search this with TrieRangeFilter (using the same field 
name for the highest and lower precision trie fields).

The difficulty is in identifying what type of tokenizer was used (TrieInt, 
TrieLong etc.) to index the field. The user will need to use the localparam 
syntax explicitly for us to use IntTrieRangeFilter e.g fq={trieint}tint:[10 TO 
100]. I would like to avoid the use of such syntax as far as possible. Creating 
the field type may be more work than this option, but it can help us use the 
correct Filter and SortField automatically.
{quote}

Now I understand the problem Yonik had with the original TrieRange 
implementation and why he wanted to change the API. Your problem is that you 
cannot just map the numerical value to *one* field and token: you have to index 
*one* numeric value as more than one token.

My idea was to just create a FieldType subclass for indexing trie fields and 
override the getAnalyzer() and getQueryAnalyzer() methods. The analyzer would 
get the numerical value and create tokens from it. Normally, it would be only 
*one* token for numerical values, converted using the to methods in FieldType. 
But now you have to create more than one token (one for each precision). This 
could be done by the analyzer returned by FieldType. This analyzer does really 
nothing; it only returns a Tokenizer that does not really tokenize, it just 
returns Tokens containing the prefix-encoded values of the given String 
converted to the numeric value at different precisions (using 
TrieUtils.trieCodeLong()).
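The "one token for each precision" idea can be illustrated with a 
self-contained sketch (the real TrieUtils encoding is more involved; this only 
shows the shape: zero out the low shift bits and tag each token with its 
shift):

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixTokens {
    // Emit one token per precision level: the low `shift` bits are masked
    // off and the shift is recorded in the token, so lower precisions of
    // different values still land on distinct terms.
    static List<String> tokens(long value, int precisionStep) {
        List<String> out = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            long masked = value & ~((1L << shift) - 1);
            out.add(shift + ":" + Long.toHexString(masked));
        }
        return out;
    }
}
```

With precisionStep 8 this yields 64/8 = 8 tokens per value; with precisionStep 
64 only the single full-precision token remains, which matches the "use 64 for 
no additional trie fields" remark elsewhere in this thread.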

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-02-25 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676698#action_12676698
 ] 

Uwe Schindler commented on SOLR-940:


Just an idea (that came to me...): how about creating a TokenStream that 
returns the results of TrieUtils.trieCode[Long|Int]() with a position increment 
of 0? You should be able to search this with TrieRangeFilter (using the same 
field name for the highest- and lower-precision trie fields).

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-02-25 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676643#action_12676643
 ] 

Uwe Schindler commented on SOLR-940:


By the way, when looking through the schema code, I found out that with Lucene 
trunk it is now also possible to sort the "SortableLongField" & others using 
the new SortField ctors that LUCENE-1478 introduced. Currently these fields are 
sorted by SortField.STRING, which is inefficient. Just a side note.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-02-25 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676635#action_12676635
 ] 

Uwe Schindler commented on SOLR-940:


bq. Yes, that seems to be the right way. I'll create TrieIntField and 
TrieLongField. We can use the implicit helper field or have it as a 
configuration option in schema.xml. We'd also need changes to the 
SolrQueryParser so that range queries on such fields are handled correctly.

And how about using this for floats, doubles, and dates (which also have 
corresponding Solr field types)? You could create field types for those too 
(subclasses of TrieIntField and TrieLongField), to be able to index these types 
using trie.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-02-25 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676577#action_12676577
 ] 

Uwe Schindler commented on SOLR-940:


bq. So, I'll definitely need some help. My first priority is to get it working 
in a simple way, then add more configuration/tuning options depending on 
feedback

Just a question: do you need help implementing (working power), or is the 
documentation not yet understandable for a beginner? I added some indexing and 
query examples in the package overview, but maybe it is not so easy for others 
to understand. Maybe we can improve the documentation.

I am not so familiar with Solr internals, but as I understand it, you have 
datatypes and field configurations in your XML documents. Maybe you should add 
new types "trie-long",... and index them using TrieUtils. I will check out the 
svn trunk of Solr and look into it. In the first step, I would only use the 
APIs taking *one* field name (which create the internal helper field ending in 
"#trie"; it would be created automatically but stay "invisible" to the user). 
This ensures simplicity and the possibility to sort efficiently using the 
SortField factory from TrieUtils (without custom sort comparators and so on).

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
>
> We need support in Solr for the new TrieRange Lucene functionality.




[jira] Commented: (SOLR-940) TrieRange support

2009-02-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676551#action_12676551
 ] 

Uwe Schindler commented on SOLR-940:


Cool, I am open to queries and requests about the API and can help where 
applicable. What do the Solr people think about LUCENE-1541? I keep it open, 
but I think it makes things too complicated.

> TrieRange support
> -
>
> Key: SOLR-940
> URL: https://issues.apache.org/jira/browse/SOLR-940
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
> Fix For: 1.4
>
>
> We need support in Solr for the new TrieRange Lucene functionality.
