[jira] Commented: (LUCENE-2358) rename KeywordMarkerTokenFilter

2010-03-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851659#action_12851659
 ] 

Steven Rowe commented on LUCENE-2358:
-

Sorry for cluttering this issue...

{quote}
I'm not really sure the KeywordAttribute is the best fit here, because its 
purpose is to keep a token from being changed by some later filter. I'm not 
sure how your filter works (I would have to see the patch), but I think using 
this attribute for this purpose could introduce some bugs?

I guess the key is that it's not really a private-use attribute; these things 
are visible to all TokenStreams, so stemmers etc. will see your 'internal' 
attribute.
{quote}

Yep, you're right, I hadn't thought it through that far.

{quote}
bq. Would it make sense to have a generalized boolean attribute [...]?

I don't really think so. Since there can only be one of any attribute in the 
tokenstream, you would have
various TokenFilters clashing on how they interpret and use some generic 
boolean attribute!
{quote}

Um, yes, I should have realized that...

(Re-writing private FillerTokenAttribute! Hooray!)

> rename KeywordMarkerTokenFilter
> ---
>
> Key: LUCENE-2358
> URL: https://issues.apache.org/jira/browse/LUCENE-2358
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Analysis
>Reporter: Robert Muir
>Priority: Trivial
> Attachments: LUCENE-2358.patch
>
>
> I would like to rename KeywordMarkerTokenFilter to KeywordMarkerFilter.
> We haven't released it yet, so it's a good time to keep the name brief and 
> consistent.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2358) rename KeywordMarkerTokenFilter

2010-03-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851652#action_12851652
 ] 

Steven Rowe commented on LUCENE-2358:
-

Hi Robert,

I'm working on a change to ShingleFilter to not output "_" filler token 
unigrams (or generally, filler-only ngrams, to cover the case where position 
increment gaps exceed n).  

I needed to be able to mark cached tokens as being filler tokens (or not) - a 
boolean attribute.  After trying to write a new private-use attribute and 
failing (I didn't make both an interface and an implementation, I think - I 
should figure it out and improve the docs I guess), I found KeywordAttribute 
and have used it to mark whether or not a cached token is a filler token 
(keyword:yes => filler-token:yes).
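
(For illustration, here is a minimal sketch of the interface-plus-implementation 
pair such a private-use attribute needs under the 3.x attribute API - the 
FillerTokenAttribute name is the one from this thread, but the code itself is 
hypothetical, not the actual patch:)

{code:java}
// FillerTokenAttribute.java (hypothetical sketch)
import org.apache.lucene.util.Attribute;

public interface FillerTokenAttribute extends Attribute {
  void setFiller(boolean isFiller);
  boolean isFiller();
}
{code}

{code:java}
// FillerTokenAttributeImpl.java (hypothetical sketch) - the default attribute
// factory locates the implementation by appending "Impl" to the interface
// name, which is why both the interface and this class are needed.
import org.apache.lucene.util.AttributeImpl;

public class FillerTokenAttributeImpl extends AttributeImpl implements FillerTokenAttribute {
  private boolean filler;

  public void setFiller(boolean isFiller) { this.filler = isFiller; }
  public boolean isFiller() { return filler; }

  @Override
  public void clear() { filler = false; }

  @Override
  public void copyTo(AttributeImpl target) {
    ((FillerTokenAttribute) target).setFiller(filler);
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof FillerTokenAttributeImpl
        && ((FillerTokenAttributeImpl) other).filler == filler;
  }

  @Override
  public int hashCode() {
    return filler ? 31 : 0;
  }
}
{code}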

Would it make sense to have a generalized boolean attribute, specialized for 
keywords or (fill-in-the-blank)?  It's a small leap to say that "iskeyword" 
means true for whatever boolean attribute you want to carry, so this isn't 
really a big deal, but I thought I'd bring it up while you're thinking about 
naming this thing.

(This may be a can of worms: if there  is a generic boolean attribute, should 
there be generic string/int/float/etc. attributes too?)

Steve


> rename KeywordMarkerTokenFilter
> ---
>
> Key: LUCENE-2358
> URL: https://issues.apache.org/jira/browse/LUCENE-2358
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Analysis
>Reporter: Robert Muir
>Priority: Trivial
> Attachments: LUCENE-2358.patch
>
>
> I would like to rename KeywordMarkerTokenFilter to KeywordMarkerFilter.
> We haven't released it yet, so it's a good time to keep the name brief and 
> consistent.




[jira] Commented: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)

2010-03-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842505#action_12842505
 ] 

Steven Rowe commented on LUCENE-2302:
-

bq. A CollationFilter will not be needed anymore after that change, as any 
Tokenizer-Chain that wants to use collation can simply supply a special 
AttributeFactory to the ctor, that creates a special TermAttributeImpl class 
with modified getBytesRef(). 

Mike M. noted on 
[LUCENE-1435|http://issues.apache.org/jira/browse/LUCENE-1435?focusedCommentId=12646667&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12646667]
 that the way to do "internal-to-indexing" collation is to store the original 
string in the term dictionary, sorted via user-specifiable collation.

> Replacement for TermAttribute+Impl with extended capabilities (byte[] 
> support, CharSequence, Appendable)
> 
>
> Key: LUCENE-2302
> URL: https://issues.apache.org/jira/browse/LUCENE-2302
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
> Fix For: Flex Branch
>
>
> For flexible indexing, terms can be simple byte[] arrays, while the current 
> TermAttribute only supports char[]. This is fine for plain text, but e.g. 
> NumericTokenStream should work directly on the byte[] array.
> Also, TermAttribute lacks some interfaces that would make it simpler for 
> users to work with: Appendable and CharSequence.
> I propose to create a new interface "CharTermAttribute" with a clean new API 
> that concentrates on CharSequence and Appendable.
> The implementation class will simply support the old and new interfaces, 
> working on the same term buffer; DEFAULT_ATTRIBUTE_FACTORY will take care of 
> this. So if somebody adds a TermAttribute, he will get an implementation 
> class that can also be used as a CharTermAttribute. As both attributes create 
> the same impl instance, both calls to addAttribute are equal. So a TokenFilter 
> that adds CharTermAttribute to the source will work with the same instance as 
> the Tokenizer that requested the (deprecated) TermAttribute.
> To also support byte[]-only terms, as Collation or NumericField need, a 
> separate getter-only interface will be added that returns a reusable 
> BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will 
> also support this interface. For backwards compatibility with old 
> self-made TermAttribute implementations, the indexer will check with 
> hasAttribute() whether the BytesRef getter interface is there, and if not, 
> will wrap an old-style TermAttribute (a deprecated wrapper class will be 
> provided): new BytesRefGetterAttributeWrapper(TermAttribute), which the 
> indexer will then use.




[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters

2010-02-24 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838081#action_12838081
 ] 

Steven Rowe commented on LUCENE-2167:
-

I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and 
both Unicode versions 5.1 and 5.2 - you can see the files here:

http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/

The files are UnicodeWordBreakRules_5_*.* - these are written to: parse the 
Unicode test files; run the generated scanner against each composed test 
string; output the break opportunities/prohibitions in the same format as the 
test files; and then finally compare the output against the test file itself, 
looking for a match.  (These tests currently pass.)

The .flex files would need to be significantly changed to be used as a 
StandardTokenizer replacement, but you can get an idea from them how to 
implement the Unicode word break rules in (as yet unreleased version 1.5.0) 
JFlex syntax.

> StandardTokenizer Javadoc does not correctly describe tokenization around 
> punctuation characters
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
>Reporter: Shyamal Prasad
>Priority: Minor
> Attachments: LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The Javadoc for StandardTokenizer states:
> {quote}
> Splits words at punctuation characters, removing punctuation. 
> However, a dot that's not followed by whitespace is considered part of a 
> token.
> Splits words at hyphens, unless there's a number in the token, in which case 
> the whole 
> token is interpreted as a product number and is not split.
> {quote}
> This is not accurate. The actual JFlex implementation treats hyphens 
> interchangeably with
> punctuation. So, for example "video,mp4,test" results in a *single* token and 
> not three tokens
> as the documentation would suggest.
> Additionally, the documentation suggests that "video-mp4-test-again" would 
> become a single
> token, but in reality it results in two tokens: "video-mp4-test" and "again".
> IMHO the parser implementation is fine as is since it is hard to keep 
> everyone happy, but it is probably
> worth cleaning up the documentation string. 
> The patch included here updates the documentation string and adds a few test 
> cases to confirm the cases described above.




[jira] Issue Comment Edited: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters

2010-02-24 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838081#action_12838081
 ] 

Steven Rowe edited comment on LUCENE-2167 at 2/24/10 11:27 PM:
---

I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and 
both Unicode versions 5.1 and 5.2 - you can see the files here:

http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/

The files are {{UnicodeWordBreakRules_5_\*.\*}} - these are written to: parse 
the Unicode test files; run the generated scanner against each composed test 
string; output the break opportunities/prohibitions in the same format as the 
test files; and then finally compare the output against the test file itself, 
looking for a match.  (These tests currently pass.)

The .flex files would need to be significantly changed to be used as a 
StandardTokenizer replacement, but you can get an idea from them how to 
implement the Unicode word break rules in (as yet unreleased version 1.5.0) 
JFlex syntax.

  was (Author: steve_rowe):
I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT 
and both Unicode versions 5.1 and 5.2 - you can see the files here:

http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/

The files are UnicodeWordBreakRules_5_*.* - these are written to: parse the 
Unicode test files; run the generated scanner against each composed test 
string; output the break opportunities/prohibitions in the same format as the 
test files; and then finally compare the output against the test file itself, 
looking for a match.  (These tests currently pass.)

The .flex files would need to be significantly changed to be used as a 
StandardTokenizer replacement, but you can get an idea from them how to 
implement the Unicode word break rules in (as yet unreleased version 1.5.0) 
JFlex syntax.
  
> StandardTokenizer Javadoc does not correctly describe tokenization around 
> punctuation characters
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
>Reporter: Shyamal Prasad
>Priority: Minor
> Attachments: LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The Javadoc for StandardTokenizer states:
> {quote}
> Splits words at punctuation characters, removing punctuation. 
> However, a dot that's not followed by whitespace is considered part of a 
> token.
> Splits words at hyphens, unless there's a number in the token, in which case 
> the whole 
> token is interpreted as a product number and is not split.
> {quote}
> This is not accurate. The actual JFlex implementation treats hyphens 
> interchangeably with
> punctuation. So, for example "video,mp4,test" results in a *single* token and 
> not three tokens
> as the documentation would suggest.
> Additionally, the documentation suggests that "video-mp4-test-again" would 
> become a single
> token, but in reality it results in two tokens: "video-mp4-test" and "again".
> IMHO the parser implementation is fine as is since it is hard to keep 
> everyone happy, but it is probably
> worth cleaning up the documentation string. 
> The patch included here updates the documentation string and adds a few test 
> cases to confirm the cases described above.




[jira] Issue Comment Edited: (LUCENE-2218) ShingleFilter improvements

2010-01-29 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806565#action_12806565
 ] 

Steven Rowe edited comment on LUCENE-2218 at 1/29/10 11:48 PM:
---

Solr support for the ShingleFilter improvements implemented here: SOLR-1740

  was (Author: steve_rowe):
Solr support for the ShingleFilter improvements implemented here
  
> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2118.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Commented: (LUCENE-2218) ShingleFilter improvements

2010-01-29 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806409#action_12806409
 ] 

Steven Rowe commented on LUCENE-2218:
-

I see that SOLR-1674 introduced a new class TestShingleFilterFactory, but 
SOLR-1657 doesn't have any changes to ShingleFilterFactory, and your list in 
the description doesn't include it.

Are there other Solr-Lucene-3.0-analysis issues I'm missing?

> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>    Affects Versions: 3.0
>Reporter: Steven Rowe
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2118.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Commented: (LUCENE-2218) ShingleFilter improvements

2010-01-29 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806402#action_12806402
 ] 

Steven Rowe commented on LUCENE-2218:
-

Thanks, Robert.

I plan on creating a Solr issue to integrate these ShingleFilter changes into 
ShingleFilterFactory.  I haven't followed your (and others') work moving Solr 
closer to upgrading to Lucene 3.0 - are there issues with that that I should be 
aware of?

> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2118.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Issue Comment Edited: (LUCENE-2223) ShingleFilter benchmark

2010-01-17 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801692#action_12801692
 ] 

Steven Rowe edited comment on LUCENE-2223 at 1/18/10 7:13 AM:
--

bq. This appears to work well, the only thing I would ask for is a simple test 
for the task (maybe especially testing the option that changes the wrapped 
analyzer's classname from the default std. analyzer)

Done in attached patch - thanks for catching this oversight.

In constructing the test, I noticed that I had not brought over the analyzer 
package abbreviation logic from NewAnalyzerTask; this is now present in 
NewShingleAnalyzerTask, so that "analyzer:WhitespaceAnalyzer" is functional as 
a param.

*Edit*: Also removed some debug printing I'd forgotten to remove from 
NewShingleAnalyzerTask.

  was (Author: steve_rowe):
bq. This appears to work well, the only thing I would ask for is a simple 
test for the task (maybe especially testing the option that changes the wrapped 
analyzer's classname from the default std. analyzer)

Done in attached patch - thanks for catching this oversight.

In constructing the test, I noticed that I had not brought over the analyzer 
package abbreviation logic from NewAnalyzerTask; this is now present in 
NewShingleAnalyzerTask, so that "analyzer:WhitespaceAnalyzer" is functional as 
a param.
  
> ShingleFilter benchmark
> ---
>
> Key: LUCENE-2223
> URL: https://issues.apache.org/jira/browse/LUCENE-2223
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2223.patch, LUCENE-2223.patch
>
>
> Spawned from LUCENE-2218: a benchmark for ShingleFilter, along with a new 
> task to instantiate (non-default-constructor) ShingleAnalyzerWrapper: 
> NewShingleAnalyzerTask.
> The included shingle.alg runs ShingleAnalyzerWrapper, wrapping the default 
> StandardAnalyzer, with 4 different configurations over 10,000 Reuters 
> documents each.  To allow ShingleFilter timings to be isolated from the rest 
> of the pipeline, StandardAnalyzer is also run over the same set of Reuters 
> documents.  This set of 5 runs is then run 5 times.
> The patch includes two perl scripts, the first to output JIRA table formatted 
> timing information, with the minimum elapsed time for each of the 4 
> ShingleAnalyzerWrapper runs and the StandardAnalyzer run, and the second to 
> compare two runs' JIRA output, producing another JIRA table showing % 
> improvement.




[jira] Updated: (LUCENE-2223) ShingleFilter benchmark

2010-01-17 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2223:


Attachment: LUCENE-2223.patch

bq. This appears to work well, the only thing I would ask for is a simple test 
for the task (maybe especially testing the option that changes the wrapped 
analyzer's classname from the default std. analyzer)

Done in attached patch - thanks for catching this oversight.

In constructing the test, I noticed that I had not brought over the analyzer 
package abbreviation logic from NewAnalyzerTask; this is now present in 
NewShingleAnalyzerTask, so that "analyzer:WhitespaceAnalyzer" is functional as 
a param.
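
(Roughly, the abbreviation logic amounts to something like the following 
hypothetical helper - the real code lives in NewAnalyzerTask/NewShingleAnalyzerTask 
and may differ in detail:)

{code:java}
import org.apache.lucene.analysis.Analyzer;

// Hypothetical illustration of the package-abbreviation idea, not the patch itself.
public final class AnalyzerClassResolver {
  public static Analyzer resolve(String name) throws Exception {
    try {
      // first try the name exactly as given (fully qualified)
      return (Analyzer) Class.forName(name).newInstance();
    } catch (ClassNotFoundException e) {
      // otherwise assume the core analysis package, so "WhitespaceAnalyzer"
      // resolves to org.apache.lucene.analysis.WhitespaceAnalyzer
      return (Analyzer) Class.forName("org.apache.lucene.analysis." + name).newInstance();
    }
  }
}
{code}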

> ShingleFilter benchmark
> ---
>
> Key: LUCENE-2223
> URL: https://issues.apache.org/jira/browse/LUCENE-2223
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2223.patch, LUCENE-2223.patch
>
>
> Spawned from LUCENE-2218: a benchmark for ShingleFilter, along with a new 
> task to instantiate (non-default-constructor) ShingleAnalyzerWrapper: 
> NewShingleAnalyzerTask.
> The included shingle.alg runs ShingleAnalyzerWrapper, wrapping the default 
> StandardAnalyzer, with 4 different configurations over 10,000 Reuters 
> documents each.  To allow ShingleFilter timings to be isolated from the rest 
> of the pipeline, StandardAnalyzer is also run over the same set of Reuters 
> documents.  This set of 5 runs is then run 5 times.
> The patch includes two perl scripts, the first to output JIRA table formatted 
> timing information, with the minimum elapsed time for each of the 4 
> ShingleAnalyzerWrapper runs and the StandardAnalyzer run, and the second to 
> compare two runs' JIRA output, producing another JIRA table showing % 
> improvement.




[jira] Commented: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801651#action_12801651
 ] 

Steven Rowe commented on LUCENE-2218:
-

bq. I made a trivial change: shingleFilterTestCommon is now implemented with 
assertTokenStreamContents, for better checking. It recently gained some good 
sanity checks for things like clearAttributes, even with save/restore state, 
etc. No change to the code; tests all still pass.

Cool, thanks.  

FYI, you named your patch LUCENE-2118.patch instead of LUCENE-2218.patch.

> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2118.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Issue Comment Edited: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801649#action_12801649
 ] 

Steven Rowe edited comment on LUCENE-2218 at 1/18/10 2:20 AM:
--

bq. hey, want to break the benchmark out into a separate jira issue for 
simplicity? 

Done - see LUCENE-2223.

Deleted benchmark patches from this issue.

  was (Author: steve_rowe):
bq. hey, want to break the benchmark out into a separate jira issue for 
simplicity? 

Done - see LUCENE-2223.
  
> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2118.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Updated: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2218:


Attachment: (was: LUCENE-2218.benchmark.patch)

> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>    Reporter: Steven Rowe
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2118.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Updated: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2218:


Attachment: (was: LUCENE-2218.benchmark.patch)

> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>    Reporter: Steven Rowe
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2118.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Updated: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2218:


Attachment: (was: LUCENE-2218.benchmark.patch)

> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>    Reporter: Steven Rowe
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2118.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Commented: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801649#action_12801649
 ] 

Steven Rowe commented on LUCENE-2218:
-

bq. hey, want to break the benchmark out into a separate jira issue for 
simplicity? 

Done - see LUCENE-2223.

> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2118.patch, LUCENE-2218.benchmark.patch, 
> LUCENE-2218.benchmark.patch, LUCENE-2218.benchmark.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Updated: (LUCENE-2223) ShingleFilter benchmark

2010-01-17 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2223:


Attachment: LUCENE-2223.patch

ShingleFilter benchmark patch attached.  Use "ant shingle" to produce JIRA 
table formatted output.

> ShingleFilter benchmark
> ---
>
> Key: LUCENE-2223
> URL: https://issues.apache.org/jira/browse/LUCENE-2223
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2223.patch
>
>
> Spawned from LUCENE-2218: a benchmark for ShingleFilter, along with a new 
> task to instantiate (non-default-constructor) ShingleAnalyzerWrapper: 
> NewShingleAnalyzerTask.
> The included shingle.alg runs ShingleAnalyzerWrapper, wrapping the default 
> StandardAnalyzer, with 4 different configurations over 10,000 Reuters 
> documents each.  To allow ShingleFilter timings to be isolated from the rest 
> of the pipeline, StandardAnalyzer is also run over the same set of Reuters 
> documents.  This set of 5 runs is then run 5 times.
> The patch includes two perl scripts, the first to output JIRA table formatted 
> timing information, with the minimum elapsed time for each of the 4 
> ShingleAnalyzerWrapper runs and the StandardAnalyzer run, and the second to 
> compare two runs' JIRA output, producing another JIRA table showing % 
> improvement.




[jira] Created: (LUCENE-2223) ShingleFilter benchmark

2010-01-17 Thread Steven Rowe (JIRA)
ShingleFilter benchmark
---

 Key: LUCENE-2223
 URL: https://issues.apache.org/jira/browse/LUCENE-2223
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/benchmark
Affects Versions: 3.0
Reporter: Steven Rowe
Priority: Minor


Spawned from LUCENE-2218: a benchmark for ShingleFilter, along with a new task 
to instantiate (non-default-constructor) ShingleAnalyzerWrapper: 
NewShingleAnalyzerTask.

The included shingle.alg runs ShingleAnalyzerWrapper, wrapping the default 
StandardAnalyzer, with 4 different configurations over 10,000 Reuters documents 
each.  To allow ShingleFilter timings to be isolated from the rest of the 
pipeline, StandardAnalyzer is also run over the same set of Reuters documents.  
This set of 5 runs is then run 5 times.

The patch includes two perl scripts, the first to output JIRA table formatted 
timing information, with the minimum elapsed time for each of the 4 
ShingleAnalyzerWrapper runs and the StandardAnalyzer run, and the second to 
compare two runs' JIRA output, producing another JIRA table showing % 
improvement.




[jira] Commented: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801503#action_12801503
 ] 

Steven Rowe commented on LUCENE-2218:
-

I think these patches are now ready to go.

> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2218.benchmark.patch, 
> LUCENE-2218.benchmark.patch, LUCENE-2218.benchmark.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Commented: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801502#action_12801502
 ] 

Steven Rowe commented on LUCENE-2218:
-

New output from the fixed benchmark script - no change in the ShingleFilter 
patch:

JAVA:
java version "1.5.0_15"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|5.03s|4.62s|2.18s|16.8%|
|2|yes|5.20s|4.84s|2.18s|13.5%|
|4|no|6.42s|5.70s|2.18s|20.5%|
|4|yes|6.53s|5.89s|2.18s|17.3%|


> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2218.benchmark.patch, 
> LUCENE-2218.benchmark.patch, LUCENE-2218.benchmark.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Updated: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2218:


Attachment: LUCENE-2218.benchmark.patch

In {{compare.shingle.benchmark.tables.pl}}, a missing decimal point caused 
overinflated improvement figures.  Fixed in this patch.

> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>    Reporter: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2218.benchmark.patch, 
> LUCENE-2218.benchmark.patch, LUCENE-2218.benchmark.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Updated: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2218:


Attachment: LUCENE-2218.benchmark.patch

Output table produced by {{compare.shingle.benchmark.tables.pl}} now has "s" 
(for seconds) in the elapsed time columns.

> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2218.benchmark.patch, 
> LUCENE-2218.benchmark.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Issue Comment Edited: (LUCENE-2218) ShingleFilter improvements

2010-01-17 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801341#action_12801341
 ] 

Steven Rowe edited comment on LUCENE-2218 at 1/17/10 5:17 PM:
--

The rewrite included some optimizations (e.g., no longer constructing n 
StringBuilders for every position in the input stream), and the performance is 
now modestly better - below is a comparison generated using the attached 
benchmark patch:

JAVA:
java version "1.5.0_15"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|4.92s|4.74s|2.19s|7.5%|
|2|yes|5.04s|4.90s|2.19s|5.6%|
|4|no|6.21s|5.82s|2.19s|11.2%|
|4|yes|6.41s|5.97s|2.19s|12.1%|


  was (Author: steve_rowe):
The rewrite included some optimizations (e.g., no longer constructing n 
StringBuilders for every position in the input stream), and the performance is 
now modestly better - below is a comparison generated using the attached 
benchmark patch:

JAVA:
java version "1.5.0_15"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|4.92|4.74|2.19|7.5%|
|2|yes|5.04|4.90|2.19|5.6%|
|4|no|6.21|5.82|2.19|11.2%|
|4|yes|6.41|5.97|2.19|12.1%|

  
> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2218.benchmark.patch, 
> LUCENE-2218.benchmark.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Commented: (LUCENE-2218) ShingleFilter improvements

2010-01-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801341#action_12801341
 ] 

Steven Rowe commented on LUCENE-2218:
-

The rewrite included some optimizations (e.g., no longer constructing n 
StringBuilders for every position in the input stream), and the performance is 
now modestly better - below is a comparison generated using the attached 
benchmark patch:

JAVA:
java version "1.5.0_15"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|4.92|4.74|2.19|7.5%|
|2|yes|5.04|4.90|2.19|5.6%|
|4|no|6.21|5.82|2.19|11.2%|
|4|yes|6.41|5.97|2.19|12.1%|


> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2218.benchmark.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Updated: (LUCENE-2218) ShingleFilter improvements

2010-01-16 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2218:


Attachment: LUCENE-2218.benchmark.patch
LUCENE-2218.patch

Patch implementing new features, and a patch for a new contrib/benchmark target 
"shingle", including a new task NewShingleAnalyzerTask.

ShingleFilter is largely rewritten here in order to support the new 
configurable minimum shingle size.
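
(For a flavor of the new knobs, a hypothetical usage sketch - the constructor 
and setter names are assumed from this patch's description rather than quoted 
from it - that emits only trigrams joined with "_":)

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ShingleExample {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new WhitespaceTokenizer(new StringReader("please divide this sentence"));
    ShingleFilter shingles = new ShingleFilter(ts, 3, 3); // min = max = 3: trigrams only
    shingles.setOutputUnigrams(false);  // no single-token passthrough
    shingles.setTokenSeparator("_");

    TermAttribute term = shingles.addAttribute(TermAttribute.class);
    while (shingles.incrementToken()) {
      System.out.println(term.term()); // please_divide_this, divide_this_sentence
    }
  }
}
{code}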



> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2218.benchmark.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Created: (LUCENE-2218) ShingleFilter improvements

2010-01-16 Thread Steven Rowe (JIRA)
ShingleFilter improvements
--

 Key: LUCENE-2218
 URL: https://issues.apache.org/jira/browse/LUCENE-2218
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 3.0
Reporter: Steven Rowe
Priority: Minor


ShingleFilter should allow configuration of minimum shingle size (in addition 
to maximum shingle size), so that it's possible to (e.g.) output only trigrams 
instead of bigrams mixed with trigrams.  The token separator used in composing 
shingles should be configurable too.




[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799056#action_12799056
 ] 

Steven Rowe commented on LUCENE-2181:
-

+1, once again, tests all pass, and "ant collation" produced expected output. 

> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, 
> LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.




[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798679#action_12798679
 ] 

Steven Rowe commented on LUCENE-2181:
-

+1, tests all pass, and "ant collation" produced expected output.

One minor detail, though - shouldn't the output files be renamed to identify 
their purpose, similarly to how you renamed bm2jira.pl?  Here's the relevant 
section in {{contrib/benchmark/build.txt}}:

{code:xml}


{code}


> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, 
> LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.




[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798590#action_12798590
 ] 

Steven Rowe commented on LUCENE-2181:
-

{quote}
Steven I also havent forgotten about your other contribution, the thing that 
creates the benchmark corpus in the first place from wikipedia.

One idea I had would be that such a thing wouldn't be too out of place in the 
open relevance project... (munging corpora etc)
{quote}

Interesting idea, thanks - I'll take a look at what's there now and see how my 
stuff would fit in.

> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, 
> LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.




[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798589#action_12798589
 ] 

Steven Rowe commented on LUCENE-2181:
-

I think NewCollationAnalyzerTask should be a little more careful about parsing 
its parameters - here's a slightly modified version of your setParams() that 
understands "impl:jdk" and complains about unrecognized params:

{code:java}
@Override
public void setParams(String params) {
  super.setParams(params);

  StringTokenizer st = new StringTokenizer(params, ",");
  while (st.hasMoreTokens()) {
    String param = st.nextToken();
    StringTokenizer expr = new StringTokenizer(param, ":");
    String key = expr.nextToken();
    String value = expr.nextToken();
    // for now we only support the "impl" parameter.
    // TODO: add strength, decomposition, etc
    if (key.equals("impl")) {
      if (value.equalsIgnoreCase("icu"))
        impl = Implementation.ICU;
      else if (value.equalsIgnoreCase("jdk"))
        impl = Implementation.JDK;
      else
        throw new RuntimeException("Unknown parameter " + param);
    } else {
      throw new RuntimeException("Unknown parameter " + param);
    }
  }
}
{code}
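
(With that in place, an .alg line along the lines of NewCollationAnalyzer(impl:jdk) 
would select the JDK collator and anything unrecognized would fail fast - assuming 
the usual benchmark convention of dropping the "Task" suffix when invoking tasks 
from .alg files.)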

> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, 
> LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.




[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798588#action_12798588
 ] 

Steven Rowe commented on LUCENE-2181:
-

I just ran the contrib/benchmark tests, and I got one test failure:

{noformat}
[junit] Testcase: 
testReadTokens(org.apache.lucene.benchmark.byTask.TestPerfTasksLogic):FAILED
[junit] expected:<3108> but was:<3128>
[junit] junit.framework.AssertionFailedError: expected:<3108> but was:<3128>
[junit] at 
org.apache.lucene.benchmark.byTask.TestPerfTasksLogic.testReadTokens(TestPerfTasksLogic.java:480)
[junit] at 
org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:212)
[junit] 
[junit] 
[junit] Test org.apache.lucene.benchmark.byTask.TestPerfTasksLogic FAILED
{noformat}


> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, 
> LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.




[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798584#action_12798584
 ] 

Steven Rowe commented on LUCENE-2181:
-

Works for me:

JAVA:
java version "1.5.0_15"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement||
|English|5.53s|2.03s|1.20s|422%|
|French|6.41s|2.13s|1.19s|455%|
|German|6.36s|2.19s|1.22s|430%|
|Ukrainian|8.92s|3.62s|1.21s|220%|


> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, 
> LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2200) Several final classes have non-overriding protected members

2010-01-10 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2200:


Attachment: LUCENE-2200.patch

bq. Robert, when you commit this make sure you mark the Attributes in 
EdgeNGramTokenFilter.java final thanks.

Whoops, I missed those - thanks for checking, Simon.  (minGram and maxGram can 
also be final in EdgeNGramTokenFilter.java.)

I've attached a new patch that includes these changes -- all tests pass.
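
For reference, the pattern in question is just declaring the attribute references 
and the configuration fields final, since they are assigned once in the constructor 
and never reassigned.  A minimal sketch, assuming the Lucene 3.x attribute API -- 
illustrative only, not the actual EdgeNGramTokenFilter code:

{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class SketchFilter extends TokenFilter {
  // Attribute references and configuration can all be final: they are
  // assigned exactly once, in field initializers or the constructor.
  private final TermAttribute termAtt = addAttribute(TermAttribute.class); // not used in this sketch
  private final int minGram;
  private final int maxGram;

  public SketchFilter(TokenStream input, int minGram, int maxGram) {
    super(input);
    this.minGram = minGram;
    this.maxGram = maxGram;
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Pass-through; the real n-gram logic is omitted here.
    return input.incrementToken();
  }
}
{code}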


> Several final classes have non-overriding protected members
> ---
>
> Key: LUCENE-2200
> URL: https://issues.apache.org/jira/browse/LUCENE-2200
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>    Reporter: Steven Rowe
>Assignee: Robert Muir
>Priority: Trivial
> Attachments: LUCENE-2200.patch, LUCENE-2200.patch, LUCENE-2200.patch
>
>
> Protected member access in final classes, except where a protected method 
> overrides a superclass's protected method, makes little sense.  The attached 
> patch converts final classes' protected access on fields to private, removes 
> two final classes' unused protected constructors, and converts one final 
> class's protected final method to private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798520#action_12798520
 ] 

Steven Rowe commented on LUCENE-2181:
-

bq. Steven, another idea: what if we simply added the options to DocMaker so we 
could turn off the tokenization of title and date fields?

Good idea!

bq. i'll update the alg file and produce a new patch

Excellent, thanks!

> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2181) benchmark for collation

2010-01-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798519#action_12798519
 ] 

Steven Rowe edited comment on LUCENE-2181 at 1/10/10 5:56 PM:
--

bq. What about this per-field thing, what if in the data files, title and date 
were simply blank?

Hmm, although the date field value is meaningless, I like the TF-in-title-field 
thing.

{quote}
Or should we worry, I agree its stupid, does it skew the results though?
One way to look at it is that its also fairly realistic (even though its 
meaningless, you see numbers and dates everywhere).
{quote}

I was thinking that it would, and that it's not really a meaningful test of 
collation - who's going to bother running collation over integers and dates? - 
but since the comparison here is between two implementations of collation, I 
think you're right that there is no skew in doing this comparison:
{panel}
icu(kiwi) + icu(apple) + icu(orange) : jdk(kiwi) + jdk(apple) + jdk(orange)
{panel}
instead of this one:
{panel}
keyword(kiwi) + keyword(apple) + icu(orange) : keyword(kiwi) + keyword(apple) + 
jdk(orange)
{panel}
(where the icu(X) transform = keyword(X) + icu-collation(X), and similarly for 
the jdk(X) transform)

bq. The downside to doing per-analyzer wrapper is that it introduces some 
complexity, in all honesty this is not really specific to this collation task, 
right? (i.e. the existing analysis/tokenization benchmarks have this same 
problem)

Yup, you're right.  A general facility to do this will end up looking (modulo 
syntax) like Solr's per-field analysis specification.

  was (Author: steve_rowe):
bq. What about this per-field thing, what if in the data files, title and 
date were simply blank?

Hmm, although the date field value is meaningless, I like the TF-in-title-field 
thing.

{quote}
Or should we worry, I agree its stupid, does it skew the results though?
One way to look at it is that its also fairly realistic (even though its 
meaningless, you see numbers and dates everywhere).
{quote}

I was thinking that it would, and that it's not really a meaningful test of 
collation - who's going to bother running collation over integers and dates? - 
but since the comparison here is between two implementations of collation, I 
think you're right that there is no skew in doing this comparison:
{panel}
icu(kiwi) + icu(apple) + (icu(orange) : jdk(kiwi) + jdk(apple) + jdk(orange)
{panel}
instead of this one:
{panel}
keyword(kiwi) + keyword(apple) + icu(orange) : keyword(kiwi) + keyword(apple) + 
jdk(orange)
{panel}
(where the icu(X) transform = keyword(X) + icu-collation(X), and similarly for 
the jdk(X) transform)

bq. The downside to doing per-analyzer wrapper is that it introduces some 
complexity, in all honesty this is not really specific to this collation task, 
right? (i.e. the existing analysis/tokenization benchmarks have this same 
problem)

Yup, you're right.  A general facility to do this will end up looking (modulo 
syntax) like Solr's per-field analysis specification.
  
> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798519#action_12798519
 ] 

Steven Rowe commented on LUCENE-2181:
-

bq. What about this per-field thing, what if in the data files, title and date 
were simply blank?

Hmm, although the date field value is meaningless, I like the TF-in-title-field 
thing.

{quote}
Or should we worry, I agree its stupid, does it skew the results though?
One way to look at it is that its also fairly realistic (even though its 
meaningless, you see numbers and dates everywhere).
{quote}

I was thinking that it would, and that it's not really a meaningful test of 
collation - who's going to bother running collation over integers and dates? - 
but since the comparison here is between two implementations of collation, I 
think you're right that there is no skew in doing this comparison:
{panel}
icu(kiwi) + icu(apple) + (icu(orange) : jdk(kiwi) + jdk(apple) + jdk(orange)
{panel}
instead of this one:
{panel}
keyword(kiwi) + keyword(apple) + icu(orange) : keyword(kiwi) + keyword(apple) + 
jdk(orange)
{panel}
(where the icu(X) transform = keyword(X) + icu-collation(X), and similarly for 
the jdk(X) transform)

bq. The downside to doing per-analyzer wrapper is that it introduces some 
complexity, in all honesty this is not really specific to this collation task, 
right? (i.e. the existing analysis/tokenization benchmarks have this same 
problem)

Yup, you're right.  A general facility to do this will end up looking (modulo 
syntax) like Solr's per-field analysis specification.

> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798508#action_12798508
 ] 

Steven Rowe commented on LUCENE-2181:
-

Looks good.  I like the way you've integrated it into the benchmark suite, and 
as you say the NewLocaleTask should prove useful elsewhere.

bq. I put the files in my apache directory, but modified your patch somewhat

One major thing you changed but didn't mention above is that rather than 
applying the collation key transform only to the LineDoc body field, it's now 
applied also to the title and date fields.  Given the nature of the top 100k 
words files -- the title is an integer representing term frequency, and the 
date is essentially meaningless (the date on which I created the file) -- I 
don't think this makes sense (and that's why I made analyzers that only applied 
collation to the body field).
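
For reference, this kind of per-field setup -- collation on the body field only, 
plain keywords everywhere else -- can be sketched as below.  The "body" field name 
and the French collator are illustrative stand-ins, not taken from the attached 
benchmark code:

{code}
import java.text.Collator;
import java.util.Locale;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.collation.CollationKeyAnalyzer;

public class BodyOnlyCollationSketch {
  public static PerFieldAnalyzerWrapper create() {
    // Default: no collation, fields are indexed as single keyword tokens.
    PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(new KeywordAnalyzer());
    // Only the line-doc body field gets collation keys.
    wrapper.addAnalyzer("body",
        new CollationKeyAnalyzer(Collator.getInstance(Locale.FRENCH)));
    return wrapper;
  }
}
{code}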

> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2181) benchmark for collation

2010-01-09 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2181:


Attachment: top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
LUCENE-2181.patch

Hi Robert, 

In the new version of the patch, {{ant benchmark}} from the {{contrib/icu/}} 
directory attempts to download the attached {{tar.bz2}} file from 
{{http://people.apache.org/~rmuir/wikipedia}} (*please change this to the 
location where you end up putting the file*), then unpacks the archive to the 
{{contrib/icu/src/benchmark/work/}} directory, then compiles and runs the 
benchmark.

In addition to the top 100K word lists, the {{tar.bz2}} file contains 
{{LICENSE.txt}}, which contains links to the Wikipedia dumps from which the 
lists were extracted, along with a link to the license that Wikipedia uses.

> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2181) benchmark for collation

2010-01-09 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2181:


Attachment: (was: LUCENE-2181.patch.zip)

> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2200) Several final classes have non-overriding protected members

2010-01-09 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798396#action_12798396
 ] 

Steven Rowe commented on LUCENE-2200:
-

FYI, all tests pass for me with the new version of the patch applied.

> Several final classes have non-overriding protected members
> ---
>
> Key: LUCENE-2200
> URL: https://issues.apache.org/jira/browse/LUCENE-2200
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>    Reporter: Steven Rowe
>Priority: Trivial
> Attachments: LUCENE-2200.patch, LUCENE-2200.patch
>
>
> Protected member access in final classes, except where a protected method 
> overrides a superclass's protected method, makes little sense.  The attached 
> patch converts final classes' protected access on fields to private, removes 
> two final classes' unused protected constructors, and converts one final 
> class's protected final method to private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2200) Several final classes have non-overriding protected members

2010-01-09 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2200:


Attachment: LUCENE-2200.patch

bq. Could we make some of the member vars final too?

Done in the new version of the patch.  Note that I only looked for final-member 
conversions in the classes already modified by the previous version of the patch.

> Several final classes have non-overriding protected members
> ---
>
> Key: LUCENE-2200
> URL: https://issues.apache.org/jira/browse/LUCENE-2200
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>    Reporter: Steven Rowe
>Priority: Trivial
> Attachments: LUCENE-2200.patch, LUCENE-2200.patch
>
>
> Protected member access in final classes, except where a protected method 
> overrides a superclass's protected method, makes little sense.  The attached 
> patch converts final classes' protected access on fields to private, removes 
> two final classes' unused protected constructors, and converts one final 
> class's protected final method to private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2200) Several final classes have non-overriding protected members

2010-01-09 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798370#action_12798370
 ] 

Steven Rowe commented on LUCENE-2200:
-

All tests pass with the attached patch applied.

> Several final classes have non-overriding protected members
> ---
>
> Key: LUCENE-2200
> URL: https://issues.apache.org/jira/browse/LUCENE-2200
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>    Reporter: Steven Rowe
>Priority: Trivial
> Attachments: LUCENE-2200.patch
>
>
> Protected member access in final classes, except where a protected method 
> overrides a superclass's protected method, makes little sense.  The attached 
> patch converts final classes' protected access on fields to private, removes 
> two final classes' unused protected constructors, and converts one final 
> class's protected final method to private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2200) Several final classes have non-overriding protected members

2010-01-09 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2200:


Attachment: LUCENE-2200.patch

> Several final classes have non-overriding protected members
> ---
>
> Key: LUCENE-2200
> URL: https://issues.apache.org/jira/browse/LUCENE-2200
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>    Reporter: Steven Rowe
>Priority: Trivial
> Attachments: LUCENE-2200.patch
>
>
> Protected member access in final classes, except where a protected method 
> overrides a superclass's protected method, makes little sense.  The attached 
> patch converts final classes' protected access on fields to private, removes 
> two final classes' unused protected constructors, and converts one final 
> class's protected final method to private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2200) Several final classes have non-overriding protected members

2010-01-09 Thread Steven Rowe (JIRA)
Several final classes have non-overriding protected members
---

 Key: LUCENE-2200
 URL: https://issues.apache.org/jira/browse/LUCENE-2200
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.0
Reporter: Steven Rowe
Priority: Trivial


Protected member access in final classes, except where a protected method 
overrides a superclass's protected method, makes little sense.  The attached 
patch converts final classes' protected access on fields to private, removes 
two final classes' unused protected constructors, and converts one final 
class's protected final method to private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-04 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796181#action_12796181
 ] 

Steven Rowe commented on LUCENE-2181:
-

{quote}
bq. ... these four files don't have Apache2 license declarations in them. We 
should put a README (or something like it) with these files to indicate the 
license.

Are they really apache license? or derived from wikipedia content?... I don't 
think we should be putting apache license headers in these files
{quote}

Hmm, I just assumed that since these files were not (anything even close to) 
verbatim copies, they were independently licensable new works, but it's 
definitely more complicated than that...

This looks like the place to start where licensing is concerned:

http://en.wikipedia.org/wiki/Wikipedia_Copyright

My (way non-expert) reading of this is that Wikipedia-derived works (and I'm 
pretty sure these frequency lists qualify as such) must be licensed under the 
[Creative Commons Attribution-Share Alike 3.0 Unported 
license|http://creativecommons.org/licenses/by-sa/3.0/], which does not appear 
to me to be entirely compatible with the Apache2 license.

So I agree with you :) - with the caveat that some form of attribution and a 
pointer to licensing info should be included with these files.


> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch.zip
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-03 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796017#action_12796017
 ] 

Steven Rowe commented on LUCENE-2181:
-

Works for me.

I do have one concern, though: the LineDocSource parser doesn't know how to 
handle comments, so these four files don't have Apache2 license declarations in 
them.  We should put a README (or something like it) with these files to 
indicate the license.

Different subject: I'm not sure where it would go, but the code I used to 
produce these top-TF wikipedia files may be useful to other people - where do 
you think it could live?  An example, maybe?

> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2181.patch.zip
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2181) benchmark for collation

2010-01-02 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2181:


Attachment: LUCENE-2181.patch.zip

Attached .zip'd patch (over 10MB because of the 4 languages' LineDocs) 
integrated into the Ant build for the ICU contrib, rather than integrated into 
the Benchmark build.

Invoke using {{ant benchmark}} from the {{contrib/icu/}} directory.


> benchmark for collation
> ---
>
> Key: LUCENE-2181
> URL: https://issues.apache.org/jira/browse/LUCENE-2181
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Robert Muir
> Attachments: LUCENE-2181.patch.zip
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2185) add @Deprecated annotations

2009-12-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795340#action_12795340
 ] 

Steven Rowe commented on LUCENE-2185:
-

The justification for using @Deprecated, AFAICT, is that conforming compilers 
are required to issue warnings for uses of each so-annotated class/method, 
whereas compilers are *not* required to issue warnings for javadoc @deprecated 
tags - although Sun's compilers do warn for the tags, other vendors' compilers 
might not.

Another (similarly theoretical) argument in favor of using @Deprecated 
annotations is that, unlike @deprecated javadoc tags, this annotation is 
available via runtime reflection.

A random information point: MYFACES-2135 removed all @Deprecated annotations 
from MyFaces code because an apparent bug in the Sun TCK flags methods bearing 
this annotation as changing method signatures.
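
As a concrete illustration of the difference (plain Java, nothing 
Lucene-specific) -- the javadoc tag is documentation only, while the annotation 
is what conforming compilers must warn about and what reflection can see:

{code}
public class DeprecationExample {
  /**
   * @deprecated use {@link #newMethod()} instead -- javadoc-only; a compiler
   * is not obliged to warn callers, and the tag is invisible to reflection.
   */
  @Deprecated // mandated compiler warnings, and discoverable at runtime via
              // Method.isAnnotationPresent(Deprecated.class)
  public void oldMethod() {
    newMethod();
  }

  public void newMethod() {
  }
}
{code}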


> add @Deprecated annotations
> ---
>
> Key: LUCENE-2185
> URL: https://issues.apache.org/jira/browse/LUCENE-2185
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Trivial
> Fix For: 3.1
>
> Attachments: LUCENE-2185.patch
>
>
> as discussed on LUCENE-2084, I think we should be consistent about use of 
> @Deprecated annotations if we are to use it.
> This patch adds the missing annotations... unfortunately i cannot commit this 
> for some time, because my internet connection does not support heavy 
> committing (it is difficult to even upload a large patch).
> So if someone wants to take it, have fun, otherwise in a week or so I will 
> commit it if nobody objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-12-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795337#action_12795337
 ] 

Steven Rowe commented on LUCENE-2084:
-

{quote}
bq. 3. Unlike getEncodedLength(byte[],int,int), 
getDecodedLength(char[],int,int) doesn't protect against overflow in the int 
multiplication by casting to long.

#3 concerns me somewhat, this is an existing problem in trunk (i guess only for 
enormous terms, though). Should we consider backporting a fix?
{quote}

The current form of this calculation will correctly handle original binary 
content of lengths up to 136MB.  IMHO the likelihood of encoding terms this 
enormous with IndexableBinaryStringTools is so minuscule that it's not worth 
the effort to backport.
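
For illustration, the overflow in question is the usual one where an int product 
wraps before the subsequent division; the guard is to widen to long first, as 
getEncodedLength() already does.  A minimal sketch -- the 15/8 bits-per-char 
ratio below is only a stand-in for whatever ratio the real methods use:

{code}
public final class LengthCalcSketch {
  // Unsafe: for very large inputs the int product wraps around before the
  // division, yielding a negative or bogus length.
  static int decodedLengthUnsafe(int numChars) {
    return numChars * 15 / 8;
  }

  // Guarded: widen to long before multiplying, then narrow the final result
  // (assuming the result itself still fits in an int).
  static int decodedLengthGuarded(int numChars) {
    return (int) ((long) numChars * 15 / 8);
  }
}
{code}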

> remove Byte/CharBuffer wrapping for collation key generation
> 
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, 
> LUCENE-2084.patch, LUCENE-2084.patch, LUCENE-2084.patch, LUCENE-2084.patch, 
> LUCENE-2084.patch, TopTFWikipediaWords.tar.bz2
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
> CollationKeyFilter and ICUCollationKeyFilter.
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int 
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods 
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-12-27 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794762#action_12794762
 ] 

Steven Rowe commented on LUCENE-2084:
-

Hi Robert, I took a look at the patch and found a couple of minor issues:

# The newly deprecated methods should get @Deprecated annotations (in addition 
to the @deprecated javadoc tags)
# IntelliJ tells me that the "final" modifier on some of the public static 
methods is not cool - AFAICT, although a static method can't be overridden 
anyway, it may be useful to leave the modifier, since unlike static alone, 
final also disallows hiding of the method by subclasses (see the example after 
this list)?  I dunno.  (Checking Lucene source, there are many "static final" 
methods, so maybe I should tell IntelliJ it's not a problem.)
# Unlike getEncodedLength(byte[],int,int), getDecodedLength(char[],int,int) 
doesn't protect against overflow in the int multiplication by casting to long.
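
To make item 2 concrete (again plain Java, not Lucene code): a static method 
cannot be overridden, but it can be *hidden* by a subclass, and adding final 
forbids even the hiding:

{code}
class Base {
  static String name() { return "base"; }         // can be hidden by a subclass
  static final String id() { return "base-id"; }  // final: hiding is a compile error
}

class Derived extends Base {
  static String name() { return "derived"; }      // hides Base.name()
  // static String id() { return "other"; }       // would not compile: id() is final in Base
}

public class HidingDemo {
  public static void main(String[] args) {
    System.out.println(Base.name());    // prints "base"
    System.out.println(Derived.name()); // prints "derived" -- the hiding in action
  }
}
{code}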


> remove Byte/CharBuffer wrapping for collation key generation
> 
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, 
> LUCENE-2084.patch, LUCENE-2084.patch, LUCENE-2084.patch, LUCENE-2084.patch, 
> TopTFWikipediaWords.tar.bz2
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
> CollationKeyFilter and ICUCollationKeyFilter.
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int 
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods 
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-12-25 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794593#action_12794593
 ] 

Steven Rowe commented on LUCENE-2084:
-

bq. Do you think after this issue is resolved (whether it helps or doesn't 
help/won't fix either way) that we should open a separate issue to work on 
committing the benchmark so we have collation benchmarks for the future? Seems 
like it would be good to have in the future.

Yeah, I don't know quite how to do that - the custom code wrapping 
ICU/CollationKeyAnalyzer is necessary because the contrib benchmark alg format 
can only handle zero-argument analyzer constructors (or those that take Version 
arguments).  I think it would be useful to have a general-purpose alg syntax to 
handle this case without requiring custom code, and also, more generally, to 
allow for the construction of arbitrary analysis pipelines without requiring 
custom code (a la Solr schema).  The alg parsing code is a bit dense though - I 
think it could be converted to a JFlex-generated grammar to simplify this kind 
of syntax extension.
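
For readers unfamiliar with the constraint: because the alg file can only name 
analyzer classes with zero-argument (or Version-only) constructors, each locale 
needs a tiny wrapper class.  A hypothetical sketch of such a wrapper -- the 
class name and locale are made up, and the real wrappers in the attached 
archive additionally use PerFieldAnalyzerWrapper to restrict collation to the 
body field:

{code}
import java.io.Reader;

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.collation.ICUCollationKeyAnalyzer;

// Zero-argument constructor, so a benchmark alg can instantiate it by name.
public class FrenchICUCollationAnalyzer extends Analyzer {
  private final Analyzer delegate =
      new ICUCollationKeyAnalyzer(Collator.getInstance(new ULocale("fr")));

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return delegate.tokenStream(fieldName, reader);
  }
}
{code}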

Can you think of an alternate way to package this benchmark that fits with 
current practice?

> remove Byte/CharBuffer wrapping for collation key generation
> 
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, 
> LUCENE-2084.patch, TopTFWikipediaWords.tar.bz2
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
> CollationKeyFilter and ICUCollationKeyFilter.
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int 
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods 
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-12-25 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2084:


Attachment: collation.benchmark.tar.bz2

Fixed up version of {{collation.benchmark.tar.bz2}} that removes printing of 
progress from the {{collation/run-benchmark.sh}} script - otherwise the same as 
the previous version.

> remove Byte/CharBuffer wrapping for collation key generation
> 
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, 
> LUCENE-2084.patch, TopTFWikipediaWords.tar.bz2
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
> CollationKeyFilter and ICUCollationKeyFilter.
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int 
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods 
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-12-25 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2084:


Attachment: (was: collation.benchmark.tar.bz2)

> remove Byte/CharBuffer wrapping for collation key generation
> 
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2084.patch, LUCENE-2084.patch, 
> TopTFWikipediaWords.tar.bz2
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
> CollationKeyFilter and ICUCollationKeyFilter.
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int 
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods 
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-12-25 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2084:


Attachment: TopTFWikipediaWords.tar.bz2

TopTFWikipediaWords.tar.bz2 contains a Maven2 project that parses unpacked 
Wikipedia dump files, creates a Lucene index from the tokens produced by the 
contrib WikipediaTokenizer, iterates over the indexed terms' term docs to 
accumulate term frequencies, stores the results in a bounded priority queue, 
and then outputs contrib benchmark LineDoc format, with the title field 
containing the collection term frequency, the date field containing the date 
the file was generated, and the body field containing the term text.

This code knows how to handle English, German, French, and Ukrainian, but could 
be extended for other languages.

I used this project to generate the line-docs for the 4 languages' 100k most 
frequent terms, in the collation benchmark archive attachment on this issue.
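
The frequency-counting step described above boils down to walking each indexed 
term's postings, summing its within-document freqs, and keeping only the top K 
in a bounded priority queue.  A simplified sketch against the Lucene 3.x API -- 
the field name and the bound are illustrative, not taken from the attached 
project:

{code}
import java.util.PriorityQueue;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class TopTermsSketch {

  /** A term plus its summed frequency over all documents, ordered by frequency. */
  static final class TermFreq implements Comparable<TermFreq> {
    final String text;
    final long freq;
    TermFreq(String text, long freq) { this.text = text; this.freq = freq; }
    public int compareTo(TermFreq other) {
      if (freq != other.freq) return freq < other.freq ? -1 : 1;
      return text.compareTo(other.text);
    }
  }

  static PriorityQueue<TermFreq> topTerms(IndexReader reader, String field, int k)
      throws Exception {
    PriorityQueue<TermFreq> queue = new PriorityQueue<TermFreq>(k); // min-heap, size <= k
    TermEnum terms = reader.terms(new Term(field, ""));
    TermDocs termDocs = reader.termDocs();
    try {
      do {
        Term term = terms.term();
        if (term == null || !term.field().equals(field)) {
          break;  // ran off the end of this field's terms
        }
        long total = 0;
        termDocs.seek(term);
        while (termDocs.next()) {
          total += termDocs.freq();  // within-document frequency
        }
        queue.add(new TermFreq(term.text(), total));
        if (queue.size() > k) {
          queue.poll();  // evict the least frequent term seen so far
        }
      } while (terms.next());
    } finally {
      termDocs.close();
      terms.close();
    }
    return queue;
  }
}
{code}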

> remove Byte/CharBuffer wrapping for collation key generation
> 
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, 
> LUCENE-2084.patch, TopTFWikipediaWords.tar.bz2
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
> CollationKeyFilter and ICUCollationKeyFilter.
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int 
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods 
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-12-25 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794590#action_12794590
 ] 

Steven Rowe commented on LUCENE-2084:
-

Here are the unpatched results I got - these look quite similar to the results 
I posted from a custom (non-contrib-benchmark) benchmark in [the description of 
LUCENE-1719|https://issues.apache.org/jira/browse/LUCENE-1719#description-open] 
:

||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement||
|1.5.0_15 (32-bit)|English|9.00s|4.89s|2.90s|207%|
|1.5.0_15 (32-bit)|French|10.64s|5.12s|2.95s|254%|
|1.5.0_15 (32-bit)|German|10.19s|5.19s|2.97s|225%|
|1.5.0_15 (32-bit)|Ukrainian|13.66s|7.20s|2.96s|152%|
||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement||
|1.5.0_15 (64-bit)|English|5.97s|2.55s|1.50s|326%|
|1.5.0_15 (64-bit)|French|6.86s|2.74s|1.56s|349%|
|1.5.0_15 (64-bit)|German|6.85s|2.76s|1.59s|350%|
|1.5.0_15 (64-bit)|Ukrainian|9.56s|4.01s|1.56s|227%|
||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement||
|1.6.0_13 (64-bit)|English|3.04s|2.06s|1.07s|99%|
|1.6.0_13 (64-bit)|French|3.58s|2.04s|1.14s|171%|
|1.6.0_13 (64-bit)|German|3.35s|2.22s|1.14s|105%|
|1.6.0_13 (64-bit)|Ukrainian|4.48s|2.94s|1.21s|89%|

Here are the results after applying the synced-to-trunk patch:

||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement||
|1.5.0_15 (32-bit)|English|8.73s|4.61s|2.90s|241%|
|1.5.0_15 (32-bit)|French|10.38s|4.87s|2.94s|285%|
|1.5.0_15 (32-bit)|German|9.95s|4.94s|2.97s|254%|
|1.5.0_15 (32-bit)|Ukrainian|13.37s|6.91s|2.90s|161%|
||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement||
|1.5.0_15 (64-bit)|English|5.78s|2.65s|1.57s|290%|
|1.5.0_15 (64-bit)|French|6.74s|2.74s|1.64s|364%|
|1.5.0_15 (64-bit)|German|6.69s|2.86s|1.66s|319%|
|1.5.0_15 (64-bit)|Ukrainian|9.40s|4.18s|1.62s|204%|
||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement||
|1.6.0_13 (64-bit)|English|3.06s|1.82s|1.09s|170%|
|1.6.0_13 (64-bit)|French|3.36s|1.88s|1.16s|206%|
|1.6.0_13 (64-bit)|German|3.40s|1.95s|1.14s|179%|
|1.6.0_13 (64-bit)|Ukrainian|4.33s|2.65s|1.21s|117%|

And here is a comparison of the two:

||Sun JVM||Language||java.text improvement||ICU4J improvement||
|1.5.0_15 (32-bit)|English|5.1%|16.8%|
|1.5.0_15 (32-bit)|French|3.8%|12.9%|
|1.5.0_15 (32-bit)|German|3.9%|13.1%|
|1.5.0_15 (32-bit)|Ukrainian|2.6%|6.2%|
||Sun JVM||Language||java.text improvement||ICU4J improvement||
|1.5.0_15 (64-bit)|English|6.6%|-2.2%|
|1.5.0_15 (64-bit)|French|4.4%|7.7%|
|1.5.0_15 (64-bit)|German|5.0%|-2.0%|
|1.5.0_15 (64-bit)|Ukrainian|3.3%|-3.7%|
||Sun JVM||Language||java.text improvement||ICU4J improvement||
|1.6.0_13 (64-bit)|English|0.5%|36.1%|
|1.6.0_13 (64-bit)|French|11.4%|25.5%|
|1.6.0_13 (64-bit)|German|-1.7%|33.8%|
|1.6.0_13 (64-bit)|Ukrainian|5.3%|20.6%|

It's not unequivocal, but there is a definite overall improvement in the 
patched version; I'd say these results justify applying the patch.

I won't post them here (mostly because I didn't save them :) ), but I've run 
the same benchmark (with some variation in the number of iterations) and 
noticed that while there are always a couple of places where the unpatched 
version appears to do slightly better, where this occurs is not consistent, and 
the cases where the patched version improves throughput always dominate.

> remove Byte/CharBuffer wrapping for collation key generation
> 
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, 
> LUCENE-2084.patch
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
> CollationKeyFilter and ICUCollationKeyFilter.
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int 
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods 
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-12-25 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794588#action_12794588
 ] 

Steven Rowe commented on LUCENE-2084:
-

To run the benchmark:

# Unpack {{collation.benchmark.tar.bz2}} in a full Lucene-java tree into the 
{{contrib/benchmark/}} directory.  All contents will be put under a new 
directory named {{collation/}}.
# Compile and jarify the localized (ICU)CollationKeyAnalyzer code: from the 
{{collation/}} directory, run the script {{build.test.analyzer.jar.sh}}.
# From an unpatched {{java/trunk/}}, build Lucene's jars: {{ant clean package}}.
# From the {{contrib/benchmark/}} directory, run the collation benchmark: 
{{collation/run-benchmark.sh > unpatched.collation.bm.table.txt}}
# Apply the attached patch to the Lucene-java tree
# From {{java/trunk/}}, build Lucene's jars: {{ant clean package}}.
# From the {{contrib/benchmark/}} directory, run the collation benchmark: 
{{collation/run-benchmark.sh > patched.collation.bm.table.txt}}
# Produce the comparison table:  
{{collation/compare.collation.benchmark.tables.pl 
unpatched.collation.bm.table.txt patched.collation.bm.table.txt > 
collation.diff.bm.table.txt}}



> remove Byte/CharBuffer wrapping for collation key generation
> 
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, 
> LUCENE-2084.patch
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
> CollationKeyFilter and ICUCollationKeyFilter.
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int 
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods 
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-12-25 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2084:


Attachment: collation.benchmark.tar.bz2

Attached collation.benchmark.tar.bz2, which contains stuff to run an 
analysis-only contrib benchmark for the (ICU)CollationKeyAnalyzers over 4 
languages: English, French, German, and Ukrainian.

Included are:

# For each language, a line-doc containing the most frequent 100K words from a 
corresponding Wikipedia dump from November 2009;
# For each language, Java code for a no-argument analyzer callable from a 
benchmark alg, that specializes (ICU)CollationKeyAnalyzer and uses 
PerFieldAnalyzerWrapper to only run it over the line-doc body field
# A script to compile and jarify the above analyzers
# A benchmark alg running 5 iterations of 10 repetitions of analysis only over 
the line-doc for each language
# A script to find the minimum elapsed time for each combination, and output 
the results as a JIRA table
# A script to run the previous two scripts once for each of three JDK versions
# A script to compare the output of the above script before and after applying 
the attached patch removing Char/ByteBuffer wrapping, and output the result as 
a JIRA table


> remove Byte/CharBuffer wrapping for collation key generation
> 
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, 
> LUCENE-2084.patch
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
> CollationKeyFilter and ICUCollationKeyFilter.
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int 
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods 
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-12-25 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2084:


Attachment: LUCENE-2084.patch

synched to current trunk, after the LUCENE-2124 move

> remove Byte/CharBuffer wrapping for collation key generation
> 
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2084.patch, LUCENE-2084.patch
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
> CollationKeyFilter and ICUCollationKeyFilter.
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int 
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods 
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2178) Benchmark contrib should allow multiple locations in ext.classpath

2009-12-22 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793627#action_12793627
 ] 

Steven Rowe commented on LUCENE-2178:
-

Trivial patch to fix (works with single or multiple locations):

{code}
Index: contrib/benchmark/build.xml
===
--- contrib/benchmark/build.xml (revision 892657)
+++ contrib/benchmark/build.xml (working copy)
@@ -114,7 +114,7 @@
 
 
 
-    <pathelement location="${benchmark.ext.classpath}"/>
+    <pathelement path="${benchmark.ext.classpath}"/>
 
 
 
{code}

> Benchmark contrib should allow multiple locations in ext.classpath
> --
>
> Key: LUCENE-2178
> URL: https://issues.apache.org/jira/browse/LUCENE-2178
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Affects Versions: 3.0
>Reporter: Steven Rowe
>Priority: Minor
>
> When {{ant run-task}} is invoked with the  {{-Dbenchmark.ext.classpath=...}} 
> option, only a single location may be specified.  If a classpath with more 
> than one location is specified, none of the locations is put on the classpath 
> for the invoked JVM.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2178) Benchmark contrib should allow multiple locations in ext.classpath

2009-12-22 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2178:


Description: When {{ant run-task}} is invoked with the  
{{-Dbenchmark.ext.classpath=...}} option, only a single location may be 
specified.  If a classpath with more than one location is specified, none of 
the locations is put on the classpath for the invoked JVM.  (was: When {{ant 
run-task}} is invoked with the  {{-Dbenchmark.ext.classpath=...} option, only a 
single location may be specified.  If a classpath with more than one location 
is specified, none of the locations is put on the classpath for the invoked 
JVM.)

> Benchmark contrib should allow multiple locations in ext.classpath
> --
>
> Key: LUCENE-2178
> URL: https://issues.apache.org/jira/browse/LUCENE-2178
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Affects Versions: 3.0
>    Reporter: Steven Rowe
>Priority: Minor
>
> When {{ant run-task}} is invoked with the  {{-Dbenchmark.ext.classpath=...}} 
> option, only a single location may be specified.  If a classpath with more 
> than one location is specified, none of the locations is put on the classpath 
> for the invoked JVM.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2178) Benchmark contrib should allow multiple locations in ext.classpath

2009-12-22 Thread Steven Rowe (JIRA)
Benchmark contrib should allow multiple locations in ext.classpath
--

 Key: LUCENE-2178
 URL: https://issues.apache.org/jira/browse/LUCENE-2178
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 3.0
Reporter: Steven Rowe
Priority: Minor


When {{ant run-task}} is invoked with the  {{-Dbenchmark.ext.classpath=...} 
option, only a single location may be specified.  If a classpath with more than 
one location is specified, none of the locations is put on the classpath for 
the invoked JVM.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2124) move JDK collation to core, ICU collation to ICU contrib

2009-12-20 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793074#action_12793074
 ] 

Steven Rowe commented on LUCENE-2124:
-

Robert, I noticed something you missed in the move - here's a trivial patch:

{code}
Index: contrib/icu/src/java/overview.html
===
--- contrib/icu/src/java/overview.html  (revision 892657)
+++ contrib/icu/src/java/overview.html  (working copy)
@@ -34,7 +34,7 @@
   CollationKeys.  icu4j-collation-4.0.jar, 
   a trimmed-down version of icu4j-4.0.jar that contains only the 
   code and data needed to support collation, is included in Lucene's 
Subversion 
-  repository at contrib/collation/lib/.
+  repository at contrib/icu/lib/.
 
 
 Use Cases
{code}

> move JDK collation to core, ICU collation to ICU contrib
> 
>
> Key: LUCENE-2124
> URL: https://issues.apache.org/jira/browse/LUCENE-2124
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/*, Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2124.patch, LUCENE-2124.patch
>
>
> As mentioned on the list, I propose we move the JDK-based 
> CollationKeyFilter/CollationKeyAnalyzer, currently located in 
> contrib/collation into core for collation support (language-sensitive sorting)
> These are not much code (the heavy duty stuff is already in core, 
> IndexableBinaryString). 
> And I would also like to move the 
> ICUCollationKeyFilter/ICUCollationKeyAnalyzer (along with the jar file they 
> depend on) also currently located in contrib/collation into a contrib/icu.
> This way, we can start looking at integrating other functionality from ICU 
> into a fully-fleshed out icu contrib.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2124) move JDK collation to core, ICU collation to ICU contrib

2009-12-08 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787654#action_12787654
 ] 

Steven Rowe commented on LUCENE-2124:
-

bq. this will move the contrib/collation JDK-based components to core

+1

bq. and later we should consider deprecating the alternatives that are not 
scalable.

The alternatives don't scale well, true, but they don't result in 
non-human-readable index terms, either, so for people that need human-readable 
index terms and who have a low-cardinality term set, maybe we should leave the 
alternatives in place?

bq. this will move the contrib/collation ICU based components to contrib/icu, 
and this is where I want to bring the unicode 5.2 support.

+1

> move JDK collation to core, ICU collation to ICU contrib
> 
>
> Key: LUCENE-2124
> URL: https://issues.apache.org/jira/browse/LUCENE-2124
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/*, Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2124.patch, LUCENE-2124.patch
>
>
> As mentioned on the list, I propose we move the JDK-based 
> CollationKeyFilter/CollationKeyAnalyzer, currently located in 
> contrib/collation into core for collation support (language-sensitive sorting)
> These are not much code (the heavy duty stuff is already in core, 
> IndexableBinaryString). 
> And I would also like to move the 
> ICUCollationKeyFilter/ICUCollationKeyAnalyzer (along with the jar file they 
> depend on) also currently located in contrib/collation into a contrib/icu.
> This way, we can start looking at integrating other functionality from ICU 
> into a fully-fleshed out icu contrib.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2009-12-03 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785524#action_12785524
 ] 

Steven Rowe commented on LUCENE-2074:
-

Thanks, Uwe, that makes sense.  My bad, I only skimmed the patch, and 
misunderstood "3.0" in one of the new files to refer to the Lucene version, not 
the Unicode version. :)

> Use a separate JFlex generated Unicode 4 by Java 5 compatible 
> StandardTokenizer
> ---
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
> (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
> regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. 
> Because of that we should only use the new TokenizerImpl when 
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2009-12-03 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785414#action_12785414
 ] 

Steven Rowe commented on LUCENE-2074:
-

bq. Do you see a problem with just requiring JFlex 1.5 for Lucene trunk at the 
moment?

I think it's fine to do that.

bq. The new parsers (see patch) are pre-generated in SVN, so somebody compiling 
lucene from source does not need to use jflex. And the parsers for 
StandardTokenizer are verified to work correctly and are even identical 
(DFA-wise) for the old Java 1.4 / Unicode 3.0 case.

Most of the StandardTokenizerImpl.jflex grammar is expressed in absolute terms 
- the only JVM-/Unicode-version-sensitive usages are [:letter:] and [:digit:], 
which under JFlex <1.5 were expanded using the scanner-generation-time JVM's 
Character.isLetter() and .isDigit() definitions, but under JFlex 1.5-SNAPSHOT 
depend on the declared Unicode version definitions (i.e., [:letter:] = 
\p{Letter}).

I'm actually surprised that the DFAs are identical, since I'm almost certain 
that the set of characters matching [:letter:] changed between Unicode 3.0 and 
Unicode 4.0 (maybe [:digit:] too).  I'll take a look this weekend.
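
(For that comparison, something like the following quick-and-dirty dump - not part 
of any patch, just a sketch - could be run under different JVMs and the outputs 
diffed, to see how the Character.isLetter()/isDigit() expansions of [:letter:] and 
[:digit:] differ across Unicode versions:)

{code}
// Dump every BMP code point the running JVM classifies as a letter and/or a
// digit.  Running this under a Java 1.4 JVM and a Java 5 JVM and diffing the
// output shows how the old scanner-generation-time expansion of [:letter:]
// and [:digit:] would differ between the two.
public class DumpLetterDigit {
  public static void main(String[] args) {
    for (int c = 0; c <= 0xFFFF; c++) {
      boolean letter = Character.isLetter((char) c);
      boolean digit = Character.isDigit((char) c);
      if (letter || digit) {
        System.out.println(Integer.toHexString(c).toUpperCase()
            + (letter ? " L" : "") + (digit ? " D" : ""));
      }
    }
  }
}
{code}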


> Use a separate JFlex generated Unicode 4 by Java 5 compatible 
> StandardTokenizer
> ---
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
> (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
> regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. 
> Because of that we should only use the new TokenizerImpl when 
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2009-12-03 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785344#action_12785344
 ] 

Steven Rowe commented on LUCENE-2074:
-

bq. Will the old jflex fail on %unicode {x.y} syntax ???

I haven't tested it, but JFlex <1.5 likely will fail on this syntax, since 
nothing is expected after the %unicode directive.

bq. Hopefully JFlex 1.5 comes out until we release 3.1, I would be happy.

I think the JFlex 1.5 release will happen before March of next year, since 
we're down to just a few blocking issues.


> Use a separate JFlex generated Unicode 4 by Java 5 compatible 
> StandardTokenizer
> ---
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
> (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
> regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. 
> Because of that we should only use the new TokenizerImpl when 
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2077) changes-to-html: better handling of bulleted lists in CHANGES.txt

2009-11-17 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2077:


Attachment: LUCENE-2077.patch

Patch to handle bulleted lists in CHANGES.txt, and remove  tag 
workarounds from CHANGES.txt.

> changes-to-html: better handling of bulleted lists in CHANGES.txt
> -
>
> Key: LUCENE-2077
> URL: https://issues.apache.org/jira/browse/LUCENE-2077
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Website
>Affects Versions: 2.9.1
>    Reporter: Steven Rowe
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-2077.patch
>
>
> - bulleted lists
> - should be rendered
> - as such
> - in output HTML

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2077) changes-to-html: better handling of bulleted lists in CHANGES.txt

2009-11-17 Thread Steven Rowe (JIRA)
changes-to-html: better handling of bulleted lists in CHANGES.txt
-

 Key: LUCENE-2077
 URL: https://issues.apache.org/jira/browse/LUCENE-2077
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Affects Versions: 2.9.1
Reporter: Steven Rowe
Priority: Trivial
 Fix For: 3.0


- bulleted lists
- should be rendered
- as such
- in output HTML

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1689) supplementary character handling

2009-11-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778528#action_12778528
 ] 

Steven Rowe commented on LUCENE-1689:
-

I don't know if this is the right place to point this out, but: JFlex-generated 
scanners (e.g. StandardAnalyzer) do not properly handle supplementary 
characters.

Unfortunately, it looks like the as-yet-unreleased JFlex 1.5 will not support 
supplementary characters either, so this will be a gap in Lucene's Unicode 
handling for a while.
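
(To make the gap concrete - this is just an illustrative snippet, not from any 
patch: a supplementary character occupies two chars in a Java String, so a 
char-at-a-time scanner sees the two surrogate halves rather than the single code 
point.)

{code}
public class SupplementaryDemo {
  public static void main(String[] args) {
    // 'a' followed by U+1D50A (MATHEMATICAL FRAKTUR CAPITAL G), which is
    // encoded in UTF-16 as the surrogate pair \uD835\uDD0A.
    String s = "a\uD835\uDD0A";
    System.out.println(s.length());                       // 3 chars
    System.out.println(s.codePointCount(0, s.length()));  // 2 code points

    // Code-point-aware iteration - the kind of loop an int-based
    // isTokenChar(int)/normalize(int) would need.
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
          + " letter? " + Character.isLetter(cp));
      i += Character.charCount(cp);
    }
  }
}
{code}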

> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, 
> LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

2009-10-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772211#action_12772211
 ] 

Steven Rowe commented on LUCENE-2019:
-

{quote}
Steven, by the way, I think something i havent been able to communicate 
properly, is that I feel very strongly that storing noncharacters in term text 
where they are treated as abstract characters, is very different than using 
them as sentinel values / delimiters / etc in the index format, I think this is 
ok and is what they are for.

but term text is different, search engines index human language and by putting 
noncharacters in term text you are treating them as abstract characters.
{quote}

Robert, you are a proponent of the (ICU)CollationKeyFilter functionality, which 
uses IndexableBinaryStringTools to store arbitrary binary data in a Lucene 
index.  These filters store non-human-readable terms in the index.  I can think 
of several other examples of using Lucene indexes to store non-human-language 
terms.

Character data, in addition to representing characters, is *data*.  Bits.  I 
would argue that you *always* need context to figure out what bits represent.

> map unicode process-internal codepoints to replacement character
> 
>
> Key: LUCENE-2019
> URL: https://issues.apache.org/jira/browse/LUCENE-2019
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store 
> these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can 
> be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF 
> process-internally; it can't be in the index or it will cause problems. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

2009-10-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772184#action_12772184
 ] 

Steven Rowe commented on LUCENE-2019:
-

bq. by disallowing all noncharacters as term text, lucene is *more free* to use 
them as delimiters, and sentinel values, and such, as specified in chapter 3 of 
the standard.

Lucene is more free, but Lucene's users are not.  Quite the contrary.

IMHO, Lucene's users (applications that incorporate the Lucene library) should 
be able to use Unicode data in ways that the standard allows ("Applications are 
free to use any of these noncharacter code points internally").

U+FFFF was chosen for Lucene-internal use for reasons very similar to those 
you're bringing up, Robert: something like "who would ever want to use 
non-characters in an index?"  However, this choice does not obligate Lucene to 
take the same action for all other non-characters.

I think the fix here is documentation, not proscription.


> map unicode process-internal codepoints to replacement character
> 
>
> Key: LUCENE-2019
> URL: https://issues.apache.org/jira/browse/LUCENE-2019
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store 
> these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can 
> be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF 
> process-internally; it can't be in the index or it will cause problems. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

2009-10-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772174#action_12772174
 ] 

Steven Rowe commented on LUCENE-2019:
-

bq. if you disagree with this patch, then you should also disagree with 
treating U+FFFF special! 

Quoting myself from an earlier comment on this issue (apologies):

bq. Instituting this consistency precludes Lucene-index-as-process-internal use 
cases. I would argue that the price of consistency is in this case too high.

So you think that enforcing consistency is worth the cost of disallowing some 
usages, and I don't.

> map unicode process-internal codepoints to replacement character
> 
>
> Key: LUCENE-2019
> URL: https://issues.apache.org/jira/browse/LUCENE-2019
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store 
> these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can 
> be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF 
> process-internally; it can't be in the index or it will cause problems. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

2009-10-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772164#action_12772164
 ] 

Steven Rowe commented on LUCENE-2019:
-

Lucene is not an application.

Again, quoting from section 16.7 (emphasis mine):

bq. *Applications* are free to use any of these noncharacter code points 
internally but should never attempt to exchange them.

The forbidden operation is exchanging non-characters across the *application* 
boundary.  

Asking Lucene to store non-characters for you is not a violation of the Unicode 
standard.  Lucene agreeing to do so is not a violation of the Unicode standard.

If a Lucene user later uses a Lucene index to exchange data (of whatever form) 
across the application boundary, that's on the user, not on Lucene.

(I'll skip the Lucene-as-a-weapon metaphor.  You can thank me later.)


> map unicode process-internal codepoints to replacement character
> 
>
> Key: LUCENE-2019
> URL: https://issues.apache.org/jira/browse/LUCENE-2019
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store 
> these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can 
> be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF 
> process-internally; it can't be in the index or it will cause problems. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

2009-10-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772151#action_12772151
 ] 

Steven Rowe commented on LUCENE-2019:
-

bq. process-internal is something that won't be stored or interchanged in any 
way (internal to the process)

Right, this is the crux of the disagreement: you think storage (with the 
exception of in-memory usage) means interchange.  I and Yonik think that 
storage does not necessarily mean interchange.

Section 16.7 (_Noncharacters_) of the Unicode 5.0.0 standard (the latest 
version for which an electronic version of this chapter is available), says:

{quote}
Noncharacters are code points that are permanently reserved in the Unicode 
Standard for internal use. They are forbidden for use in open interchange of 
Unicode text data. See Section 3.4, Characters and Encoding, for the formal 
definition of noncharacters and conformance requirements related to their use.

The Unicode Standard sets aside 66 noncharacter code points. The last two code 
points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE 
and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for 
a total of 34 code points. In addition, there is a contiguous range of another 
32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, 
the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A 
block, but those noncharacters are not "Arabic noncharacters" or "right-to-left 
noncharacters," and are not distinguished in any other way from the other 
noncharacters, except in their code point values.

Applications are free to use any of these noncharacter code points internally 
but should never attempt to exchange them. If a noncharacter is received in 
open interchange, an application is not required to interpret it in any way. It 
is good practice, however, to recognize it as a noncharacter and to take 
appropriate action, such as removing it from the text. Note that Unicode 
conformance freely allows the removal of these characters. (See conformance 
clause C7 in Section 3.2, Conformance Requirements.)

In effect, noncharacters can be thought of as application-internal private-use 
code points. Unlike the private-use characters discussed in Section 16.5, 
Private-Use Characters, which are assigned characters and which are intended 
for use in open interchange, subject to interpretation by private agreement, 
noncharacters are permanently reserved (unassigned) and have no interpretation 
whatsoever outside of their possible application-internal private uses.

*U+FFFF and U+10FFFF.*  These two noncharacter code points have the attribute 
of being associated with the largest code unit values for particular Unicode 
encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code 
unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit 
code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code 
points useful for internal purposes as sentinels. For example, they might be 
used to indicate the end of a list, to represent a value in an index guaranteed 
to be higher than any valid character value, and so on.
{quote}

(I left out the last part about U+FFFE.)

Again, the crux of the matter is the definition of "open interchange".
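
(Purely as an illustration, not part of the patch under discussion: the 66 
noncharacter code points the quoted section enumerates are easy to test for 
programmatically, e.g.:)

{code}
public class NoncharacterCheck {
  // True exactly for the 66 Unicode noncharacter code points described above:
  // U+FDD0..U+FDEF, plus U+xFFFE and U+xFFFF on each of the 17 planes.
  static boolean isNoncharacter(int codePoint) {
    return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
        || ((codePoint & 0xFFFE) == 0xFFFE && codePoint <= 0x10FFFF);
  }

  public static void main(String[] args) {
    int count = 0;
    for (int cp = 0; cp <= 0x10FFFF; cp++) {
      if (isNoncharacter(cp)) {
        count++;
      }
    }
    System.out.println(count); // prints 66
  }
}
{code}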

> map unicode process-internal codepoints to replacement character
> 
>
> Key: LUCENE-2019
> URL: https://issues.apache.org/jira/browse/LUCENE-2019
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store 
> these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can 
> be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF 
> process-internally; it can't be in the index or it will cause problems. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

2009-10-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772133#action_12772133
 ] 

Steven Rowe commented on LUCENE-2019:
-

bq. Steven, the only reason I might disagree is that a Lucene Index is supposed 
to be portable across different languages other than Lucene Java.

Right, but not all Lucene indexes in-the-wild are accessed from more than one 
language.  The vast majority of Lucene index uses, I'd venture to guess, are 
single-language, single-process uses.

bq. in my opinion, if you are to store process-internal codepoints as abstract 
characters in terms, then you should not claim that Lucene indexes are in any 
Unicode format, because then they violate the standard.

I strongly disagree with the assumption that interchange and serialization are 
synonymous.

bq. By *not* storing them in terms, then you are free to use them as 
delimiters, or other purposes. right now U+FFFF is used as a delimiter, but who 
knows, maybe someday you might need more?

I actually agree with this argument.  What if Lucene needs more 
process-internal characters?  I don't have any way of gauging the probability 
that it will in the future (other than the last eight years of history, during 
which only one was deemed necessary).  But what does Mike M. say? "Design for 
now" or something like that?

> map unicode process-internal codepoints to replacement character
> 
>
> Key: LUCENE-2019
> URL: https://issues.apache.org/jira/browse/LUCENE-2019
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store 
> these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can 
> be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF 
> process-internally; it can't be in the index or it will cause problems. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

2009-10-30 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772118#action_12772118
 ] 

Steven Rowe commented on LUCENE-2019:
-

Lucene indexes can be used both process-internally and across processes (e.g. 
Solr).

This patch enforces the Lucene-index-as-process-external view, and excludes the 
possibility that a Lucene index is used process-internally.

Since Lucene itself uses U+FFFF internally, no clients can use it for their own 
purposes.  This patch rationalizes handling of internal-use-only characters, 
such that Lucene's behavior is made consistent for all of them.

Instituting this consistency precludes Lucene-index-as-process-internal use 
cases.  I would argue that the price of consistency is in this case too high.

My vote: document the crap out of the U+FFFF Lucene-internal-use character and 
drop this patch.

If people want to use internal-use-only characters in Lucene indexes, as long 
as Lucene doesn't reserve them for its own use, why stop them?


> map unicode process-internal codepoints to replacement character
> 
>
> Key: LUCENE-2019
> URL: https://issues.apache.org/jira/browse/LUCENE-2019
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store 
> these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can 
> be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF 
> process-internally; it can't be in the index or it will cause problems. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1902) Changes.html not explicitly included in release

2009-09-08 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752770#action_12752770
 ] 

Steven Rowe commented on LUCENE-1902:
-

Maybe *Main* should be changed to be the conventional *Core* (the standard term 
when differentiating from *Contrib*) in the new Changes menu?

> Changes.html not explicitly included in release
> ---
>
> Key: LUCENE-1902
> URL: https://issues.apache.org/jira/browse/LUCENE-1902
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1902.patch, LUCENE-1902.patch
>
>
> None of the release-related ant targets explicitly call changes-to-html ... 
> this seems like an oversight.  (currently it's only called as part of the 
> nightly target)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1898) Decide if we should remove lines numbers from latest Changes

2009-09-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752323#action_12752323
 ] 

Steven Rowe edited comment on LUCENE-1898 at 9/7/09 8:55 PM:
-

Patch to changes2html.pl that can handle '\*' as bulleted item indicator.  Also 
converts numbered items in contrib/CHANGES.txt for 2.9 release to '\*' bullets. 
 

This patch incorporates Mark's numbered->bulleted modifications to CHANGES.txt, 
as well as correcting one numbered item that Mark missed, and converting tabs 
to spaces in the first  section, so that the method parameters line up in 
the output HTML.

  was (Author: steve_rowe):
Patch to changes2html.pl that can handle '*' as bulleted item indicator.  
Also converts numbered items in contrib/CHANGES.txt for 2.9 release to '*' 
bullets.  

This patch incorporates Mark's numbered->bulleted modifications to CHANGES.txt, 
as well as correcting one numbered item that Mark missed, and converting tabs 
to spaces in the first  section, so that the method parameters line up in 
the output HTML.
  
> Decide if we should remove lines numbers from latest Changes
> 
>
> Key: LUCENE-1898
> URL: https://issues.apache.org/jira/browse/LUCENE-1898
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1898.patch, LUCENE-1898.patch
>
>
> As Lucene dev has grown, a new issue has arisen - many times, new changes 
> invalidate old changes. A proper changes file should just list the changes 
> from the last version, not document the dev life of the issues. Keeping 
> changes in proper order now requires a lot of renumbering sometimes. The 
> numbers have no real meaning and could be added to more rich versions (such 
> as the html version) automatically if desired.
> I think an * makes a good replacement myself. The issues already have ids 
> that are stable, rather than the current, decorational numbers which are 
> subject to change over a dev cycle.
> I think we should replace the numbers with an asterisk for the 2.9 section and 
> going forward (i.e. 4. becomes *).
> If we don't get consensus very quickly, this issue won't block.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1898) Decide if we should remove lines numbers from latest Changes

2009-09-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752286#action_12752286
 ] 

Steven Rowe edited comment on LUCENE-1898 at 9/7/09 8:56 PM:
-

{{changes2html.pl}} doesn't fully grok the new format - items are numbered, but 
the asterisks are left in in some cases.  I'll work up a patch.

  was (Author: steve_rowe):
{{changes-to-html.pl}} doesn't fully grok the new format - items are 
numbered, but the asterisks are left in in some cases.  I'll work up a patch.
  
> Decide if we should remove lines numbers from latest Changes
> 
>
> Key: LUCENE-1898
> URL: https://issues.apache.org/jira/browse/LUCENE-1898
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1898.patch, LUCENE-1898.patch
>
>
> As Lucene dev has grown, a new issue has arisen - many times, new changes 
> invalidate old changes. A proper changes file should just list the changes 
> from the last version, not document the dev life of the issues. Keeping 
> changes in proper order now requires a lot of renumbering sometimes. The 
> numbers have no real meaning and could be added to more rich versions (such 
> as the html version) automatically if desired.
> I think an * makes a good replacement myself. The issues already have ids 
> that are stable, rather than the current, decorational numbers which are 
> subject to change over a dev cycle.
> I think we should replace the numbers with an asterisk for the 2.9 section and 
> going forward (i.e. 4. becomes *).
> If we don't get consensus very quickly, this issue won't block.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1898) Decide if we should remove lines numbers from latest Changes

2009-09-07 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1898:


Attachment: LUCENE-1898.patch

Patch to changes2html.pl that can handle '*' as bulleted item indicator.  Also 
converts numbered items in contrib/CHANGES.txt for 2.9 release to '*' bullets.  

This patch incorporates Mark's numbered->bulleted modifications to CHANGES.txt, 
as well as correcting one numbered item that Mark missed, and converting tabs 
to spaces in the first  section, so that the method parameters line up in 
the output HTML.

> Decide if we should remove lines numbers from latest Changes
> 
>
> Key: LUCENE-1898
> URL: https://issues.apache.org/jira/browse/LUCENE-1898
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1898.patch, LUCENE-1898.patch
>
>
> As Lucene dev has grown, a new issue has arisen - many times, new changes 
> invalidate old changes. A proper changes file should just list the changes 
> from the last version, not document the dev life of the issues. Keeping 
> changes in proper order now requires a lot of renumbering sometimes. The 
> numbers have no real meaning and could be added to more rich versions (such 
> as the html version) automatically if desired.
> I think an * makes a good replacement myself. The issues already have ids 
> that are stable, rather than the current, decorational numbers which are 
> subject to change over a dev cycle.
> I think we should replace the numbers with an asterisk for the 2.9 section and 
> going forward (i.e. 4. becomes *).
> If we don't get consensus very quickly, this issue won't block.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1898) Decide if we should remove lines numbers from latest Changes

2009-09-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752286#action_12752286
 ] 

Steven Rowe commented on LUCENE-1898:
-

{{changes-to-html.pl}} doesn't fully grok the new format - items are numbered, 
but the asterisks are left in in some cases.  I'll work up a patch.

> Decide if we should remove lines numbers from latest Changes
> 
>
> Key: LUCENE-1898
> URL: https://issues.apache.org/jira/browse/LUCENE-1898
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1898.patch
>
>
> As Lucene dev has grown, a new issue has arisen - many times, new changes 
> invalidate old changes. A proper changes file should just list the changes 
> from the last version, not document the dev life of the issues. Keeping 
> changes in proper order now requires a lot of renumbering sometimes. The 
> numbers have no real meaning and could be added to more rich versions (such 
> as the html version) automatically if desired.
> I think an * makes a good replacement myself. The issues already have ids 
> that are stable, rather than the current, decorational numbers which are 
> subject to change over a dev cycle.
> I think we should replace the numbers with an asterisk for the 2.9 section and 
> going forward (i.e. 4. becomes *).
> If we don't get consensus very quickly, this issue won't block.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1883) Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release

2009-09-02 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750477#action_12750477
 ] 

Steven Rowe commented on LUCENE-1883:
-

I searched just now but couldn't find an email thread I recall on java-dev 
between Doug Cutting and the RM at that point (several years ago) about 
modifying past releases' CHANGES.txt entries.  Doug's position, articulated 
both in that thread and elsewhere (IIRC), was that people depend on being able 
to do a diff between CHANGES.txt versions, so once a release was cut, the 
release notes should never change thereafter.


> Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release
> -
>
> Key: LUCENE-1883
> URL: https://issues.apache.org/jira/browse/LUCENE-1883
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Steven Rowe
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1883.patch
>
>
> I noticed a few typos in CHANGES.txt and contrib/CHANGES.txt.  (Once they 
> make it past a release, they're set in stone...)
> Will attach a patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1883) Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release

2009-09-01 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1883:


Attachment: LUCENE-1883.patch

patch with typos corrected

> Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release
> -
>
> Key: LUCENE-1883
> URL: https://issues.apache.org/jira/browse/LUCENE-1883
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>    Reporter: Steven Rowe
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1883.patch
>
>
> I noticed a few typos in CHANGES.txt and contrib/CHANGES.txt.  (Once they 
> make it past a release, they're set in stone...)
> Will attach a patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1883) Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release

2009-09-01 Thread Steven Rowe (JIRA)
Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release
-

 Key: LUCENE-1883
 URL: https://issues.apache.org/jira/browse/LUCENE-1883
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Steven Rowe
Priority: Trivial
 Fix For: 2.9
 Attachments: LUCENE-1883.patch

I noticed a few typos in CHANGES.txt and contrib/CHANGES.txt.  (Once they make 
it past a release, they're set in stone...)

Will attach a patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1883) Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release

2009-09-01 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1883:


Lucene Fields: [New, Patch Available]  (was: [New])

> Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release
> -
>
> Key: LUCENE-1883
> URL: https://issues.apache.org/jira/browse/LUCENE-1883
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>    Reporter: Steven Rowe
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1883.patch
>
>
> I noticed a few typos in CHANGES.txt and contrib/CHANGES.txt.  (Once they 
> make it past a release, they're set in stone...)
> Will attach a patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1873) Update site lucene-sandbox page

2009-09-01 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750148#action_12750148
 ] 

Steven Rowe commented on LUCENE-1873:
-

I think we should add generation of {{Contrib-Changes.html}} from 
{{contrib/CHANGES.txt}} to the {{changes-to-html}} target in {{build.xml}}:

{code:xml}

  

{code}

and then link to it from near the top of {{lucene-sandbox/index.xml}}, 
something like:

{code:html}
<p>
  See <a href="http://lucene.apache.org/java/2_9_0/changes/Contrib-Changes.html">Contrib 
  CHANGES</a> for changes included in the current release.
</p>
{code}

> Update site lucene-sandbox page
> ---
>
> Key: LUCENE-1873
> URL: https://issues.apache.org/jira/browse/LUCENE-1873
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1873.patch
>
>
> The page has misleading/bad info. One thing I would like to do - but I won't 
> attempt now (prob good for the modules issue) - is commit to one word - 
> contrib or sandbox. I think sandbox should be purged myself.
> The current page says that the sandbox is kind of a rats nest with various 
> early stage software that one day may make it into core - that info is 
> outdated I think. We should replace it, and also specify how the back compat 
> policy works in contrib eg each contrib can have its own policy, with the 
> default being no policy.
> We should also drop the piece about being open to Lucene's committers and 
> others - a bit outdated.
> We should also either include the other contribs, or change the wording to 
> indicate that the list is only a sampling of the many contribs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1865) Add a ton of missing license headers throughout test/demo/contrib

2009-09-01 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750128#action_12750128
 ] 

Steven Rowe commented on LUCENE-1865:
-

Two minor license nits:

* Mark's r808567 commit under this issue added license declarations to two 
files that already had them, though the original declarations are slightly 
differently worded (they contain copyright notices).  These two files now each 
contain two license declarations:

{{contrib/benchmark/src/java/org/apache/lucene/benchmark/package.html}}
{{contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/package.html}}

* I don't know if it matters, but the following three files contain license 
declarations that include copyright notices ("Copyright 2005 The Apache 
Software Foundation"), unlike all the license declarations Mark added recently:

{{contrib/instantiated/src/java/org/apache/lucene/store/instantiated/package.html}}
{{src/java/org/apache/lucene/search/function/package.html}}
{{src/java/org/apache/lucene/search/payloads/package.html}}


> Add a ton of missing license headers throughout test/demo/contrib
> -
>
> Key: LUCENE-1865
> URL: https://issues.apache.org/jira/browse/LUCENE-1865
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1865-part2.patch, LUCENE-1865.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1880) Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public

2009-08-31 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1880:


Attachment: LUCENE-1880.patch

trivial patch adding public access to currently package private constructors

> Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public
> 
>
> Key: LUCENE-1880
> URL: https://issues.apache.org/jira/browse/LUCENE-1880
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>    Reporter: Steven Rowe
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1880.patch
>
>
> In contrib/collation, the constructors for CollationKeyAnalyzer and 
> ICUCollationKeyAnalyzer are package private, and so are effectively unusable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1880) Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public

2009-08-31 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1880:


Lucene Fields: [New, Patch Available]  (was: [New])

> Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public
> 
>
> Key: LUCENE-1880
> URL: https://issues.apache.org/jira/browse/LUCENE-1880
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>    Reporter: Steven Rowe
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1880.patch
>
>
> In contrib/collation, the constructors for CollationKeyAnalyzer and 
> ICUCollationKeyAnalyzer are package private, and so are effectively unusable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1880) Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public

2009-08-31 Thread Steven Rowe (JIRA)
Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public


 Key: LUCENE-1880
 URL: https://issues.apache.org/jira/browse/LUCENE-1880
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Steven Rowe
Priority: Trivial
 Fix For: 2.9


In contrib/collation, the constructors for CollationKeyAnalyzer and 
ICUCollationKeyAnalyzer are package private, and so are effectively unusable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1876) Some contrib packages are missing a package.html

2009-08-31 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1876:


Attachment: collation-package.html

Here is {{package.html}} for contrib/collation, with content mostly stolen from 
class comments and test cases.

The Turkish collation example is mostly stolen from Robert Muir's 
TestTurkishCollation.java from LUCENE-1581.
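
(For reference, the usage example in the attached package.html is along these 
lines - a hedged sketch, assuming the contrib/collation class and package names; 
note the constructor needs LUCENE-1880 to be callable from outside the package:)

{code}
import java.text.Collator;
import java.util.Locale;

import org.apache.lucene.collation.CollationKeyAnalyzer;

// Analyze with a Turkish Collator so that indexed terms carry
// locale-correct collation keys (e.g. dotted vs. dotless i).
public class TurkishCollationExample {
  public static void main(String[] args) {
    Collator collator = Collator.getInstance(new Locale("tr", "TR"));
    CollationKeyAnalyzer analyzer = new CollationKeyAnalyzer(collator);
    System.out.println("analyzer ready: " + analyzer.getClass().getName());
  }
}
{code}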

> Some contrib packages are missing a package.html
> 
>
> Key: LUCENE-1876
> URL: https://issues.apache.org/jira/browse/LUCENE-1876
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Mark Miller
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: collation-package.html
>
>
> Dunno if we will get to this one this release, but a few contribs don't have 
> a package.html (or a good overview that would work as a replacement) - I 
> don't think this is hugely important, but I think it is important - you 
> should be able to easily and quickly read a quick overview for each contrib I 
> think.
> So far I have identified collation and spatial.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1873) Update site lucene-sandbox page

2009-08-31 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749524#action_12749524
 ] 

Steven Rowe commented on LUCENE-1873:
-

I'm +1 on switching away from "Sandbox" (no longer used at all) to Contrib.

Before you posted your patch, I had written up a new intro for the contrib 
index.html - feel free to take any of this or ignore it :) :

{code:html}
  <p>
    The Lucene Java project also contains a workspace, Lucene Contrib
    (formerly known as the Lucene Sandbox), that is open both to all Lucene 
    Java core committers and to developers whose commit rights are 
    restricted to the Contrib workspace; these developers are referred to 
    as "Contrib committers".  The Lucene Contrib workspace hosts the 
    following types of packages:
  </p>
  <ul>
    <li>Various third party contributions.</li>
    <li>
      Contributions with third party dependencies - the Lucene Java core
      distribution has no external runtime dependencies.
    </li>
    <li>
      New ideas that are intended for eventual inclusion into the Lucene 
      Java core.
    </li>
  </ul>
  <p>
    Users are free to experiment with the components developed in the
    Contrib workspace, but Contrib packages will not necessarily be
    maintained, particularly in their current state. The Lucene Java core 
    backwards compatibility commitments (see
    <a href="http://wiki.apache.org/lucene-java/BackwardsCompatibility"
      >http://wiki.apache.org/lucene-java/BackwardsCompatibility</a>)
    do not necessarily extend to the packages in the Contrib workspace.
    See the README.txt file for each Contrib package for details.  If the
    README.txt file does not address its backwards compatibility
    commitments, users should assume it does not make any compatibility
    commitments.
  </p>
{code}

> Update site lucene-sandbox page
> ---
>
> Key: LUCENE-1873
> URL: https://issues.apache.org/jira/browse/LUCENE-1873
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1873.patch
>
>
> The page has misleading/bad info. One thing I would like to do - but I won't 
> attempt now (prob good for the modules issue) - is commit to one word - 
> contrib or sandbox. I think sandbox should be purged myself.
> The current page says that the sandbox is kind of a rats nest with various 
> early stage software that one day may make it into core - that info is 
> outdated I think. We should replace it, and also specify how the back compat 
> policy works in contrib eg each contrib can have its own policy, with the 
> default being no policy.
> We should also drop the piece about being open to Lucene's committers and 
> others - a bit outdated.
> We should also either include the other contribs, or change the wording to 
> indicate that the list is only a sampling of the many contribs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

2009-07-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060
 ] 

Steven Rowe edited comment on LUCENE-1683 at 7/16/09 11:12 AM:
---

bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? 

JavaUtilRegexCapabilities.match() is implemented as 
j.u.regex.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", 
unless you explicitly append a "$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), 
which does not imply the trailing ".*".

The difference in the two implementations implies this is a kind of bug, 
especially since the javadoc "contract" on RegexCapabilities.match() just says 
"@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match() to use Matcher.matches() 
instead of lookingAt().

  was (Author: steve_rowe):
bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? 

JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), 
which is equivalent to adding a trailing ".*", unless you explicity append a 
"$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), 
which does not imply the trailing ".*".

The difference in the two implementations implies this is a kind of bug, 
especially since the javadoc "contract" on RegexCapabilities.match() just says 
"@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match() to use j.u.Matcher.matches() 
instead of lookingAt().
  
> RegexQuery matches terms the input regex doesn't actually match
> ---
>
> Key: LUCENE-1683
> URL: https://issues.apache.org/jira/browse/LUCENE-1683
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.3.2
>Reporter: Trejkaz
>
> I was writing some unit tests for our own wrapper around the Lucene regex 
> classes, and got tripped up by something interesting.
> The regex "cat." will match "cats" but also anything with "cat" and 1+ 
> following letters (e.g. "cathy", "catcher", ...)  It is as if there is an 
> implicit .* always added to the end of the regex.
> Here's a unit test for the behaviour I would expect myself:
> @Test
> public void testNecessity() throws Exception {
> File dir = new File(new File(System.getProperty("java.io.tmpdir")), 
> "index");
> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), 
> true);
> try {
> Document doc = new Document();
> doc.add(new Field("field", "cat cats cathy", Field.Store.YES, 
> Field.Index.TOKENIZED));
> writer.addDocument(doc);
> } finally {
> writer.close();
> }
> IndexReader reader = IndexReader.open(dir);
> try {
> TermEnum terms = new RegexQuery(new Term("field", 
> "cat.")).getEnum(reader);
> assertEquals("Wrong term", "cats", terms.term());
> assertFalse("Should have only been one term", terms.next());
> } finally {
> reader.close();
> }
> }
> This test fails on the term check with terms.term() equal to "cathy".
> Our workaround is to mangle the query like this:
> String fixed = String.format("(?:%s)$", original);

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

2009-07-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060
 ] 

Steven Rowe commented on LUCENE-1683:
-

bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? 

JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), 
which is equivalent to adding a trailing ".*", unless you explicitly append a 
"$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), 
which does not imply the trailing ".*".

The difference in the two implementations implies this is a kind of bug, 
especially since the javadoc "contract" on RegexCapabilities.match() just says 
"@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match() to use j.u.Matcher.matches() 
instead of lookingAt().

> RegexQuery matches terms the input regex doesn't actually match
> ---
>
> Key: LUCENE-1683
> URL: https://issues.apache.org/jira/browse/LUCENE-1683
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.3.2
>Reporter: Trejkaz
>
> I was writing some unit tests for our own wrapper around the Lucene regex 
> classes, and got tripped up by something interesting.
> The regex "cat." will match "cats" but also anything with "cat" and 1+ 
> following letters (e.g. "cathy", "catcher", ...)  It is as if there is an 
> implicit .* always added to the end of the regex.
> Here's a unit test for the behaviour I would expect myself:
> @Test
> public void testNecessity() throws Exception {
> File dir = new File(new File(System.getProperty("java.io.tmpdir")), 
> "index");
> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), 
> true);
> try {
> Document doc = new Document();
> doc.add(new Field("field", "cat cats cathy", Field.Store.YES, 
> Field.Index.TOKENIZED));
> writer.addDocument(doc);
> } finally {
> writer.close();
> }
> IndexReader reader = IndexReader.open(dir);
> try {
> TermEnum terms = new RegexQuery(new Term("field", 
> "cat.")).getEnum(reader);
> assertEquals("Wrong term", "cats", terms.term());
> assertFalse("Should have only been one term", terms.next());
> } finally {
> reader.close();
> }
> }
> This test fails on the term check with terms.term() equal to "cathy".
> Our workaround is to mangle the query like this:
> String fixed = String.format("(?:%s)$", original);

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

2009-06-28 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1719:


Attachment: LUCENE-1719.patch

Updated patch including information about ICU4J's shorter key length; adding a 
link to the ICU4J documentation's comparison of ICU4J and java.text.Collator 
key generation time and key length; and removing specific performance numbers.

> Add javadoc notes about ICUCollationKeyFilter's advantages over 
> CollationKeyFilter
> --
>
> Key: LUCENE-1719
> URL: https://issues.apache.org/jira/browse/LUCENE-1719
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4.1
>Reporter: Steven Rowe
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1719.patch, LUCENE-1719.patch
>
>
> contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is 
> faster than CollationKeyFilter, the JVM-provided java.text.Collator 
> implementation in the same package.  The javadocs of these classes should be 
> modified to add a note to this effect.
> My curiosity was piqued by [Robert Muir's 
> comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300]
>  on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x 
> faster than CollationKeyFilter.
> I timed the operation of these two classes, with Sun JVM versions 
> 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 
> 4 languages (taken from the corresponding Debian wordlist packages and 
> truncated to the first 90k words after a fixed random shuffling), using 
> Collators at the default strength, on a Windows Vista 64-bit machine.  I used 
> an analysis pipeline consisting of WhitespaceTokenizer chained to the 
> collation key filter, so to isolate the time taken by the collation key 
> filters, I also timed WhitespaceTokenizer operating alone for each 
> combination.  The rightmost column represents the performance advantage of 
> the ICU4J implementation (ICU) over the java.text.Collator implementation 
> (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-ICU) / 
> (ICU-WST). The best times out of 5 runs for each combination, in 
> milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J 
> Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|156%|
> |1.4.2_17 (32 bit)|French|716|243|14|207%|
> |1.4.2_17 (32 bit)|German|669|264|16|163%|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|102%|
> |1.5.0_15 (32 bit)|English|604|176|16|268%|
> |1.5.0_15 (32 bit)|French|817|209|17|317%|
> |1.5.0_15 (32 bit)|German|799|225|20|280%|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|145%|
> |1.5.0_15 (64 bit)|English|431|89|10|433%|
> |1.5.0_15 (64 bit)|French|562|112|11|446%|
> |1.5.0_15 (64 bit)|German|567|116|13|438%|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|174%|
> |1.6.0_13 (64 bit)|English|162|81|9|113%|
> |1.6.0_13 (64 bit)|French|192|92|10|122%|
> |1.6.0_13 (64 bit)|German|204|99|14|124%|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|39%|

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

2009-06-28 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725023#action_12725023
 ] 

Steven Rowe commented on LUCENE-1719:
-

bq. [...] i searched lucene source code for java.text.Collator and found some 
uses of it (the incremental facility). I wonder if in the future we could find 
a way to allow usage of com.ibm.icu.text.Collator in these spots.

+1

I guess the way to go would be to make the implementation pluggable.
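
For illustration only, a purely hypothetical sketch (none of these names exist in Lucene) of what a pluggable collation key source could look like, with JDK and ICU4J implementations behind one small interface:

{code}
// Hypothetical sketch -- illustrates the "pluggable" idea, nothing more.
public interface CollationKeySource {
    /** Returns a byte-comparable sort key for the given term text. */
    byte[] getSortKey(String text);
}

class JdkCollationKeySource implements CollationKeySource {
    private final java.text.Collator collator;
    JdkCollationKeySource(java.text.Collator collator) { this.collator = collator; }
    public byte[] getSortKey(String text) {
        return collator.getCollationKey(text).toByteArray();
    }
}

class IcuCollationKeySource implements CollationKeySource {
    private final com.ibm.icu.text.Collator collator;
    IcuCollationKeySource(com.ibm.icu.text.Collator collator) { this.collator = collator; }
    public byte[] getSortKey(String text) {
        return collator.getCollationKey(text).toByteArray();
    }
}
{code}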

> Add javadoc notes about ICUCollationKeyFilter's advantages over 
> CollationKeyFilter
> --
>
> Key: LUCENE-1719
> URL: https://issues.apache.org/jira/browse/LUCENE-1719
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4.1
>Reporter: Steven Rowe
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1719.patch
>
>
> contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is 
> faster than CollationKeyFilter, the JVM-provided java.text.Collator 
> implementation in the same package.  The javadocs of these classes should be 
> modified to add a note to this effect.
> My curiosity was piqued by [Robert Muir's 
> comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300]
>  on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x 
> faster than CollationKeyFilter.
> I timed the operation of these two classes, with Sun JVM versions 
> 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 
> 4 languages (taken from the corresponding Debian wordlist packages and 
> truncated to the first 90k words after a fixed random shuffling), using 
> Collators at the default strength, on a Windows Vista 64-bit machine.  I used 
> an analysis pipeline consisting of WhitespaceTokenizer chained to the 
> collation key filter, so to isolate the time taken by the collation key 
> filters, I also timed WhitespaceTokenizer operating alone for each 
> combination.  The rightmost column represents the performance advantage of 
> the ICU4J implementation (ICU) over the java.text.Collator implementation 
> (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-ICU) / 
> (ICU-WST). The best times out of 5 runs for each combination, in 
> milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J 
> Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|156%|
> |1.4.2_17 (32 bit)|French|716|243|14|207%|
> |1.4.2_17 (32 bit)|German|669|264|16|163%|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|102%|
> |1.5.0_15 (32 bit)|English|604|176|16|268%|
> |1.5.0_15 (32 bit)|French|817|209|17|317%|
> |1.5.0_15 (32 bit)|German|799|225|20|280%|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|145%|
> |1.5.0_15 (64 bit)|English|431|89|10|433%|
> |1.5.0_15 (64 bit)|French|562|112|11|446%|
> |1.5.0_15 (64 bit)|German|567|116|13|438%|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|174%|
> |1.6.0_13 (64 bit)|English|162|81|9|113%|
> |1.6.0_13 (64 bit)|French|192|92|10|122%|
> |1.6.0_13 (64 bit)|German|204|99|14|124%|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|39%|

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

2009-06-28 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1719:


Description: 
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is 
faster than CollationKeyFilter, the JVM-provided java.text.Collator 
implementation in the same package.  The javadocs of these classes should be 
modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's 
comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300]
 on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x 
faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 
1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages 
(taken from the corresponding Debian wordlist packages and truncated to the 
first 90k words after a fixed random shuffling), using Collators at the default 
strength, on a Windows Vista 64-bit machine.  I used an analysis pipeline 
consisting of WhitespaceTokenizer chained to the collation key filter, so to 
isolate the time taken by the collation key filters, I also timed 
WhitespaceTokenizer operating alone for each combination.  The rightmost column 
represents the performance advantage of the ICU4J implementation (ICU) over the 
java.text.Collator implementation (JVM), after discounting the 
WhitespaceTokenizer time (WST): (JVM-ICU) / (ICU-WST). The best times out of 5 
runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|156%|
|1.4.2_17 (32 bit)|French|716|243|14|207%|
|1.4.2_17 (32 bit)|German|669|264|16|163%|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|102%|
|1.5.0_15 (32 bit)|English|604|176|16|268%|
|1.5.0_15 (32 bit)|French|817|209|17|317%|
|1.5.0_15 (32 bit)|German|799|225|20|280%|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|145%|
|1.5.0_15 (64 bit)|English|431|89|10|433%|
|1.5.0_15 (64 bit)|French|562|112|11|446%|
|1.5.0_15 (64 bit)|German|567|116|13|438%|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|174%|
|1.6.0_13 (64 bit)|English|162|81|9|113%|
|1.6.0_13 (64 bit)|French|192|92|10|122%|
|1.6.0_13 (64 bit)|German|204|99|14|124%|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|39%|


  was:
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is 
faster than CollationKeyFilter, the JVM-provided java.text.Collator 
implementation in the same package.  The javadocs of these classes should be 
modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's 
comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300]
 on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x 
faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 
1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages 
(taken from the corresponding Debian wordlist packages and truncated to the 
first 90k words after a fixed random shuffling), using Collators at the default 
strength, on a Windows Vista 64-bit machine.  I used an analysis pipeline 
consisting of WhitespaceTokenizer chained to the collation key filter, so to 
isolate the time taken by the collation key filters, I also timed 
WhitespaceTokenizer operating alone for each combination.  The rightmost column 
represents the performance advantage of the ICU4J implementation (ICU) over the 
java.text.Collator implementation (JVM), after discounting the 
WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 
runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|


Summary: Add javadoc notes about ICUCollationKeyFilter's advantages 
over CollationKeyFilter  (was: Add javadoc notes about ICUCollationKeyFilter's 
speed advantage over CollationKeyFilter)

Edited title to reflect the addition of key length concerns, and switched the 
performance improvement column to percentage improvements rather than 
multiples.

[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter

2009-06-28 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724974#action_12724974
 ] 

Steven Rowe commented on LUCENE-1719:
-

Cool! Thanks for the link, Robert.

Key comparison under Lucene when using *CollationKeyAnalyzer will utilize 
neither ICU4J's nor the java.text incremental collation facilities - the 
base-8000h-String-encoded raw collation keys will be directly compared (and 
sorted) as Strings.  So key generation time and, as you point out, key length 
are the appropriate measures here.

I'll post a patch shortly that includes your ICU4J link, and mentions the key 
length aspect as well.  I'll also remove specific numbers from the javadoc 
notes - people can follow the ICU4J link if they're interested.
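
For illustration only, a rough sketch of the kind of measurement being described (not the actual benchmark code; the tiny word list is just a stand-in for the 90k-word lists):

{code}
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollationKeyComparison {
    public static void main(String[] args) {
        // Stand-in for the 90k-word lists used in the real timings.
        List<String> words = Arrays.asList("cote", "coté", "côte", "côté");

        java.text.Collator jdk = java.text.Collator.getInstance(Locale.FRENCH);
        com.ibm.icu.text.Collator icu =
            com.ibm.icu.text.Collator.getInstance(new com.ibm.icu.util.ULocale("fr"));

        long t0 = System.nanoTime();
        int jdkBytes = 0;
        for (String w : words) {
            jdkBytes += jdk.getCollationKey(w).toByteArray().length;
        }
        long jdkNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        int icuBytes = 0;
        for (String w : words) {
            icuBytes += icu.getCollationKey(w).toByteArray().length;
        }
        long icuNanos = System.nanoTime() - t1;

        // Both key generation time and total key length matter here, since the
        // encoded keys are what get indexed and compared as Strings.
        System.out.println("java.text: " + jdkNanos + " ns, " + jdkBytes + " key bytes");
        System.out.println("ICU4J:     " + icuNanos + " ns, " + icuBytes + " key bytes");
    }
}
{code}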

> Add javadoc notes about ICUCollationKeyFilter's speed advantage over 
> CollationKeyFilter
> ---
>
> Key: LUCENE-1719
> URL: https://issues.apache.org/jira/browse/LUCENE-1719
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>    Affects Versions: 2.4.1
>Reporter: Steven Rowe
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1719.patch
>
>
> contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is 
> faster than CollationKeyFilter, the JVM-provided java.text.Collator 
> implementation in the same package.  The javadocs of these classes should be 
> modified to add a note to this effect.
> My curiosity was piqued by [Robert Muir's 
> comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300]
>  on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x 
> faster than CollationKeyFilter.
> I timed the operation of these two classes, with Sun JVM versions 
> 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 
> 4 languages (taken from the corresponding Debian wordlist packages and 
> truncated to the first 90k words after a fixed random shuffling), using 
> Collators at the default strength, on a Windows Vista 64-bit machine.  I used 
> an analysis pipeline consisting of WhitespaceTokenizer chained to the 
> collation key filter, so to isolate the time taken by the collation key 
> filters, I also timed WhitespaceTokenizer operating alone for each 
> combination.  The rightmost column represents the performance advantage of 
> the ICU4J implementation (ICU) over the java.text.Collator implementation 
> (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / 
> (ICU-WST). The best times out of 5 runs for each combination, in 
> milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J 
> Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|2.6x|
> |1.4.2_17 (32 bit)|French|716|243|14|3.1x|
> |1.4.2_17 (32 bit)|German|669|264|16|2.6x|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
> |1.5.0_15 (32 bit)|English|604|176|16|3.7x|
> |1.5.0_15 (32 bit)|French|817|209|17|4.2x|
> |1.5.0_15 (32 bit)|German|799|225|20|3.8x|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
> |1.5.0_15 (64 bit)|English|431|89|10|5.3x|
> |1.5.0_15 (64 bit)|French|562|112|11|5.5x|
> |1.5.0_15 (64 bit)|German|567|116|13|5.4x|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
> |1.6.0_13 (64 bit)|English|162|81|9|2.1x|
> |1.6.0_13 (64 bit)|French|192|92|10|2.2x|
> |1.6.0_13 (64 bit)|German|204|99|14|2.2x|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-06-27 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724926#action_12724926
 ] 

Steven Rowe commented on LUCENE-1581:
-

{quote}
you could add the JDK collation key filter to core if you wanted a core fix.

but the icu one is up to something like 30x faster than the jdk, so why bother 
:)
{quote}

LUCENE-1719 contains some timings I made comparing the relative speeds of these 
two implementations.  In short, for the platform/language/collator/JVM version 
combinations I tested, the ICU4J implementation's speed advantage ranges from 
1.4x to 5.5x.

> LowerCaseFilter should be able to be configured to use a specific locale.
> -
>
> Key: LUCENE-1581
> URL: https://issues.apache.org/jira/browse/LUCENE-1581
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Digy
> Attachments: TestTurkishCollation.java
>
>
> //Since I am a .Net programmer, the sample code will be in C#, but I don't think 
> it will be a problem to understand it.
> //
> Assume an input text like "İ" and and analyzer like below
> {code}
>   public class SomeAnalyzer : Analyzer
>   {
>   public override TokenStream TokenStream(string fieldName, 
> System.IO.TextReader reader)
>   {
>   TokenStream t = new SomeTokenizer(reader);
>   t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>   t = new LowerCaseFilter(t);
>   return t;
>   }
> 
>   }
> {code}
>   
> ASCIIFoldingFilter will return "I", and afterwards LowerCaseFilter will return
>   "i" (if the locale is "en-US") 
>   or 
>   "ı" (if the locale is "tr-TR") (which means this token should be input to 
> another instance of ASCIIFoldingFilter).
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, 
> but a better approach would be to add 
> a new constructor to LowerCaseFilter and force it to use a specific locale.
> {code}
> public sealed class LowerCaseFilter : TokenFilter
> {
> /* +++ */System.Globalization.CultureInfo CultureInfo = 
> System.Globalization.CultureInfo.CurrentCulture;
> public LowerCaseFilter(TokenStream in) : base(in)
> {
> }
> /* +++ */  public LowerCaseFilter(TokenStream in, 
> System.Globalization.CultureInfo CultureInfo) : base(in)
> /* +++ */  {
> /* +++ */  this.CultureInfo = CultureInfo;
> /* +++ */  }
>   
> public override Token Next(Token result)
> {
> result = Input.Next(result);
> if (result != null)
> {
> char[] buffer = result.TermBuffer();
> int length = result.termLength;
> for (int i = 0; i < length; i++)
> /* +++ */ buffer[i] = 
> System.Char.ToLower(buffer[i],CultureInfo);
> return result;
> }
> else
> return null;
> }
> }
> {code}
> DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter

2009-06-27 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724923#action_12724923
 ] 

Steven Rowe commented on LUCENE-1719:
-

I also tested ICU4J version 4.2 (released 6 weeks ago), and the timings were 
nearly identical to those from ICU4J version 4.0 (the one that's in 
contrib/collation/lib/).

The timings given in the table above were not produced with the "-server" 
option to the JVM.  I separately tested all combinations using the "-server" 
option: it made no difference for the 32-bit JVMs, and made the 64-bit JVMs 
roughly 3-4% faster.  I got the impression (didn't actually calculate) 
that although the best times of 5 runs were better for the 64-bit JVMs when 
using the "-server" option, the average times seemed to be slightly worse.  In 
any case, the performance improvement of the ICU4J implementation over the 
java.text.Collator implementation was basically unaffected by the use of the 
"-server" JVM option.


> Add javadoc notes about ICUCollationKeyFilter's speed advantage over 
> CollationKeyFilter
> ---
>
> Key: LUCENE-1719
> URL: https://issues.apache.org/jira/browse/LUCENE-1719
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4.1
>Reporter: Steven Rowe
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1719.patch
>
>
> contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is 
> faster than CollationKeyFilter, the JVM-provided java.text.Collator 
> implementation in the same package.  The javadocs of these classes should be 
> modified to add a note to this effect.
> My curiosity was piqued by [Robert Muir's 
> comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300]
>  on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x 
> faster than CollationKeyFilter.
> I timed the operation of these two classes, with Sun JVM versions 
> 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 
> 4 languages (taken from the corresponding Debian wordlist packages and 
> truncated to the first 90k words after a fixed random shuffling), using 
> Collators at the default strength, on a Windows Vista 64-bit machine.  I used 
> an analysis pipeline consisting of WhitespaceTokenizer chained to the 
> collation key filter, so to isolate the time taken by the collation key 
> filters, I also timed WhitespaceTokenizer operating alone for each 
> combination.  The rightmost column represents the performance advantage of 
> the ICU4J implementation (ICU) over the java.text.Collator implementation 
> (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / 
> (ICU-WST). The best times out of 5 runs for each combination, in 
> milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J 
> Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|2.6x|
> |1.4.2_17 (32 bit)|French|716|243|14|3.1x|
> |1.4.2_17 (32 bit)|German|669|264|16|2.6x|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
> |1.5.0_15 (32 bit)|English|604|176|16|3.7x|
> |1.5.0_15 (32 bit)|French|817|209|17|4.2x|
> |1.5.0_15 (32 bit)|German|799|225|20|3.8x|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
> |1.5.0_15 (64 bit)|English|431|89|10|5.3x|
> |1.5.0_15 (64 bit)|French|562|112|11|5.5x|
> |1.5.0_15 (64 bit)|German|567|116|13|5.4x|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
> |1.6.0_13 (64 bit)|English|162|81|9|2.1x|
> |1.6.0_13 (64 bit)|French|192|92|10|2.2x|
> |1.6.0_13 (64 bit)|German|204|99|14|2.2x|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


