[jira] Commented: (LUCENE-2358) rename KeywordMarkerTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851659#action_12851659 ]

Steven Rowe commented on LUCENE-2358:
-------------------------------------

Sorry for cluttering this issue...

{quote}
I'm not really sure the KeywordAttribute is the best fit here, because its purpose is for the token to not be changed by some later filter. I'm not sure how your filter works (I would have to see the patch), but I think using this attribute for this purpose could introduce some bugs? I guess the key is that it's not really a private-use attribute; these things are visible to all tokenstreams, so stemmers etc. will see your 'internal' attribute.
{quote}

Yep, you're right, I hadn't thought it through that far.

{quote}
bq. Would it make sense to have a generalized boolean attribute [...]?

I don't really think so. Since there can only be one of any attribute in the tokenstream, you would have various TokenFilters clashing on how they interpret and use some generic boolean attribute!
{quote}

Um, yes, I should have realized that... (Re-writing private FillerTokenAttribute! Hooray!)

> rename KeywordMarkerTokenFilter
> -------------------------------
>
>                 Key: LUCENE-2358
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2358
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Analysis
>            Reporter: Robert Muir
>            Priority: Trivial
>         Attachments: LUCENE-2358.patch
>
> I would like to rename KeywordMarkerTokenFilter to KeywordMarkerFilter.
> We haven't released it yet, so it's a good time to keep the name brief and consistent.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2358) rename KeywordMarkerTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851652#action_12851652 ]

Steven Rowe commented on LUCENE-2358:
-------------------------------------

Hi Robert,

I'm working on a change to ShingleFilter to not output "_" filler token unigrams (or more generally, filler-only n-grams, to cover the case where position increment gaps exceed n). I needed to be able to mark cached tokens as filler tokens (or not) - a boolean attribute. After trying to write a new private-use attribute and failing (I didn't make both an interface and an implementation, I think - I should figure it out and improve the docs, I guess), I found KeywordAttribute and have used it to mark whether or not a cached token is a filler token (keyword:yes => filler-token:yes).

Would it make sense to have a generalized boolean attribute, specialized for keywords or (fill-in-the-blank)? It's a small leap to say that "iskeyword" means true for whatever boolean attribute you want to carry, so this isn't really a big deal, but I thought I'd bring it up while you're thinking about naming this thing. (This may be a can of worms: if there is a generic boolean attribute, should there be generic string/int/float/etc. attributes too?)

Steve
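The interface-plus-implementation convention mentioned above (the likely reason the private-use attribute attempt failed) can be sketched in plain Java. The names FillerAttribute and FillerAttributeImpl are hypothetical; in real Lucene code the interface would extend org.apache.lucene.util.Attribute, the class would extend AttributeImpl, and AttributeSource's default AttributeFactory locates the implementation by appending "Impl" to the interface name, much like the reflective lookup below:

```java
public class AttributeConventionDemo {

    // The attribute itself is declared as an interface...
    public interface FillerAttribute {
        void setFiller(boolean filler);
        boolean isFiller();
    }

    // ...and needs a separate concrete class whose name is the
    // interface name plus "Impl" - both parts are required.
    public static class FillerAttributeImpl implements FillerAttribute {
        private boolean filler;
        @Override public void setFiller(boolean filler) { this.filler = filler; }
        @Override public boolean isFiller() { return filler; }
    }

    // Mimics how Lucene's default AttributeFactory resolves an attribute
    // interface to its implementation: append "Impl" to the binary name
    // and instantiate it reflectively.
    static <A> A addAttribute(Class<A> attClass) throws Exception {
        String implName = attClass.getName() + "Impl";
        return attClass.cast(Class.forName(implName).getDeclaredConstructor().newInstance());
    }

    public static void main(String[] args) throws Exception {
        FillerAttribute att = addAttribute(FillerAttribute.class);
        att.setFiller(true);
        System.out.println(att.isFiller());
    }
}
```

Declaring only the interface (or only the class) leaves the reflective lookup nothing to instantiate, which matches the "didn't make both an interface and an implementation" failure described above.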
[jira] Commented: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
[ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842505#action_12842505 ]

Steven Rowe commented on LUCENE-2302:
-------------------------------------

bq. A CollationFilter will not be needed anymore after that change, as any Tokenizer chain that wants to use collation can simply supply a special AttributeFactory to the ctor that creates a special TermAttributeImpl class with a modified getBytesRef().

Mike M. noted on [LUCENE-1435|http://issues.apache.org/jira/browse/LUCENE-1435?focusedCommentId=12646667&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12646667] that the way to do "internal-to-indexing" collation is to store the original string in the term dictionary, sorted via user-specifiable collation.

> Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
> --------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2302
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2302
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: Flex Branch
>            Reporter: Uwe Schindler
>             Fix For: Flex Branch
>
> For flexible indexing, terms can be simple byte[] arrays, while the current TermAttribute only supports char[]. This is fine for plain text, but e.g. NumericTokenStream should work directly on the byte[] array.
> Also, TermAttribute lacks some interfaces that would make it simpler for users to work with: Appendable and CharSequence.
> I propose to create a new interface "CharTermAttribute" with a clean new API that concentrates on CharSequence and Appendable.
> The implementation class will simply support the old and new interfaces, working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of this. So if somebody adds a TermAttribute, he will get an implementation class that can also be used as a CharTermAttribute. As both attributes create the same impl instance, both calls to addAttribute are equal. So a TokenFilter that adds CharTermAttribute to the source will work with the same instance as the Tokenizer that requested the (deprecated) TermAttribute.
> To also support byte[]-only terms, as Collation or NumericField needs, a separate getter-only interface will be added that returns a reusable BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will also support this interface. For backwards compatibility with old self-made TermAttribute implementations, the indexer will check with hasAttribute() whether the BytesRef getter interface is there; if not, it will wrap an old-style TermAttribute in a deprecated wrapper class - new BytesRefGetterAttributeWrapper(TermAttribute) - which the indexer then uses.
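The "store the original string in the term dictionary, sorted via user-specifiable collation" idea can be illustrated with the JDK's java.text.Collator: a CollationKey's binary form orders identically to Collator.compare, which is what makes it usable as sortable term bytes. The locale and sample strings below are illustrative choices, not anything prescribed by the issue:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationKeyDemo {
    public static void main(String[] args) {
        // A locale-sensitive collator; FRENCH is an arbitrary example here.
        Collator collator = Collator.getInstance(Locale.FRENCH);
        String[] terms = {"côte", "coté", "cote", "côté"};
        // Sorting by collation keys gives the same order as
        // Collator.compare - the property that lets an index store
        // collation-key bytes and get locale-correct term ordering.
        Arrays.sort(terms, (a, b) -> {
            CollationKey ka = collator.getCollationKey(a);
            CollationKey kb = collator.getCollationKey(b);
            return ka.compareTo(kb);
        });
        System.out.println(Arrays.toString(terms));
    }
}
```

The same principle is what a byte[]-capable term attribute enables: getBytesRef() could hand the indexer collation-key bytes instead of UTF-8 characters.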
[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838081#action_12838081 ]

Steven Rowe commented on LUCENE-2167:
-------------------------------------

I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and both Unicode versions 5.1 and 5.2 - you can see the files here:

http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/

The files are UnicodeWordBreakRules_5_*.* - these are written to: parse the Unicode test files; run the generated scanner against each composed test string; output the break opportunities/prohibitions in the same format as the test files; and then finally compare the output against the test file itself, looking for a match. (These tests currently pass.)

The .flex files would need to be significantly changed to be used as a StandardTokenizer replacement, but you can get an idea from them of how to implement the Unicode word break rules in (as-yet-unreleased version 1.5.0) JFlex syntax.

> StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
>            Reporter: Shyamal Prasad
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The Javadoc for StandardTokenizer states:
> {quote}
> Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
> Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
> {quote}
> This is not accurate. The actual JFlex implementation treats hyphens interchangeably with punctuation. So, for example, "video,mp4,test" results in a *single* token and not three tokens as the documentation would suggest.
> Additionally, the documentation suggests that "video-mp4-test-again" would become a single token, but in reality it results in two tokens: "video-mp4-test" and "again".
> IMHO the parser implementation is fine as-is, since it is hard to keep everyone happy, but it is probably worth cleaning up the documentation string.
> The patch included here updates the documentation string and adds a few test cases to confirm the cases described above.
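For readers without JFlex at hand, the JDK's java.text.BreakIterator implements the same Unicode (UAX #29) word break rules that the grammars above test, so a stdlib sketch can show the break opportunities involved. The helper name and the letter/digit filter are my own illustration, not any Lucene or JFlex API:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakDemo {
    // Collect the segments between UAX #29 word break opportunities,
    // keeping only those that start with a letter or digit (i.e. drop
    // whitespace and punctuation segments such as the commas).
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String seg = text.substring(start, end);
            if (!seg.isEmpty() && Character.isLetterOrDigit(seg.codePointAt(0))) {
                out.add(seg);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // UAX #29 breaks at the commas here, while mixed letter+digit
        // runs like "mp4" stay whole - unlike the single-token behavior
        // of the era's StandardTokenizer described in this issue.
        System.out.println(words("video,mp4,test"));
    }
}
```

This contrast is exactly why the later grammar work was needed: the JFlex-generated StandardTokenizer of the time did not follow UAX #29.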
[jira] Issue Comment Edited: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838081#action_12838081 ]

Steven Rowe edited comment on LUCENE-2167 at 2/24/10 11:27 PM:
---------------------------------------------------------------

I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and both Unicode versions 5.1 and 5.2 - you can see the files here:

http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/

The files are {{UnicodeWordBreakRules_5_\*.\*}} - these are written to: parse the Unicode test files; run the generated scanner against each composed test string; output the break opportunities/prohibitions in the same format as the test files; and then finally compare the output against the test file itself, looking for a match. (These tests currently pass.)

The .flex files would need to be significantly changed to be used as a StandardTokenizer replacement, but you can get an idea from them of how to implement the Unicode word break rules in (as-yet-unreleased version 1.5.0) JFlex syntax.
[jira] Issue Comment Edited: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806565#action_12806565 ]

Steven Rowe edited comment on LUCENE-2218 at 1/29/10 11:48 PM:
---------------------------------------------------------------

Solr support for the ShingleFilter improvements implemented here: SOLR-1740

was (Author: steve_rowe):

Solr support for the ShingleFilter improvements implemented here

> ShingleFilter improvements
> --------------------------
>
>                 Key: LUCENE-2218
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2218
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 3.0
>            Reporter: Steven Rowe
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2118.patch, LUCENE-2218.patch
>
> ShingleFilter should allow configuration of minimum shingle size (in addition to maximum shingle size), so that it's possible to (e.g.) output only trigrams instead of bigrams mixed with trigrams. The token separator used in composing shingles should be configurable too.
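What configurable min/max shingle size and a custom separator mean can be sketched with a plain-Java (non-Lucene) n-gram generator over a token list. The method name and signature are illustrative only, not ShingleFilter's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleDemo {
    // Build all token n-grams with minSize <= n <= maxSize, joining
    // tokens with the given separator (a space, in ShingleFilter's case,
    // before it became configurable).
    static List<String> shingles(List<String> tokens, int minSize, int maxSize, String sep) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int n = minSize; n <= maxSize && start + n <= tokens.size(); n++) {
                out.add(String.join(sep, tokens.subList(start, start + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // With minSize == maxSize == 3, only trigrams come out - no
        // bigrams mixed in, which is the behavior this issue enables.
        System.out.println(shingles(List.of("please", "divide", "this", "sentence"), 3, 3, " "));
    }
}
```

With the pre-improvement fixed minimum of 2, the same input would also emit the bigrams "please divide", "divide this", and "this sentence".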
[jira] Commented: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806409#action_12806409 ]

Steven Rowe commented on LUCENE-2218:
-------------------------------------

I see that SOLR-1674 introduced a new class, TestShingleFilterFactory, but SOLR-1657 doesn't have any changes to ShingleFilterFactory, and the list in your description doesn't include it. Are there other Solr-Lucene-3.0-analysis issues I'm missing?
[jira] Commented: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806402#action_12806402 ]

Steven Rowe commented on LUCENE-2218:
-------------------------------------

Thanks, Robert. I plan on creating a Solr issue to integrate these ShingleFilter changes into ShingleFilterFactory. I haven't followed your (and others') work moving Solr closer to upgrading to Lucene 3.0 - are there issues there that I should be aware of?
[jira] Issue Comment Edited: (LUCENE-2223) ShingleFilter benchmark
[ https://issues.apache.org/jira/browse/LUCENE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801692#action_12801692 ]

Steven Rowe edited comment on LUCENE-2223 at 1/18/10 7:13 AM:
--------------------------------------------------------------

bq. This appears to work well, the only thing I would ask for is a simple test for the task (maybe especially testing the option that changes the wrapped analyzer's classname from the default std. analyzer)

Done in attached patch - thanks for catching this oversight. In constructing the test, I noticed that I had not brought over the analyzer package abbreviation logic from NewAnalyzerTask; this is now present in NewShingleAnalyzerTask, so that "analyzer:WhitespaceAnalyzer" is functional as a param.

*Edit*: Also removed some debug printing I'd forgotten to remove from NewShingleAnalyzerTask.

> ShingleFilter benchmark
> -----------------------
>
>                 Key: LUCENE-2223
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2223
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>    Affects Versions: 3.0
>            Reporter: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2223.patch, LUCENE-2223.patch
>
> Spawned from LUCENE-2218: a benchmark for ShingleFilter, along with a new task to instantiate (non-default-constructor) ShingleAnalyzerWrapper: NewShingleAnalyzerTask.
> The included shingle.alg runs ShingleAnalyzerWrapper, wrapping the default StandardAnalyzer, with 4 different configurations over 10,000 Reuters documents each. To allow ShingleFilter timings to be isolated from the rest of the pipeline, StandardAnalyzer is also run over the same set of Reuters documents. This set of 5 runs is then run 5 times.
> The patch includes two perl scripts: the first outputs JIRA-table-formatted timing information, with the minimum elapsed time for each of the 4 ShingleAnalyzerWrapper runs and the StandardAnalyzer run; the second compares two runs' JIRA output, producing another JIRA table showing % improvement.
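The "% improvement" comparison described above is simple arithmetic over each run set's minimum elapsed time. The formula below is a common benchmarking convention and an assumption on my part, not necessarily what the issue's perl scripts compute:

```java
public class ImprovementDemo {
    // Percent improvement of a new run over a baseline, each summarized
    // by its minimum elapsed time (the statistic the benchmark reports).
    // Positive means the new run is faster.
    static double percentImprovement(long[] baselineMillis, long[] newMillis) {
        long base = min(baselineMillis);
        long latest = min(newMillis);
        return 100.0 * (base - latest) / base;
    }

    static long min(long[] xs) {
        long m = xs[0];
        for (long x : xs) m = Math.min(m, x);
        return m;
    }

    public static void main(String[] args) {
        // Five repetitions per run, mirroring the shingle.alg setup;
        // the timings themselves are made up for illustration.
        System.out.println(percentImprovement(
                new long[]{1200, 1150, 1180, 1210, 1160},
                new long[]{1000, 980, 1020, 990, 1010}));
    }
}
```

Taking the minimum rather than the mean is a standard way to reduce noise from GC pauses and OS scheduling in JVM micro-benchmarks.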
[jira] Updated: (LUCENE-2223) ShingleFilter benchmark
[ https://issues.apache.org/jira/browse/LUCENE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2223:
--------------------------------

    Attachment: LUCENE-2223.patch

bq. This appears to work well, the only thing I would ask for is a simple test for the task (maybe especially testing the option that changes the wrapped analyzer's classname from the default std. analyzer)

Done in attached patch - thanks for catching this oversight. In constructing the test, I noticed that I had not brought over the analyzer package abbreviation logic from NewAnalyzerTask; this is now present in NewShingleAnalyzerTask, so that "analyzer:WhitespaceAnalyzer" is functional as a param.
[jira] Commented: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801651#action_12801651 ]

Steven Rowe commented on LUCENE-2218:
-------------------------------------

bq. I made a trivial change: shingleFilterTestCommon is implemented with assertTokenStreamContents, for better checking. It now recently does some good sanity checks for things like clearAttributes, even with save/restore state, etc. No change to the code, tests all still pass.

Cool, thanks. FYI, you named your patch LUCENE-2118.patch instead of LUCENE-2218.patch.
[jira] Issue Comment Edited: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801649#action_12801649 ]

Steven Rowe edited comment on LUCENE-2218 at 1/18/10 2:20 AM:
--------------------------------------------------------------

bq. hey, want to break the benchmark out into a separate jira issue for simplicity?

Done - see LUCENE-2223. Deleted benchmark patches from this issue.
[jira] Updated: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2218:
--------------------------------

    Attachment: (was: LUCENE-2218.benchmark.patch)
[jira] Updated: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2218:
--------------------------------

    Attachment: (was: LUCENE-2218.benchmark.patch)
[jira] Updated: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2218:
--------------------------------

    Attachment: (was: LUCENE-2218.benchmark.patch)
[jira] Commented: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801649#action_12801649 ]

Steven Rowe commented on LUCENE-2218:
-------------------------------------

bq. hey, want to break the benchmark out into a separate jira issue for simplicity?

Done - see LUCENE-2223.
[jira] Updated: (LUCENE-2223) ShingleFilter benchmark
[ https://issues.apache.org/jira/browse/LUCENE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2223:
--------------------------------

    Attachment: LUCENE-2223.patch

ShingleFilter benchmark patch attached. Use "ant shingle" to produce JIRA-table-formatted output.
[jira] Created: (LUCENE-2223) ShingleFilter benchmark
ShingleFilter benchmark --- Key: LUCENE-2223 URL: https://issues.apache.org/jira/browse/LUCENE-2223 Project: Lucene - Java Issue Type: New Feature Components: contrib/benchmark Affects Versions: 3.0 Reporter: Steven Rowe Priority: Minor Spawned from LUCENE-2218: a benchmark for ShingleFilter, along with a new task to instantiate (non-default-constructor) ShingleAnalyzerWrapper: NewShingleAnalyzerTask. The included shingle.alg runs ShingleAnalyzerWrapper, wrapping the default StandardAnalyzer, with 4 different configurations over 10,000 Reuters documents each. To allow ShingleFilter timings to be isolated from the rest of the pipeline, StandardAnalyzer is also run over the same set of Reuters documents. This set of 5 runs is then run 5 times. The patch includes two perl scripts, the first to output JIRA table formatted timing information, with the minimum elapsed time for each of the 4 ShingleAnalyzerWrapper runs and the StandardAnalyzer run, and the second to compare two runs' JIRA output, producing another JIRA table showing % improvement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801503#action_12801503 ] Steven Rowe commented on LUCENE-2218: - I think these patches are now ready to go. > ShingleFilter improvements > -- > > Key: LUCENE-2218 > URL: https://issues.apache.org/jira/browse/LUCENE-2218 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 3.0 >Reporter: Steven Rowe >Priority: Minor > Attachments: LUCENE-2218.benchmark.patch, > LUCENE-2218.benchmark.patch, LUCENE-2218.benchmark.patch, LUCENE-2218.patch > > > ShingleFilter should allow configuration of minimum shingle size (in addition > to maximum shingle size), so that it's possible to (e.g.) output only > trigrams instead of bigrams mixed with trigrams. The token separator used in > composing shingles should be configurable too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801502#action_12801502 ] Steven Rowe commented on LUCENE-2218: - New output from the fixed benchmark script - no change in the ShingleFilter patch: JAVA: java version "1.5.0_15" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04) Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode) OS: cygwin WinVistaService Pack 2 Service Pack 26060022202561 ||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement|| |2|no|5.03s|4.62s|2.18s|16.8%| |2|yes|5.20s|4.84s|2.18s|13.5%| |4|no|6.42s|5.70s|2.18s|20.5%| |4|yes|6.53s|5.89s|2.18s|17.3%| > ShingleFilter improvements > -- > > Key: LUCENE-2218 > URL: https://issues.apache.org/jira/browse/LUCENE-2218 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 3.0 >Reporter: Steven Rowe >Priority: Minor > Attachments: LUCENE-2218.benchmark.patch, > LUCENE-2218.benchmark.patch, LUCENE-2218.benchmark.patch, LUCENE-2218.patch > > > ShingleFilter should allow configuration of minimum shingle size (in addition > to maximum shingle size), so that it's possible to (e.g.) output only > trigrams instead of bigrams mixed with trigrams. The token separator used in > composing shingles should be configurable too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
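The "Improvement" column in the fixed output is consistent with measuring time saved relative to the shingle-only portion of the patched run, i.e. (unpatched − patched) / (patched − StandardAnalyzer). That formula is an inference from the numbers, not stated in the thread, but the following sketch reproduces all four rows of the table:

```java
// Sketch: reproduce the table's "Improvement" column, assuming the formula
//   improvement = (unpatched - patched) / (patched - standardAnalyzer)
// The StandardAnalyzer time is subtracted because it approximates the cost
// of the rest of the analysis pipeline, isolating ShingleFilter itself.
public class ShingleImprovement {
    static double improvementPct(double unpatched, double patched, double baseline) {
        return 100.0 * (unpatched - patched) / (patched - baseline);
    }

    public static void main(String[] args) {
        double[][] rows = {
            {5.03, 4.62, 2.18},  // max=2, no unigrams -> ~16.8%
            {5.20, 4.84, 2.18},  // max=2, unigrams    -> ~13.5%
            {6.42, 5.70, 2.18},  // max=4, no unigrams -> ~20.5%
            {6.53, 5.89, 2.18},  // max=4, unigrams    -> ~17.3%
        };
        for (double[] r : rows) {
            System.out.printf("%.1f%%%n", improvementPct(r[0], r[1], r[2]));
        }
    }
}
```

All four computed values round to the percentages in the JIRA table, which is what suggests this is the formula `compare.shingle.benchmark.tables.pl` uses.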
[jira] Updated: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2218: Attachment: LUCENE-2218.benchmark.patch In {{compare.shingle.benchmark.tables.pl}}, a missing decimal point caused overinflated improvement figures. Fixed in this patch. > ShingleFilter improvements > -- > > Key: LUCENE-2218 > URL: https://issues.apache.org/jira/browse/LUCENE-2218 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 3.0 > Reporter: Steven Rowe >Priority: Minor > Attachments: LUCENE-2218.benchmark.patch, > LUCENE-2218.benchmark.patch, LUCENE-2218.benchmark.patch, LUCENE-2218.patch > > > ShingleFilter should allow configuration of minimum shingle size (in addition > to maximum shingle size), so that it's possible to (e.g.) output only > trigrams instead of bigrams mixed with trigrams. The token separator used in > composing shingles should be configurable too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2218: Attachment: LUCENE-2218.benchmark.patch Output table produced by {{compare.shingle.benchmark.tables.pl}} now has "s" (for seconds) in the elapsed time columns > ShingleFilter improvements > -- > > Key: LUCENE-2218 > URL: https://issues.apache.org/jira/browse/LUCENE-2218 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 3.0 >Reporter: Steven Rowe >Priority: Minor > Attachments: LUCENE-2218.benchmark.patch, > LUCENE-2218.benchmark.patch, LUCENE-2218.patch > > > ShingleFilter should allow configuration of minimum shingle size (in addition > to maximum shingle size), so that it's possible to (e.g.) output only > trigrams instead of bigrams mixed with trigrams. The token separator used in > composing shingles should be configurable too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801341#action_12801341 ] Steven Rowe edited comment on LUCENE-2218 at 1/17/10 5:17 PM: -- The rewrite included some optimizations (e.g., no longer constructing n StringBuilders for every position in the input stream), and the performance is now modestly better - below is a comparison generated using the attached benchmark patch: JAVA: java version "1.5.0_15" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04) Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode) OS: cygwin WinVistaService Pack 2 Service Pack 26060022202561 ||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement|| |2|no|4.92s|4.74s|2.19s|7.5%| |2|yes|5.04s|4.90s|2.19s|5.6%| |4|no|6.21s|5.82s|2.19s|11.2%| |4|yes|6.41s|5.97s|2.19s|12.1%| was (Author: steve_rowe): The rewrite included some optimizations (e.g., no longer constructing n StringBuilders for every position in the input stream), and the performance is now modestly better - below is a comparison generated using the attached benchmark patch: JAVA: java version "1.5.0_15" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04) Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode) OS: cygwin WinVistaService Pack 2 Service Pack 26060022202561 ||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement|| |2|no|4.92|4.74|2.19|7.5%| |2|yes|5.04|4.90|2.19|5.6%| |4|no|6.21|5.82|2.19|11.2%| |4|yes|6.41|5.97|2.19|12.1%| > ShingleFilter improvements > -- > > Key: LUCENE-2218 > URL: https://issues.apache.org/jira/browse/LUCENE-2218 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 3.0 >Reporter: Steven Rowe >Priority: Minor > Attachments: LUCENE-2218.benchmark.patch, > LUCENE-2218.benchmark.patch, LUCENE-2218.patch > > > ShingleFilter should allow configuration of 
minimum shingle size (in addition > to maximum shingle size), so that it's possible to (e.g.) output only > trigrams instead of bigrams mixed with trigrams. The token separator used in > composing shingles should be configurable too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801341#action_12801341 ] Steven Rowe commented on LUCENE-2218: - The rewrite included some optimizations (e.g., no longer constructing n StringBuilders for every position in the input stream), and the performance is now modestly better - below is a comparison generated using the attached benchmark patch: JAVA: java version "1.5.0_15" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04) Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode) OS: cygwin WinVistaService Pack 2 Service Pack 26060022202561 ||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement|| |2|no|4.92|4.74|2.19|7.5%| |2|yes|5.04|4.90|2.19|5.6%| |4|no|6.21|5.82|2.19|11.2%| |4|yes|6.41|5.97|2.19|12.1%| > ShingleFilter improvements > -- > > Key: LUCENE-2218 > URL: https://issues.apache.org/jira/browse/LUCENE-2218 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 3.0 >Reporter: Steven Rowe >Priority: Minor > Attachments: LUCENE-2218.benchmark.patch, LUCENE-2218.patch > > > ShingleFilter should allow configuration of minimum shingle size (in addition > to maximum shingle size), so that it's possible to (e.g.) output only > trigrams instead of bigrams mixed with trigrams. The token separator used in > composing shingles should be configurable too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2218) ShingleFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2218: Attachment: LUCENE-2218.benchmark.patch LUCENE-2218.patch Patch implementing new features, and a patch for a new contrib/benchmark target "shingle", including a new task NewShingleAnalyzerTask. ShingleFilter is largely rewritten here in order to support the new configurable minimum shingle size. > ShingleFilter improvements > -- > > Key: LUCENE-2218 > URL: https://issues.apache.org/jira/browse/LUCENE-2218 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 3.0 >Reporter: Steven Rowe >Priority: Minor > Attachments: LUCENE-2218.benchmark.patch, LUCENE-2218.patch > > > ShingleFilter should allow configuration of minimum shingle size (in addition > to maximum shingle size), so that it's possible to (e.g.) output only > trigrams instead of bigrams mixed with trigrams. The token separator used in > composing shingles should be configurable too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2218) ShingleFilter improvements
ShingleFilter improvements -- Key: LUCENE-2218 URL: https://issues.apache.org/jira/browse/LUCENE-2218 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 3.0 Reporter: Steven Rowe Priority: Minor ShingleFilter should allow configuration of minimum shingle size (in addition to maximum shingle size), so that it's possible to (e.g.) output only trigrams instead of bigrams mixed with trigrams. The token separator used in composing shingles should be configurable too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
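To make the requested behavior concrete: with minimum shingle size equal to maximum shingle size (e.g. both 3) and unigram output disabled, only trigrams are emitted. The toy generator below is not Lucene's ShingleFilter implementation (the real one works incrementally over a TokenStream); it just illustrates the semantics of the three proposed knobs — min size, max size, and token separator:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration (NOT Lucene's implementation) of the requested
// ShingleFilter options: minimum shingle size, maximum shingle size,
// configurable token separator, and optional unigram output.
public class ShingleSketch {
    static List<String> shingles(String[] tokens, int min, int max,
                                 String sep, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (outputUnigrams) out.add(tokens[i]);
            // Emit every n-gram of size min..max starting at position i.
            for (int n = min; n <= max && i + n <= tokens.length; n++) {
                StringBuilder sb = new StringBuilder(tokens[i]);
                for (int j = 1; j < n; j++) sb.append(sep).append(tokens[i + j]);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = {"please", "divide", "this", "sentence"};
        // min == max == 3, no unigrams: trigrams only.
        System.out.println(shingles(tokens, 3, 3, " ", false));
        // -> [please divide this, divide this sentence]
    }
}
```

With the pre-patch filter, min size was fixed at 2, so bigrams were always mixed in; the patch's configurable minimum is what makes the trigrams-only output above possible.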
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799056#action_12799056 ] Steven Rowe commented on LUCENE-2181: - +1, once again, tests all pass, and "ant collation" produced expected output. > benchmark for collation > --- > > Key: LUCENE-2181 > URL: https://issues.apache.org/jira/browse/LUCENE-2181 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, > LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, > top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2 > > > Steven Rowe attached a contrib/benchmark-based benchmark for collation (both > jdk and icu) under LUCENE-2084, along with some instructions to run it... > I think it would be a nice if we could turn this into a committable patch and > add it to benchmark. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798679#action_12798679 ] Steven Rowe commented on LUCENE-2181: - +1, tests all pass, and "ant collation" produced expected output. One minor detail, though - shouldn't the output files be renamed to identify their purpose, similarly to how you renamed bm2jira.pl? Here's the relevant section in {{contrib/benchmark/build.txt}}: {code:xml} {code} > benchmark for collation > --- > > Key: LUCENE-2181 > URL: https://issues.apache.org/jira/browse/LUCENE-2181 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, > LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, > top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2 > > > Steven Rowe attached a contrib/benchmark-based benchmark for collation (both > jdk and icu) under LUCENE-2084, along with some instructions to run it... > I think it would be a nice if we could turn this into a committable patch and > add it to benchmark. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798590#action_12798590 ] Steven Rowe commented on LUCENE-2181: - {quote} Steven I also havent forgotten about your other contribution, the thing that creates the benchmark corpus in the first place from wikipedia. One idea I had would be that such a thing wouldn't be too out of place in the open relevance project... (munging corpora etc) {quote} Interesting idea, thanks - I'll take a look at what's there now and see how my stuff would fit in. > benchmark for collation > --- > > Key: LUCENE-2181 > URL: https://issues.apache.org/jira/browse/LUCENE-2181 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, > LUCENE-2181.patch, LUCENE-2181.patch, > top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2 > > > Steven Rowe attached a contrib/benchmark-based benchmark for collation (both > jdk and icu) under LUCENE-2084, along with some instructions to run it... > I think it would be a nice if we could turn this into a committable patch and > add it to benchmark. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798589#action_12798589 ] Steven Rowe commented on LUCENE-2181: - I think NewCollationAnalyzerTask should be a little more careful about parsing its parameters - here's a slightly modified version of your setParams() that understands "impl:jdk" and complains about unrecognized params: {code:java} @Override public void setParams(String params) { super.setParams(params); StringTokenizer st = new StringTokenizer(params, ","); while (st.hasMoreTokens()) { String param = st.nextToken(); StringTokenizer expr = new StringTokenizer(param, ":"); String key = expr.nextToken(); String value = expr.nextToken(); // for now we only support the "impl" parameter. // TODO: add strength, decomposition, etc if (key.equals("impl")) { if (value.equalsIgnoreCase("icu")) impl = Implementation.ICU; else if (value.equalsIgnoreCase("jdk")) impl = Implementation.JDK; else throw new RuntimeException("Unknown parameter " + param); } else { throw new RuntimeException("Unknown parameter " + param); } } } {code} > benchmark for collation > --- > > Key: LUCENE-2181 > URL: https://issues.apache.org/jira/browse/LUCENE-2181 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, > LUCENE-2181.patch, LUCENE-2181.patch, > top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2 > > > Steven Rowe attached a contrib/benchmark-based benchmark for collation (both > jdk and icu) under LUCENE-2084, along with some instructions to run it... > I think it would be a nice if we could turn this into a committable patch and > add it to benchmark. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798588#action_12798588 ] Steven Rowe commented on LUCENE-2181: - I just ran the contrib/benchmark tests, and I got one test failure: {noformat} [junit] Testcase: testReadTokens(org.apache.lucene.benchmark.byTask.TestPerfTasksLogic):FAILED [junit] expected:<3108> but was:<3128> [junit] junit.framework.AssertionFailedError: expected:<3108> but was:<3128> [junit] at org.apache.lucene.benchmark.byTask.TestPerfTasksLogic.testReadTokens(TestPerfTasksLogic.java:480) [junit] at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:212) [junit] [junit] [junit] Test org.apache.lucene.benchmark.byTask.TestPerfTasksLogic FAILED {noformat} > benchmark for collation > --- > > Key: LUCENE-2181 > URL: https://issues.apache.org/jira/browse/LUCENE-2181 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, > LUCENE-2181.patch, LUCENE-2181.patch, > top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2 > > > Steven Rowe attached a contrib/benchmark-based benchmark for collation (both > jdk and icu) under LUCENE-2084, along with some instructions to run it... > I think it would be a nice if we could turn this into a committable patch and > add it to benchmark. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798584#action_12798584 ] Steven Rowe commented on LUCENE-2181: - Works for me: JAVA: java version "1.5.0_15" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04) Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode) OS: cygwin WinVistaService Pack 2 Service Pack 26060022202561 ||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement|| |English|5.53s|2.03s|1.20s|422%| |French|6.41s|2.13s|1.19s|455%| |German|6.36s|2.19s|1.22s|430%| |Ukrainian|8.92s|3.62s|1.21s|220%| > benchmark for collation > --- > > Key: LUCENE-2181 > URL: https://issues.apache.org/jira/browse/LUCENE-2181 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch, > LUCENE-2181.patch, LUCENE-2181.patch, > top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2 > > > Steven Rowe attached a contrib/benchmark-based benchmark for collation (both > jdk and icu) under LUCENE-2084, along with some instructions to run it... > I think it would be a nice if we could turn this into a committable patch and > add it to benchmark. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
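The "ICU4J Improvement" column is consistent with first subtracting the KeywordAnalyzer time (the tokenization overhead common to both runs) and then comparing the remaining collation-only times. That formula is inferred from the table rather than stated in the thread, but it reproduces all four language rows:

```java
// Sketch: the "ICU4J Improvement" column appears to subtract the
// KeywordAnalyzer baseline before comparing implementations, assuming
//   improvement = (jdk - keyword) / (icu - keyword) - 1
// i.e. how much faster ICU's collation-only time is than java.text's.
public class CollationImprovement {
    static double improvementPct(double jdk, double icu, double keyword) {
        return 100.0 * ((jdk - keyword) / (icu - keyword) - 1.0);
    }

    public static void main(String[] args) {
        System.out.printf("English:   %.0f%%%n", improvementPct(5.53, 2.03, 1.20)); // ~422%
        System.out.printf("French:    %.0f%%%n", improvementPct(6.41, 2.13, 1.19)); // ~455%
        System.out.printf("German:    %.0f%%%n", improvementPct(6.36, 2.19, 1.22)); // ~430%
        System.out.printf("Ukrainian: %.0f%%%n", improvementPct(8.92, 3.62, 1.21)); // ~220%
    }
}
```

Subtracting the KeywordAnalyzer baseline matters: without it, the English row would show only about a 172% improvement, understating the gap between the two collation implementations.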
[jira] Updated: (LUCENE-2200) Several final classes have non-overriding protected members
[ https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2200: Attachment: LUCENE-2200.patch bq. Robert, when you commit this make sure you mark the Attributes in EdgeNGramTokenFilter.java final thanks. Whoops, I missed those - thanks for checking, Simon. (minGram and maxGram can also be final in EdgeNGramTokenFilter.java.) I've attached a new patch that includes these changes -- all tests pass. > Several final classes have non-overriding protected members > --- > > Key: LUCENE-2200 > URL: https://issues.apache.org/jira/browse/LUCENE-2200 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 3.0 > Reporter: Steven Rowe >Assignee: Robert Muir >Priority: Trivial > Attachments: LUCENE-2200.patch, LUCENE-2200.patch, LUCENE-2200.patch > > > Protected member access in final classes, except where a protected method > overrides a superclass's protected method, makes little sense. The attached > patch converts final classes' protected access on fields to private, removes > two final classes' unused protected constructors, and converts one final > class's protected final method to private. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798520#action_12798520 ] Steven Rowe commented on LUCENE-2181: - bq. Steven, another idea: what if we simply added the options to DocMaker so we could turn off the tokenization of title and date fields? Good idea! bq. i'll update the alg file and produce a new patch Excellent, thanks! > benchmark for collation > --- > > Key: LUCENE-2181 > URL: https://issues.apache.org/jira/browse/LUCENE-2181 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2181.patch, LUCENE-2181.patch, > top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2 > > > Steven Rowe attached a contrib/benchmark-based benchmark for collation (both > jdk and icu) under LUCENE-2084, along with some instructions to run it... > I think it would be a nice if we could turn this into a committable patch and > add it to benchmark. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798519#action_12798519 ] Steven Rowe edited comment on LUCENE-2181 at 1/10/10 5:56 PM: -- bq. What about this per-field thing, what if in the data files, title and date were simply blank? Hmm, although the date field value is meaningless, I like the TF-in-title-field thing. {quote} Or should we worry, I agree its stupid, does it skew the results though? One way to look at it is that its also fairly realistic (even though its meaningless, you see numbers and dates everywhere). {quote} I was thinking that it would, and that it's not really a meaningful test of collation - who's going to bother running collation over integers and dates? - but since the comparison here is between two implementations of collation, I think you're right that there is no skew in doing this comparison: {panel} icu(kiwi) + icu(apple) + icu(orange) : jdk(kiwi) + jdk(apple) + jdk(orange) {panel} instead of this one: {panel} keyword(kiwi) + keyword(apple) + icu(orange) : keyword(kiwi) + keyword(apple) + jdk(orange) {panel} (where the icu(X) transform = keyword(X) + icu-collation(X), and similarly for the jdk(X) transform) bq. The downside to doing per-analyzer wrapper is that it introduces some complexity, in all honesty this is not really specific to this collation task, right? (i.e. the existing analysis/tokenization benchmarks have this same problem) Yup, you're right. A general facility to do this will end up looking (modulo syntax) like Solr's per-field analysis specification. was (Author: steve_rowe): bq. What about this per-field thing, what if in the data files, title and date were simply blank? Hmm, although the date field value is meaningless, I like the TF-in-title-field thing. {quote} Or should we worry, I agree its stupid, does it skew the results though? 
One way to look at it is that its also fairly realistic (even though its meaningless, you see numbers and dates everywhere). {quote} I was thinking that it would, and that it's not really a meaningful test of collation - who's going to bother running collation over integers and dates? - but since the comparison here is between two implementations of collation, I think you're right that there is no skew in doing this comparison: {panel} icu(kiwi) + icu(apple) + (icu(orange) : jdk(kiwi) + jdk(apple) + jdk(orange) {panel} instead of this one: {panel} keyword(kiwi) + keyword(apple) + icu(orange) : keyword(kiwi) + keyword(apple) + jdk(orange) {panel} (where the icu(X) transform = keyword(X) + icu-collation(X), and similarly for the jdk(X) transform) bq. The downside to doing per-analyzer wrapper is that it introduces some complexity, in all honesty this is not really specific to this collation task, right? (i.e. the existing analysis/tokenization benchmarks have this same problem) Yup, you're right. A general facility to do this will end up looking (modulo syntax) like Solr's per-field analysis specification. > benchmark for collation > --- > > Key: LUCENE-2181 > URL: https://issues.apache.org/jira/browse/LUCENE-2181 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2181.patch, LUCENE-2181.patch, > top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2 > > > Steven Rowe attached a contrib/benchmark-based benchmark for collation (both > jdk and icu) under LUCENE-2084, along with some instructions to run it... > I think it would be a nice if we could turn this into a committable patch and > add it to benchmark. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798519#action_12798519 ] Steven Rowe commented on LUCENE-2181: - bq. What about this per-field thing, what if in the data files, title and date were simply blank? Hmm, although the date field value is meaningless, I like the TF-in-title-field thing. {quote} Or should we worry, I agree its stupid, does it skew the results though? One way to look at it is that its also fairly realistic (even though its meaningless, you see numbers and dates everywhere). {quote} I was thinking that it would, and that it's not really a meaningful test of collation - who's going to bother running collation over integers and dates? - but since the comparison here is between two implementations of collation, I think you're right that there is no skew in doing this comparison: {panel} icu(kiwi) + icu(apple) + (icu(orange) : jdk(kiwi) + jdk(apple) + jdk(orange) {panel} instead of this one: {panel} keyword(kiwi) + keyword(apple) + icu(orange) : keyword(kiwi) + keyword(apple) + jdk(orange) {panel} (where the icu(X) transform = keyword(X) + icu-collation(X), and similarly for the jdk(X) transform) bq. The downside to doing per-analyzer wrapper is that it introduces some complexity, in all honesty this is not really specific to this collation task, right? (i.e. the existing analysis/tokenization benchmarks have this same problem) Yup, you're right. A general facility to do this will end up looking (modulo syntax) like Solr's per-field analysis specification. 
> benchmark for collation > --- > > Key: LUCENE-2181 > URL: https://issues.apache.org/jira/browse/LUCENE-2181 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2181.patch, LUCENE-2181.patch, > top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2 > > > Steven Rowe attached a contrib/benchmark-based benchmark for collation (both > jdk and icu) under LUCENE-2084, along with some instructions to run it... > I think it would be a nice if we could turn this into a committable patch and > add it to benchmark. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798508#action_12798508 ] Steven Rowe commented on LUCENE-2181: - Looks good. I like the way you've integrated it into the benchmark suite, and as you say the NewLocaleTask should prove useful elsewhere. bq. I put the files in my apache directory, but modified your patch somewhat One major thing you changed but didn't mention above is that rather than applying the collation key transform only to the LineDoc body field, it's now applied also to the title and date fields. Given the nature of the top 100k words files -- the title is an integer representing term frequency, and the date is essentially meaningless (the date on which I created the file) -- I don't think this makes sense (and that's why I made analyzers that only applied collation to the body field).
[jira] Updated: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2181: Attachment: top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2 LUCENE-2181.patch Hi Robert, In the new version of the patch, {{ant benchmark}} from the {{contrib/icu/}} directory attempts to download the attached {{tar.bz2}} file from {{http://people.apache.org/~rmuir/wikipedia}} (*please change this to the location where you end up putting the file*), then unpacks the archive to the {{contrib/icu/src/benchmark/work/}} directory, then compiles and runs the benchmark. In addition to the top 100K word lists, the {{tar.bz2}} file includes {{LICENSE.txt}}, which contains links to the Wikipedia dumps from which the lists were extracted, along with a link to the license that Wikipedia uses.
[jira] Updated: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2181: Attachment: (was: LUCENE-2181.patch.zip)
[jira] Commented: (LUCENE-2200) Several final classes have non-overriding protected members
[ https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798396#action_12798396 ] Steven Rowe commented on LUCENE-2200: - FYI, all tests pass for me with the new version of the patch applied. > Several final classes have non-overriding protected members > --- > > Key: LUCENE-2200 > URL: https://issues.apache.org/jira/browse/LUCENE-2200 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 3.0 > Reporter: Steven Rowe >Priority: Trivial > Attachments: LUCENE-2200.patch, LUCENE-2200.patch > > > Protected member access in final classes, except where a protected method > overrides a superclass's protected method, makes little sense. The attached > patch converts final classes' protected access on fields to private, removes > two final classes' unused protected constructors, and converts one final > class's protected final method to private.
[jira] Updated: (LUCENE-2200) Several final classes have non-overriding protected members
[ https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2200: Attachment: LUCENE-2200.patch bq. Could we make some of the member vars final too? Done in the new version of the patch. Note that I only changed member access in the classes already modified by the previous version of the patch; I didn't search other final classes for further candidates.
[jira] Commented: (LUCENE-2200) Several final classes have non-overriding protected members
[ https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798370#action_12798370 ] Steven Rowe commented on LUCENE-2200: - All tests pass with the attached patch applied.
[jira] Updated: (LUCENE-2200) Several final classes have non-overriding protected members
[ https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2200: Attachment: LUCENE-2200.patch
[jira] Created: (LUCENE-2200) Several final classes have non-overriding protected members
Several final classes have non-overriding protected members --- Key: LUCENE-2200 URL: https://issues.apache.org/jira/browse/LUCENE-2200 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0 Reporter: Steven Rowe Priority: Trivial Protected member access in final classes, except where a protected method overrides a superclass's protected method, makes little sense. The attached patch converts final classes' protected access on fields to private, removes two final classes' unused protected constructors, and converts one final class's protected final method to private.
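The kind of change the issue describes can be illustrated with a small, hypothetical class (not taken from the Lucene source):

```java
// Hypothetical example: since no class can extend a final class, "protected"
// on its members grants nothing beyond package-private access and just
// muddies the intent; "private" says what is actually meant.
public final class Counter {
    private int count;        // was effectively "protected int count;"
    private final int step;   // members never reassigned can also be final

    public Counter(int step) {
        this.step = step;
    }

    public int increment() {
        count += step;
        return count;
    }
}
```

With no possible subclasses, narrowing the access modifiers is purely a readability change; no caller outside the class could have relied on the protected access.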
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796181#action_12796181 ] Steven Rowe commented on LUCENE-2181: - {quote} bq. ... these four files don't have Apache2 license declarations in them. We should put a README (or something like it) with these files to indicate the license. Are they really apache license? or derived from wikipedia content?... I don't think we should be putting apache license headers in these files {quote} Hmm, I just assumed that since these files were not (anything even close to) verbatim copies that they were independently licensable new works, but it's definitely more complicated than that... This looks like the place to start where licensing is concerned: http://en.wikipedia.org/wiki/Wikipedia_Copyright My (way non-expert) reading of this is that Wikipedia-derived works (and I'm pretty sure these frequency lists qualify as such) must be licensed under the [Creative Commons Attribution-Share Alike 3.0 Unported license|http://creativecommons.org/licenses/by-sa/3.0/], which does not appear to me to be entirely compatible with the Apache2 license. So I agree with you :) - with the caveat that some form of attribution and a pointer to licensing info should be included with these files.
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796017#action_12796017 ] Steven Rowe commented on LUCENE-2181: - Works for me. I do have one concern, though: the LineDocSource parser doesn't know how to handle comments, so these four files don't have Apache2 license declarations in them. We should put a README (or something like it) with these files to indicate the license. Different subject: I'm not sure where it would go, but the code I used to produce these top-TF wikipedia files may be useful to other people - where do you think it could live? An example, maybe?
[jira] Updated: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2181: Attachment: LUCENE-2181.patch.zip Attached .zip'd patch (over 10MB because of the 4 languages' LineDocs) integrated into the Ant build for the ICU contrib, rather than integrated into the Benchmark build. Invoke using {{ant benchmark}} from the {{contrib/icu/}} directory.
[jira] Commented: (LUCENE-2185) add @Deprecated annotations
[ https://issues.apache.org/jira/browse/LUCENE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795340#action_12795340 ] Steven Rowe commented on LUCENE-2185: - The justification for using @Deprecated, AFAICT, is that conforming compilers are required to issue warnings for each so-annotated class/method, where compilers are *not* required to issue warnings for javadoc @deprecated tags, and although Sun compilers do this, other vendors' compilers might not. Another (similarly theoretical) argument in favor of using @Deprecated annotations is that, unlike @deprecated javadoc tags, this annotation is available via runtime reflection. A random information point: MYFACES-2135 removed all @Deprecated annotations from MyFaces code because an apparent bug in the Sun TCK flags methods bearing this annotation as changing method signatures.
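The runtime-reflection point is easy to demonstrate: the annotation is retained in the class file and queryable at runtime, whereas a javadoc @deprecated tag disappears at compile time. A small illustrative sketch (the method names are made up for the example):

```java
import java.lang.reflect.Method;

public class DeprecationDemo {
    /** @deprecated use {@link #newApi()} instead */
    @Deprecated
    public static void oldApi() {}

    public static void newApi() {}

    // Returns true if the named zero-argument method carries @Deprecated.
    // Only the annotation is visible here; the javadoc tag alone would
    // leave this returning false.
    public static boolean isDeprecated(String methodName) throws NoSuchMethodException {
        Method m = DeprecationDemo.class.getDeclaredMethod(methodName);
        return m.isAnnotationPresent(Deprecated.class);
    }
}
```

This is what lets tools (code analyzers, dependency checkers) flag deprecated usage without access to source or javadoc.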
[jira] Commented: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795337#action_12795337 ] Steven Rowe commented on LUCENE-2084: - {quote} bq. 3. Unlike getEncodedLength(byte[],int,int), getDecodedLength(char[],int,int) doesn't protect against overflow in the int multiplication by casting to long. #3 concerns me somewhat, this is an existing problem in trunk (i guess only for enormous terms, though). Should we consider backporting a fix? {quote} The current form of this calculation will correctly handle original binary content of lengths up to 136MB. IMHO the likelihood of encoding terms this enormous with IndexableBinaryStringTools is so miniscule that it's not worth the effort to backport. > remove Byte/CharBuffer wrapping for collation key generation > > > Key: LUCENE-2084 > URL: https://issues.apache.org/jira/browse/LUCENE-2084 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, > LUCENE-2084.patch, LUCENE-2084.patch, LUCENE-2084.patch, LUCENE-2084.patch, > LUCENE-2084.patch, TopTFWikipediaWords.tar.bz2 > > > We can remove the overhead of ByteBuffer and CharBuffer wrapping in > CollationKeyFilter and ICUCollationKeyFilter. > this patch moves the logic in IndexableBinaryStringTools into char[],int,int > and byte[],int,int based methods, with the previous Byte/CharBuffer methods > delegating to these. > Previously, the Byte/CharBuffer methods required a backing array anyway.
[jira] Commented: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794762#action_12794762 ] Steven Rowe commented on LUCENE-2084: - Hi Robert, I took a look at the patch and found a few minor issues:
# The newly deprecated methods should get @Deprecated annotations (in addition to the @deprecated javadoc tags)
# IntelliJ tells me that the "final" modifier on some of the public static methods is not cool - AFAICT, although a static method cannot be overridden anyway, it may be useful to leave the modifier, since unlike the static modifier, the final modifier disallows hiding of the method by subclasses? I dunno. (Checking Lucene source, there are many "static final" methods, so maybe I should tell IntelliJ it's not a problem.)
# Unlike getEncodedLength(byte[],int,int), getDecodedLength(char[],int,int) doesn't protect against overflow in the int multiplication by casting to long.
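Point 3 concerns int overflow in a length calculation of the form {{length * bits / bitsPerChar}}. A hedged sketch of the difference; the constants below are illustrative, not the actual IndexableBinaryStringTools encoding ratio:

```java
public class LengthDemo {
    // Overflows for large inputs: length * 8 exceeds Integer.MAX_VALUE
    // once length passes ~268M, wrapping to a negative value.
    public static int unsafeEncodedLength(int length) {
        return length * 8 / 15;
    }

    // Widening to long before the multiply avoids the overflow; the
    // final result still fits comfortably in an int.
    public static int safeEncodedLength(int length) {
        return (int) ((long) length * 8 / 15);
    }
}
```

For typical term lengths both versions agree, which is why the bug only bites on enormous inputs.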
[jira] Commented: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794593#action_12794593 ] Steven Rowe commented on LUCENE-2084: - bq. Do you think after this issue is resolved (whether it helps or doesn't help/won't fix either way) that we should open a separate issue to work on committing the benchmark so we have collation benchmarks for the future? Seems like it would be good to have in the future. Yeah, I don't know quite how to do that - the custom code wrapping ICU/CollationKeyAnalyzer is necessary because the contrib benchmark alg format can only handle zero-argument analyzer constructors (or those that take Version arguments). I think it would be useful to have a general-purpose alg syntax to handle this case without requiring custom code, and also, more generally, to allow for the construction of arbitrary analysis pipelines without requiring custom code (a la Solr schema). The alg parsing code is a bit dense though - I think it could be converted to a JFlex-generated grammar to simplify this kind of syntax extension. Can you think of an alternate way to package this benchmark that fits with current practice?
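The zero-argument-constructor restriction mentioned above comes from instantiating analyzers reflectively by class name. A simplified sketch of that pattern (not the actual benchmark alg parser code):

```java
public class AnalyzerFactoryDemo {
    // Instantiate a class by name via its zero-argument constructor --
    // the pattern an alg-style config format relies on, and the reason
    // analyzers needing collator arguments must first be wrapped in a
    // custom class exposing a no-arg constructor.
    public static Object newInstance(String className) throws Exception {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }
}
```

Anything beyond this (constructor arguments, chained token filters) needs either custom wrapper classes or a richer config syntax, which is the gap a Solr-schema-like facility would fill.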
[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2084: Attachment: collation.benchmark.tar.bz2 Fixed up version of {{collation.benchmark.tar.bz2}} that removes printing of progress from the {{collation/run-benchmark.sh}} script - otherwise the same as the previous version.
[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2084: Attachment: (was: collation.benchmark.tar.bz2)
[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2084: Attachment: TopTFWikipediaWords.tar.bz2 TopTFWikipediaWords.tar.bz2 contains a Maven2 project that parses unpacked Wikipedia dump files, creates a Lucene index from the tokens produced by the contrib WikipediaTokenizer, iterates over the indexed tokens' term docs to accumulate term frequencies, stores the results in a bounded priority queue, and then outputs contrib benchmark LineDoc format, with the title field containing the collection term frequency, the date field containing the date the file was generated, and the body containing the term text. This code knows how to handle English, German, French, and Ukrainian, but could be extended for other languages. I used this project to generate the line-docs for the 4 languages' 100k most frequent terms, in the collation benchmark archive attachment on this issue.
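The bounded-priority-queue step can be sketched as a min-heap capped at k, so the least frequent term is evicted whenever the queue overflows; the class and method names below are illustrative, not from the attached project:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTermsDemo {
    // Keep only the k most frequent terms: offer every (term, freq) pair
    // into a min-heap ordered by frequency and evict the smallest entry
    // once the heap exceeds k. Memory stays O(k) regardless of vocabulary size.
    public static List<String> topK(Map<String, Long> freqs, int k) {
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> e : freqs.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // drop the least frequent term seen so far
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result); // most frequent first
        return result;
    }
}
```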
[jira] Commented: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794590#action_12794590 ] Steven Rowe commented on LUCENE-2084: - Here are the unpatched results I got - these look quite similar to the results I posted from a custom (non-contrib-benchmark) benchmark in [the description of LUCENE-1719|https://issues.apache.org/jira/browse/LUCENE-1719#description-open] : ||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement|| |1.5.0_15 (32-bit)|English|9.00s|4.89s|2.90s|207%| |1.5.0_15 (32-bit)|French|10.64s|5.12s|2.95s|254%| |1.5.0_15 (32-bit)|German|10.19s|5.19s|2.97s|225%| |1.5.0_15 (32-bit)|Ukrainian|13.66s|7.20s|2.96s|152%| ||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement|| |1.5.0_15 (64-bit)|English|5.97s|2.55s|1.50s|326%| |1.5.0_15 (64-bit)|French|6.86s|2.74s|1.56s|349%| |1.5.0_15 (64-bit)|German|6.85s|2.76s|1.59s|350%| |1.5.0_15 (64-bit)|Ukrainian|9.56s|4.01s|1.56s|227%| ||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement|| |1.6.0_13 (64-bit)|English|3.04s|2.06s|1.07s|99%| |1.6.0_13 (64-bit)|French|3.58s|2.04s|1.14s|171%| |1.6.0_13 (64-bit)|German|3.35s|2.22s|1.14s|105%| |1.6.0_13 (64-bit)|Ukrainian|4.48s|2.94s|1.21s|89%| Here are the results after applying the synced-to-trunk patch: ||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement|| |1.5.0_15 (32-bit)|English|8.73s|4.61s|2.90s|241%| |1.5.0_15 (32-bit)|French|10.38s|4.87s|2.94s|285%| |1.5.0_15 (32-bit)|German|9.95s|4.94s|2.97s|254%| |1.5.0_15 (32-bit)|Ukrainian|13.37s|6.91s|2.90s|161%| ||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement|| |1.5.0_15 (64-bit)|English|5.78s|2.65s|1.57s|290%| |1.5.0_15 (64-bit)|French|6.74s|2.74s|1.64s|364%| |1.5.0_15 (64-bit)|German|6.69s|2.86s|1.66s|319%| |1.5.0_15 (64-bit)|Ukrainian|9.40s|4.18s|1.62s|204%| ||Sun JVM||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement|| |1.6.0_13 
(64-bit)|English|3.06s|1.82s|1.09s|170%| |1.6.0_13 (64-bit)|French|3.36s|1.88s|1.16s|206%| |1.6.0_13 (64-bit)|German|3.40s|1.95s|1.14s|179%| |1.6.0_13 (64-bit)|Ukrainian|4.33s|2.65s|1.21s|117%| And here is a comparison of the two: ||Sun JVM||Language||java.text improvement||ICU4J improvement|| |1.5.0_15 (32-bit)|English|5.1%|16.8%| |1.5.0_15 (32-bit)|French|3.8%|12.9%| |1.5.0_15 (32-bit)|German|3.9%|13.1%| |1.5.0_15 (32-bit)|Ukrainian|2.6%|6.2%| ||Sun JVM||Language||java.text improvement||ICU4J improvement|| |1.5.0_15 (64-bit)|English|6.6%|-2.2%| |1.5.0_15 (64-bit)|French|4.4%|7.7%| |1.5.0_15 (64-bit)|German|5.0%|-2.0%| |1.5.0_15 (64-bit)|Ukrainian|3.3%|-3.7%| ||Sun JVM||Language||java.text improvement||ICU4J improvement|| |1.6.0_13 (64-bit)|English|0.5%|36.1%| |1.6.0_13 (64-bit)|French|11.4%|25.5%| |1.6.0_13 (64-bit)|German|-1.7%|33.8%| |1.6.0_13 (64-bit)|Ukrainian|5.3%|20.6%| It's not unequivocal, but there is a definite overall improvement in the patched version; I'd say these results justify applying the patch. I won't post them here, (mostly because I didn't save them :) ) but I've run the same benchmark (with some variation in the number of iterations) and noticed that while there are always a couple of places where the unpatched version appears to do slightly better, the place at which this occurs is not consistent, and the cases where the patched version improves throughput always dominate. > remove Byte/CharBuffer wrapping for collation key generation > > > Key: LUCENE-2084 > URL: https://issues.apache.org/jira/browse/LUCENE-2084 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, > LUCENE-2084.patch > > > We can remove the overhead of ByteBuffer and CharBuffer wrapping in > CollationKeyFilter and ICUCollationKeyFilter. 
> this patch moves the logic in IndexableBinaryStringTools into char[],int,int > and byte[],int,int based methods, with the previous Byte/CharBuffer methods > delegating to these. > Previously, the Byte/CharBuffer methods required a backing array anyway. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
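The "ICU4J Improvement" columns in the tables above can be reproduced by first subtracting the KeywordAnalyzer time (the cost of tokenization alone) from both analyzers' times, then taking the ratio: e.g. (9.00s - 2.90s) / (4.89s - 2.90s) ≈ 3.07, i.e. 207%. A sketch of that arithmetic (the formula is inferred from the tables, not taken from the benchmark scripts):

```java
// Inferred reconstruction of the "ICU4J Improvement" column: both the
// java.text and ICU4J times include the cost of tokenization itself, so the
// KeywordAnalyzer (tokenization-only) time is subtracted out before comparing.
public class CollationImprovement {
    /** Percent improvement of icuSecs over javaTextSecs, net of the baseline keywordSecs. */
    static double improvementPct(double javaTextSecs, double icuSecs, double keywordSecs) {
        return ((javaTextSecs - keywordSecs) / (icuSecs - keywordSecs) - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // First row of the unpatched 1.5.0_15 (32-bit) table: English.
        System.out.printf("ICU4J improvement: %.0f%%%n",
                improvementPct(9.00, 4.89, 2.90)); // ~207%, matching the table
    }
}
```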
[jira] Commented: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794588#action_12794588 ] Steven Rowe commented on LUCENE-2084: - To run the benchmark:
# Unpack {{collation.benchmark.tar.bz2}} in a full Lucene-java tree into the {{contrib/benchmark/}} directory. All contents will be put under a new directory named {{collation/}}.
# Compile and jarify the localized (ICU)CollationKeyAnalyzer code: from the {{collation/}} directory, run the script {{build.test.analyzer.jar.sh}}.
# From an unpatched {{java/trunk/}}, build Lucene's jars: {{ant clean package}}.
# From the {{contrib/benchmark/}} directory, run the collation benchmark: {{collation/run-benchmark.sh > unpatched.collation.bm.table.txt}}
# Apply the attached patch to the Lucene-java tree.
# From {{java/trunk/}}, build Lucene's jars: {{ant clean package}}.
# From the {{contrib/benchmark/}} directory, run the collation benchmark: {{collation/run-benchmark.sh > patched.collation.bm.table.txt}}
# Produce the comparison table: {{collation/compare.collation.benchmark.tables.pl unpatched.collation.bm.table.txt patched.collation.bm.table.txt > collation.diff.bm.table.txt}}
> remove Byte/CharBuffer wrapping for collation key generation > > > Key: LUCENE-2084 > URL: https://issues.apache.org/jira/browse/LUCENE-2084 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, > LUCENE-2084.patch > > > We can remove the overhead of ByteBuffer and CharBuffer wrapping in > CollationKeyFilter and ICUCollationKeyFilter. > this patch moves the logic in IndexableBinaryStringTools into char[],int,int > and byte[],int,int based methods, with the previous Byte/CharBuffer methods > delegating to these. > Previously, the Byte/CharBuffer methods required a backing array anyway. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2084: Attachment: collation.benchmark.tar.bz2 Attached collation.benchmark.tar.bz2, which contains everything needed to run an analysis-only contrib benchmark for the (ICU)CollationKeyAnalyzers over 4 languages: English, French, German, and Ukrainian. Included are:
# For each language, a line-doc containing the most frequent 100K words from a corresponding Wikipedia dump from November 2009;
# For each language, Java code for a no-argument analyzer callable from a benchmark alg, that specializes (ICU)CollationKeyAnalyzer and uses PerFieldAnalyzerWrapper to only run it over the line-doc body field
# A script to compile and jarify the above analyzers
# A benchmark alg running 5 iterations of 10 repetitions of analysis only over the line-doc for each language
# A script to find the minimum elapsed time for each combination, and output the results as a JIRA table
# A script to run the previous two scripts once for each of three JDK versions
# A script to compare the output of the above script before and after applying the attached patch removing Char/ByteBuffer wrapping, and output the result as a JIRA table
> remove Byte/CharBuffer wrapping for collation key generation > > > Key: LUCENE-2084 > URL: https://issues.apache.org/jira/browse/LUCENE-2084 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch, > LUCENE-2084.patch > > > We can remove the overhead of ByteBuffer and CharBuffer wrapping in > CollationKeyFilter and ICUCollationKeyFilter. > this patch moves the logic in IndexableBinaryStringTools into char[],int,int > and byte[],int,int based methods, with the previous Byte/CharBuffer methods > delegating to these. 
> Previously, the Byte/CharBuffer methods required a backing array anyway. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
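The patch's structure described above (keeping the existing Byte/CharBuffer entry points but having them delegate to new array/offset/length methods) is a common pattern; here is a generic sketch with hypothetical names and a placeholder computation, since the real IndexableBinaryStringTools signatures may differ:

```java
import java.nio.CharBuffer;

// Hypothetical sketch of the delegation pattern described in the patch:
// the Buffer-based method unwraps its backing array and forwards to the
// array/offset/length primitive, so hot paths can skip Buffer wrapping.
public class DelegationSketch {
    /** Primitive: operate directly on a char[] slice (placeholder computation). */
    static int encodedLength(char[] input, int offset, int length) {
        return length * 2;
    }

    /** Legacy entry point: delegates to the array-based method. */
    static int encodedLength(CharBuffer input) {
        // The Buffer methods previously required a backing array anyway.
        return encodedLength(input.array(),
                input.arrayOffset() + input.position(), input.remaining());
    }

    public static void main(String[] args) {
        char[] data = "collation".toCharArray();
        if (encodedLength(data, 0, data.length) != encodedLength(CharBuffer.wrap(data))) {
            throw new AssertionError("delegation should give identical results");
        }
    }
}
```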
[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2084: Attachment: LUCENE-2084.patch synched to current trunk, after the LUCENE-2124 move > remove Byte/CharBuffer wrapping for collation key generation > > > Key: LUCENE-2084 > URL: https://issues.apache.org/jira/browse/LUCENE-2084 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2084.patch, LUCENE-2084.patch > > > We can remove the overhead of ByteBuffer and CharBuffer wrapping in > CollationKeyFilter and ICUCollationKeyFilter. > this patch moves the logic in IndexableBinaryStringTools into char[],int,int > and byte[],int,int based methods, with the previous Byte/CharBuffer methods > delegating to these. > Previously, the Byte/CharBuffer methods required a backing array anyway. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2178) Benchmark contrib should allow multiple locations in ext.classpath
[ https://issues.apache.org/jira/browse/LUCENE-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793627#action_12793627 ] Steven Rowe commented on LUCENE-2178: - Trivial patch to fix (works with single or multiple locations): {code} Index: contrib/benchmark/build.xml === --- contrib/benchmark/build.xml (revision 892657) +++ contrib/benchmark/build.xml (working copy) @@ -114,7 +114,7 @@ - + {code} > Benchmark contrib should allow multiple locations in ext.classpath > -- > > Key: LUCENE-2178 > URL: https://issues.apache.org/jira/browse/LUCENE-2178 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Affects Versions: 3.0 >Reporter: Steven Rowe >Priority: Minor > > When {{ant run-task}} is invoked with the {{-Dbenchmark.ext.classpath=...}} > option, only a single location may be specified. If a classpath with more > than one location is specified, none of the locations is put on the classpath > for the invoked JVM. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
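For context on why a single value works but several fail: an Ant {{&lt;pathelement&gt;}} with a {{location}} attribute names exactly one file or directory, while a path-valued attribute is split on the platform path separator. (That reading of the stripped diff above is an inference; the literal patch content was lost in this archive.) The splitting behavior itself can be sketched as:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: a classpath *string* like "a.jar:lib/b.jar" (or "a.jar;lib\b.jar"
// on Windows) only becomes multiple locations once it is split on the
// platform's path separator; an attribute that treats the whole string as
// one location cannot represent more than one entry.
public class ClasspathSplit {
    static List<String> split(String classpath, char separator) {
        return Arrays.asList(classpath.split(String.valueOf(separator)));
    }

    public static void main(String[] args) {
        List<String> locations = split("ext.jar:lib/icu4j.jar", ':');
        System.out.println(locations); // [ext.jar, lib/icu4j.jar]
    }
}
```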
[jira] Updated: (LUCENE-2178) Benchmark contrib should allow multiple locations in ext.classpath
[ https://issues.apache.org/jira/browse/LUCENE-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2178: Description: When {{ant run-task}} is invoked with the {{-Dbenchmark.ext.classpath=...}} option, only a single location may be specified. If a classpath with more than one location is specified, none of the locations is put on the classpath for the invoked JVM. (was: When {{ant run-task}} is invoked with the {{-Dbenchmark.ext.classpath=...} option, only a single location may be specified. If a classpath with more than one location is specified, none of the locations is put on the classpath for the invoked JVM.) > Benchmark contrib should allow multiple locations in ext.classpath > -- > > Key: LUCENE-2178 > URL: https://issues.apache.org/jira/browse/LUCENE-2178 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Affects Versions: 3.0 > Reporter: Steven Rowe >Priority: Minor > > When {{ant run-task}} is invoked with the {{-Dbenchmark.ext.classpath=...}} > option, only a single location may be specified. If a classpath with more > than one location is specified, none of the locations is put on the classpath > for the invoked JVM. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2178) Benchmark contrib should allow multiple locations in ext.classpath
Benchmark contrib should allow multiple locations in ext.classpath -- Key: LUCENE-2178 URL: https://issues.apache.org/jira/browse/LUCENE-2178 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 3.0 Reporter: Steven Rowe Priority: Minor When {{ant run-task}} is invoked with the {{-Dbenchmark.ext.classpath=...} option, only a single location may be specified. If a classpath with more than one location is specified, none of the locations is put on the classpath for the invoked JVM. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2124) move JDK collation to core, ICU collation to ICU contrib
[ https://issues.apache.org/jira/browse/LUCENE-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793074#action_12793074 ] Steven Rowe commented on LUCENE-2124: - Robert, I noticed something you missed in the move - here's a trivial patch:
{code}
Index: contrib/icu/src/java/overview.html
===
--- contrib/icu/src/java/overview.html (revision 892657)
+++ contrib/icu/src/java/overview.html (working copy)
@@ -34,7 +34,7 @@
 CollationKeys. icu4j-collation-4.0.jar, a trimmed-down version of icu4j-4.0.jar that contains only the code and data needed to support collation, is included in Lucene's Subversion
-repository at contrib/collation/lib/.
+repository at contrib/icu/lib/.
 Use Cases
{code}
> move JDK collation to core, ICU collation to ICU contrib > > > Key: LUCENE-2124 > URL: https://issues.apache.org/jira/browse/LUCENE-2124 > Project: Lucene - Java > Issue Type: Task > Components: contrib/*, Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2124.patch, LUCENE-2124.patch > > > As mentioned on the list, I propose we move the JDK-based > CollationKeyFilter/CollationKeyAnalyzer, currently located in > contrib/collation into core for collation support (language-sensitive sorting) > These are not much code (the heavy duty stuff is already in core, > IndexableBinaryString). > And I would also like to move the > ICUCollationKeyFilter/ICUCollationKeyAnalyzer (along with the jar file they > depend on) also currently located in contrib/collation into a contrib/icu. > This way, we can start looking at integrating other functionality from ICU > into a fully-fleshed out icu contrib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2124) move JDK collation to core, ICU collation to ICU contrib
[ https://issues.apache.org/jira/browse/LUCENE-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787654#action_12787654 ] Steven Rowe commented on LUCENE-2124: -
bq. this will move the contrib/collation JDK-based components to core
+1
bq. and later we should consider deprecating the alternatives that are not scalable.
The alternatives don't scale well, true, but they don't result in non-human-readable index terms, either, so for people that need human-readable index terms and who have a low-cardinality term set, maybe we should leave the alternatives in place?
bq. this will move the contrib/collation ICU based components to contrib/ICU, and this is where I want to bring the unicode 5.2 support.
+1
> move JDK collation to core, ICU collation to ICU contrib > > > Key: LUCENE-2124 > URL: https://issues.apache.org/jira/browse/LUCENE-2124 > Project: Lucene - Java > Issue Type: Task > Components: contrib/*, Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2124.patch, LUCENE-2124.patch > > > As mentioned on the list, I propose we move the JDK-based > CollationKeyFilter/CollationKeyAnalyzer, currently located in > contrib/collation into core for collation support (language-sensitive sorting) > These are not much code (the heavy duty stuff is already in core, > IndexableBinaryString). > And I would also like to move the > ICUCollationKeyFilter/ICUCollationKeyAnalyzer (along with the jar file they > depend on) also currently located in contrib/collation into a contrib/icu. > This way, we can start looking at integrating other functionality from ICU > into a fully-fleshed out icu contrib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
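The human-readability tradeoff mentioned above can be seen directly with java.text: collation keys compare correctly under a locale's rules, but the bytes that would be stored as index terms are opaque. A minimal sketch:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Collation keys sort language-sensitively, but their byte images are
// opaque -- the human-readability tradeoff discussed above for the
// CollationKeyFilter/ICUCollationKeyFilter approach.
public class CollationKeyDemo {
    static boolean sortsBeforeFrench(String a, String b) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        return collator.getCollationKey(a).compareTo(collator.getCollationKey(b)) < 0;
    }

    public static void main(String[] args) {
        System.out.println(sortsBeforeFrench("côte", "péché")); // true under French rules
        // The indexed form would be the key's bytes, not readable text:
        CollationKey key = Collator.getInstance(Locale.FRENCH).getCollationKey("côte");
        System.out.println(key.toByteArray().length); // weight bytes, not characters
    }
}
```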
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785524#action_12785524 ] Steven Rowe commented on LUCENE-2074: - Thanks, Uwe, that makes sense. My bad, I only skimmed the patch, and misunderstood "3.0" in one of the new files to refer to the Lucene version, not the Unicode version. :) > Use a separate JFlex generated Unicode 4 by Java 5 compatible > StandardTokenizer > --- > > Key: LUCENE-2074 > URL: https://issues.apache.org/jira/browse/LUCENE-2074 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 3.0 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.1 > > Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, > LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, > LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch > > > The current trunk version of StandardTokenizerImpl was generated by Java 1.4 > (according to the warning). In Java 3.0 we switch to Java 1.5, so we should > regenerate the file. > After regeneration the Tokenizer behaves different for some characters. > Because of that we should only use the new TokenizerImpl when > Version.LUCENE_30 or LUCENE_31 is used as matchVersion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785414#action_12785414 ] Steven Rowe commented on LUCENE-2074: -
bq. Do you see a problem with just requiring Flex 1.5 for Lucene trunk at the moment?
I think it's fine to do that.
bq. The new parsers (see patch) are pre-generated in SVN, so somebody compiling lucene from source does need to use jflex. And the parsers for StandardTokenizer are verified to work correct and are even identical (DFA wise) for the old Java 1.4 / Unicode 3.0 case.
Most of the StandardTokenizerImpl.jflex grammar is expressed in absolute terms - the only JVM-/Unicode-version-sensitive usages are [:letter:] and [:digit:], which under JFlex <1.5 were expanded using the scanner-generation-time JVM's Character.isLetter() and .isDigit() definitions, but under JFlex 1.5-SNAPSHOT depend on the declared Unicode version definitions (i.e., [:letter:] = \p{Letter}). I'm actually surprised that the DFAs are identical, since I'm almost certain that the set of characters matching [:letter:] changed between Unicode 3.0 and Unicode 4.0 (maybe [:digit:] too). I'll take a look this weekend.
> Use a separate JFlex generated Unicode 4 by Java 5 compatible > StandardTokenizer > --- > > Key: LUCENE-2074 > URL: https://issues.apache.org/jira/browse/LUCENE-2074 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 3.0 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.1 > > Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, > LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, > LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch > > > The current trunk version of StandardTokenizerImpl was generated by Java 1.4 > (according to the warning). In Java 3.0 we switch to Java 1.5, so we should > regenerate the file. > After regeneration the Tokenizer behaves different for some characters. 
> Because of that we should only use the new TokenizerImpl when > Version.LUCENE_30 or LUCENE_31 is used as matchVersion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785344#action_12785344 ] Steven Rowe commented on LUCENE-2074: - bq. Will the old jflex fail on %unicode {x.y} syntax ??? I haven't tested it, but JFlex <1.5 likely will fail on this syntax, since nothing is expected after the %unicode directive. bq. Hopefully JFlex 1.5 comes out until we release 3.1, I would be happy. I think the JFlex 1.5 release will happen before March of next year, since we're down to just a few blocking issues. > Use a separate JFlex generated Unicode 4 by Java 5 compatible > StandardTokenizer > --- > > Key: LUCENE-2074 > URL: https://issues.apache.org/jira/browse/LUCENE-2074 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 3.0 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.1 > > Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, > LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, > LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch > > > The current trunk version of StandardTokenizerImpl was generated by Java 1.4 > (according to the warning). In Java 3.0 we switch to Java 1.5, so we should > regenerate the file. > After regeneration the Tokenizer behaves different for some characters. > Because of that we should only use the new TokenizerImpl when > Version.LUCENE_30 or LUCENE_31 is used as matchVersion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2077) changes-to-html: better handling of bulleted lists in CHANGES.txt
[ https://issues.apache.org/jira/browse/LUCENE-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2077: Attachment: LUCENE-2077.patch Patch to handle bulleted lists in CHANGES.txt, and remove tag workarounds from CHANGES.txt. > changes-to-html: better handling of bulleted lists in CHANGES.txt > - > > Key: LUCENE-2077 > URL: https://issues.apache.org/jira/browse/LUCENE-2077 > Project: Lucene - Java > Issue Type: Improvement > Components: Website >Affects Versions: 2.9.1 > Reporter: Steven Rowe >Priority: Trivial > Fix For: 3.0 > > Attachments: LUCENE-2077.patch > > > - bulleted lists > - should be rendered > - as such > - in output HTML -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2077) changes-to-html: better handling of bulleted lists in CHANGES.txt
changes-to-html: better handling of bulleted lists in CHANGES.txt - Key: LUCENE-2077 URL: https://issues.apache.org/jira/browse/LUCENE-2077 Project: Lucene - Java Issue Type: Improvement Components: Website Affects Versions: 2.9.1 Reporter: Steven Rowe Priority: Trivial Fix For: 3.0 - bulleted lists - should be rendered - as such - in output HTML -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1689) supplementary character handling
[ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778528#action_12778528 ] Steven Rowe commented on LUCENE-1689: - I don't know if this is the right place to point this out, but: JFlex-generated scanners (e.g. StandardAnalyzer) do not properly handle supplementary characters. Unfortunately, it looks like the as-yet-unreleased JFlex 1.5 will not support supplementary characters either, so this will be a gap in Lucene's Unicode handling for a while. > supplementary character handling > > > Key: LUCENE-1689 > URL: https://issues.apache.org/jira/browse/LUCENE-1689 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, > LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt > > > for Java 5. Java 5 is based on unicode 4, which means variable-width encoding. > supplementary character support should be fixed for code that works with > char/char[] > For example: > StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be > changed so they don't actually remove suppl characters, or modified to look > for surrogates and behave correctly. > LowercaseFilter should be modified to lowercase suppl. characters correctly. > CharTokenizer should either be deprecated or changed so that isTokenChar() > and normalize() use int. > in all of these cases code should remain optimized for the BMP case, and > suppl characters should be the exception, but still work. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
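The supplementary-character gap noted above is easy to see with plain Java 5 APIs: a code point above U+FFFF is one abstract character but two chars, so any scanner that classifies 16-bit code units one at a time (as the generated DFAs do) sees two surrogates rather than one letter. For example:

```java
// U+10400 (DESERET CAPITAL LETTER LONG I) is a letter outside the BMP:
// in UTF-16 it is a surrogate pair, so char-based classification fails
// where code-point-based classification succeeds.
public class SupplementaryDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10400));
        System.out.println(s.length());                      // 2 chars (surrogate pair)
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Character.isLetter(s.charAt(0))); // false: lone surrogate
        System.out.println(Character.isLetter(0x10400));     // true: int code point
    }
}
```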
[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character
[ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772211#action_12772211 ] Steven Rowe commented on LUCENE-2019: -
{quote}
Steven, by the way, I think something i havent been able to communicate properly, is that I feel very strongly that storing noncharacters in term text where they are treated as abstract characters, is very different than using them as sentinel values / delimiters / etc in the index format, I think this is ok and is what they are for. but term text is different, search engines index human language and by putting noncharacters in term text you are treating them as abstract characters.
{quote}
Robert, you are a proponent of the (ICU)CollationKeyFilter functionality, which uses IndexableBinaryStringTools to store arbitrary binary data in a Lucene index. These filters store non-human-readable terms in the index. I can think of several other examples of using Lucene indexes to store non-human-language terms. Character data, in addition to representing characters, is *data*. Bits. I would argue that you *always* need context to figure out what bits represent.
> map unicode process-internal codepoints to replacement character > > > Key: LUCENE-2019 > URL: https://issues.apache.org/jira/browse/LUCENE-2019 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Robert Muir >Priority: Minor > Attachments: LUCENE-2019.patch > > > A spinoff from LUCENE-2016. > There are several process-internal codepoints in unicode, we should not store > these in the index. > Instead they should be mapped to replacement character (U+FFFD), so they can > be used process-internally. > An example of this is how Lucene Java currently uses U+FFFF > process-internally, it can't be in the index or will cause problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character
[ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772184#action_12772184 ] Steven Rowe commented on LUCENE-2019: -
bq. by disallowing all noncharacters as term text, lucene is *more free* to use them as delimiters, and sentinel values, and such, as specified in chapter 3 of the standard.
Lucene is more free, but Lucene's users are not. Quite the contrary. IMHO, Lucene's users (applications that incorporate the Lucene library) should be able to use Unicode data in ways that the standard allows ("Applications are free to use any of these noncharacter code points internally"). U+FFFF was chosen for Lucene-internal use for reasons very similar to those you're bringing up, Robert: something like "who would ever want to use non-characters in an index?" However, this choice does not obligate Lucene to take the same action for all other non-characters. I think the fix here is documentation, not proscription.
> map unicode process-internal codepoints to replacement character > > > Key: LUCENE-2019 > URL: https://issues.apache.org/jira/browse/LUCENE-2019 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Robert Muir >Priority: Minor > Attachments: LUCENE-2019.patch > > > A spinoff from LUCENE-2016. > There are several process-internal codepoints in unicode, we should not store > these in the index. > Instead they should be mapped to replacement character (U+FFFD), so they can > be used process-internally. > An example of this is how Lucene Java currently uses U+FFFF > process-internally, it can't be in the index or will cause problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character
[ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772174#action_12772174 ] Steven Rowe commented on LUCENE-2019: -
bq. if you disagree with this patch, then you should also disagree with treating U+FFFF special!
Quoting myself from an earlier comment on this issue (apologies):
bq. Instituting this consistency precludes Lucene-index-as-process-internal use cases. I would argue that the price of consistency is in this case too high.
So you think that enforcing consistency is worth the cost of disallowing some usages, and I don't.
> map unicode process-internal codepoints to replacement character > > > Key: LUCENE-2019 > URL: https://issues.apache.org/jira/browse/LUCENE-2019 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Robert Muir >Priority: Minor > Attachments: LUCENE-2019.patch > > > A spinoff from LUCENE-2016. > There are several process-internal codepoints in unicode, we should not store > these in the index. > Instead they should be mapped to replacement character (U+FFFD), so they can > be used process-internally. > An example of this is how Lucene Java currently uses U+FFFF > process-internally, it can't be in the index or will cause problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character
[ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772164#action_12772164 ] Steven Rowe commented on LUCENE-2019: -
Lucene is not an application. Again, quoting from section 16.7 (emphasis mine):
bq. *Applications* are free to use any of these noncharacter code points internally but should never attempt to exchange them.
The forbidden operation is exchanging non-characters across the *application* boundary. Asking Lucene to store non-characters for you is not a violation of the Unicode standard. Lucene agreeing to do so is not a violation of the Unicode standard. If a Lucene user later uses a Lucene index to exchange data (of whatever form) across the application boundary, that's on the user, not on Lucene. (I'll skip the Lucene-as-a-weapon metaphor. You can thank me later.)
> map unicode process-internal codepoints to replacement character > > > Key: LUCENE-2019 > URL: https://issues.apache.org/jira/browse/LUCENE-2019 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Robert Muir >Priority: Minor > Attachments: LUCENE-2019.patch > > > A spinoff from LUCENE-2016. > There are several process-internal codepoints in unicode, we should not store > these in the index. > Instead they should be mapped to replacement character (U+FFFD), so they can > be used process-internally. > An example of this is how Lucene Java currently uses U+FFFF > process-internally, it can't be in the index or will cause problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character
[ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772151#action_12772151 ] Steven Rowe commented on LUCENE-2019: -
bq. process-internal is something that won't be stored or interchanged in any way (internal to the process)
Right, this is the crux of the disagreement: you think storage (with the exception of in-memory usage) means interchange. I and Yonik think that storage does not necessarily mean interchange. Section 16.7 (_Noncharacters_) of the Unicode 5.0.0 standard (the latest version for which an electronic version of this chapter is available) says:
{quote}
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data. See Section 3.4, Characters and Encoding, for the formal definition of noncharacters and conformance requirements related to their use.
The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not "Arabic noncharacters" or "right-to-left noncharacters," and are not distinguished in any other way from the other noncharacters, except in their code point values.
Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. 
Note that Unicode conformance freely allows the removal of these characters. (See conformance clause C7 in Section 3.2, Conformance Requirements.) In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses. *U+FFFF and U+10FFFF.* These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on. {quote} (I left out the last part about U+FFFE.) Again, the crux of the matter is the definition of "open interchange". > map unicode process-internal codepoints to replacement character > > > Key: LUCENE-2019 > URL: https://issues.apache.org/jira/browse/LUCENE-2019 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Robert Muir >Priority: Minor > Attachments: LUCENE-2019.patch > > > A spinoff from LUCENE-2016. > There are several process-internal codepoints in unicode, we should not store > these in the index. > Instead they should be mapped to replacement character (U+FFFD), so they can > be used process-internally. > An example of this is how Lucene Java currently uses U+FFFF > process-internally, it can't be in the index or will cause problems. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
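The noncharacter ranges quoted above are easy to test for in code. The following is an illustrative sketch (not Lucene code; the class and method names are made up for this example):

```java
public final class Noncharacters {
    /**
     * True if the code point is one of the 66 Unicode noncharacters:
     * the range U+FDD0..U+FDEF, plus the last two code points of each of the
     * 17 planes (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF).
     */
    public static boolean isNoncharacter(int codePoint) {
        // (cp & 0xFFFE) == 0xFFFE is true exactly for code points ending in FFFE or FFFF
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || ((codePoint & 0xFFFE) == 0xFFFE && codePoint <= 0x10FFFF);
    }
}
```

Such a predicate is what an index-writing filter would consult before deciding whether a code point may be stored or must be folded/removed.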
[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character
[ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772133#action_12772133 ] Steven Rowe commented on LUCENE-2019: - bq. Steven, the only reason I might disagree is that a Lucene Index is supposed to be portable across different languages other than Lucene Java. Right, but not all Lucene indexes in-the-wild are accessed from more than one language. The vast majority of Lucene index uses, I'd venture to guess, are single-language, single-process uses. bq. in my opinion, if you are to store process-internal codepoints as abstract characters in terms, then you should not claim that Lucene indexes are in any Unicode format, because then they violate the standard. I strongly disagree with the assumption that interchange and serialization are synonymous. bq. By *not* storing them in terms, then you are free to use them as delimiters, or other purposes. right now U+FFFF is used as a delimiter, but who knows, maybe someday you might need more? I actually agree with this argument. What if Lucene needs more process-internal characters? I don't have any way of gauging the probability that it will in the future (other than the last eight years of history, during which only one was deemed necessary). But what does Mike M. say? "Design for now" or something like that? > map unicode process-internal codepoints to replacement character > > > Key: LUCENE-2019 > URL: https://issues.apache.org/jira/browse/LUCENE-2019 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Robert Muir >Priority: Minor > Attachments: LUCENE-2019.patch > > > A spinoff from LUCENE-2016. > There are several process-internal codepoints in unicode, we should not store > these in the index. > Instead they should be mapped to replacement character (U+FFFD), so they can > be used process-internally. 
> An example of this is how Lucene Java currently uses U+FFFF > process-internally, it can't be in the index or will cause problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
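The folding the patch proposes — mapping noncharacters to U+FFFD REPLACEMENT CHARACTER — could look roughly like this sketch (hypothetical illustration code; it is not the attached LUCENE-2019.patch):

```java
public final class ReplacementFolding {
    // Same noncharacter test as in the Unicode quote above:
    // U+FDD0..U+FDEF plus the ..FFFE/..FFFF pair of each plane.
    private static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF)
            || ((cp & 0xFFFE) == 0xFFFE && cp <= 0x10FFFF);
    }

    /** Replace every noncharacter code point with U+FFFD before indexing. */
    public static String foldToReplacement(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

For example, `foldToReplacement("cat\uFFFFdog")` yields `"cat\uFFFDdog"`, so the process-internal sentinel never reaches the index.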
[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character
[ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772118#action_12772118 ] Steven Rowe commented on LUCENE-2019: - Lucene indexes can be used both process-internally and across processes (e.g. Solr). This patch enforces the Lucene-index-as-process-external view, and excludes the possibility that a Lucene index is used process-internally. Since Lucene itself uses U+FFFF internally, no clients can use it for their own purposes. This patch rationalizes handling of internal-use-only characters, such that Lucene's behavior is made consistent for all of them. Instituting this consistency precludes Lucene-index-as-process-internal use cases. I would argue that the price of consistency is in this case too high. My vote: document the crap out of the U+FFFF Lucene-internal-use character and drop this patch. If people want to use internal-use-only characters in Lucene indexes, as long as Lucene doesn't reserve them for its own use, why stop them? > map unicode process-internal codepoints to replacement character > > > Key: LUCENE-2019 > URL: https://issues.apache.org/jira/browse/LUCENE-2019 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Robert Muir >Priority: Minor > Attachments: LUCENE-2019.patch > > > A spinoff from LUCENE-2016. > There are several process-internal codepoints in unicode, we should not store > these in the index. > Instead they should be mapped to replacement character (U+FFFD), so they can > be used process-internally. > An example of this is how Lucene Java currently uses U+FFFF > process-internally, it can't be in the index or will cause problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1902) Changes.html not explicitly included in release
[ https://issues.apache.org/jira/browse/LUCENE-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752770#action_12752770 ] Steven Rowe commented on LUCENE-1902: - Maybe *Main* should be changed to be the conventional *Core* (the standard term when differentiating from *Contrib*) in the new Changes menu? > Changes.html not explicitly included in release > --- > > Key: LUCENE-1902 > URL: https://issues.apache.org/jira/browse/LUCENE-1902 > Project: Lucene - Java > Issue Type: Bug >Reporter: Hoss Man >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1902.patch, LUCENE-1902.patch > > > None of the release related ant targets explicitly call changes-to-html ... > this seems like an oversight. (currently it's only called as part of the > nightly target) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1898) Decide if we should remove lines numbers from latest Changes
[ https://issues.apache.org/jira/browse/LUCENE-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752323#action_12752323 ] Steven Rowe edited comment on LUCENE-1898 at 9/7/09 8:55 PM: - Patch to changes2html.pl that can handle '\*' as bulleted item indicator. Also converts numbered items in contrib/CHANGES.txt for 2.9 release to '\*' bullets. This patch incorporates Mark's numbered->bulleted modifications to CHANGES.txt, as well as correcting one numbered item that Mark missed, and converting tabs to spaces in the first section, so that the method parameters line up in the output HTML. was (Author: steve_rowe): Patch to changes2html.pl that can handle '*' as bulleted item indicator. Also converts numbered items in contrib/CHANGES.txt for 2.9 release to '*' bullets. This patch incorporates Mark's numbered->bulleted modifications to CHANGES.txt, as well as correcting one numbered item that Mark missed, and converting tabs to spaces in the first section, so that the method parameters line up in the output HTML. > Decide if we should remove lines numbers from latest Changes > > > Key: LUCENE-1898 > URL: https://issues.apache.org/jira/browse/LUCENE-1898 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Mark Miller > Fix For: 2.9 > > Attachments: LUCENE-1898.patch, LUCENE-1898.patch > > > As Lucene dev has grown, a new issue has arisen - many times, new changes > invalidate old changes. A proper changes file should just list the changes > from the last version, not document the dev life of the issues. Keeping > changes in proper order now requires a lot of renumbering sometimes. The > numbers have no real meaning and could be added to more rich versions (such > as the html version) automatically if desired. > I think an * makes a good replacement myself. The issues already have ids > that are stable, rather than the current, decorative numbers which are > subject to change over a dev cycle. 
> I think we should replace the numbers with an asterisk for the 2.9 section and > going forward (ie 4. becomes *). > If we don't get consensus very quickly, this issue won't block. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1898) Decide if we should remove lines numbers from latest Changes
[ https://issues.apache.org/jira/browse/LUCENE-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752286#action_12752286 ] Steven Rowe edited comment on LUCENE-1898 at 9/7/09 8:56 PM: - {{changes2html.pl}} doesn't fully grok the new format - items are numbered, but the asterisks are left in in some cases. I'll work up a patch. was (Author: steve_rowe): {{changes-to-html.pl}} doesn't fully grok the new format - items are numbered, but the asterisks are left in in some cases. I'll work up a patch. > Decide if we should remove lines numbers from latest Changes > > > Key: LUCENE-1898 > URL: https://issues.apache.org/jira/browse/LUCENE-1898 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Mark Miller > Fix For: 2.9 > > Attachments: LUCENE-1898.patch, LUCENE-1898.patch > > > As Lucene dev has grown, a new issue has arisen - many times, new changes > invalidate old changes. A proper changes file should just list the changes > from the last version, not document the dev life of the issues. Keeping > changes in proper order now requires a lot of renumbering sometimes. The > numbers have no real meaning and could be added to more rich versions (such > as the html version) automatically if desired. > I think an * makes a good replacement myself. The issues already have ids > that are stable, rather than the current, decorative numbers which are > subject to change over a dev cycle. > I think we should replace the numbers with an asterisk for the 2.9 section and > going forward (ie 4. becomes *). > If we don't get consensus very quickly, this issue won't block. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1898) Decide if we should remove lines numbers from latest Changes
[ https://issues.apache.org/jira/browse/LUCENE-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1898: Attachment: LUCENE-1898.patch Patch to changes2html.pl that can handle '*' as bulleted item indicator. Also converts numbered items in contrib/CHANGES.txt for 2.9 release to '*' bullets. This patch incorporates Mark's numbered->bulleted modifications to CHANGES.txt, as well as correcting one numbered item that Mark missed, and converting tabs to spaces in the first section, so that the method parameters line up in the output HTML. > Decide if we should remove lines numbers from latest Changes > > > Key: LUCENE-1898 > URL: https://issues.apache.org/jira/browse/LUCENE-1898 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Mark Miller > Fix For: 2.9 > > Attachments: LUCENE-1898.patch, LUCENE-1898.patch > > > As Lucene dev has grown, a new issue has arisen - many times, new changes > invalidate old changes. A proper changes file should just list the changes > from the last version, not document the dev life of the issues. Keeping > changes in proper order now requires a lot of renumbering sometimes. The > numbers have no real meaning and could be added to more rich versions (such > as the html version) automatically if desired. > I think an * makes a good replacement myself. The issues already have ids > that are stable, rather than the current, decorative numbers which are > subject to change over a dev cycle. > I think we should replace the numbers with an asterisk for the 2.9 section and > going forward (ie 4. becomes *). > If we don't get consensus very quickly, this issue won't block. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1898) Decide if we should remove lines numbers from latest Changes
[ https://issues.apache.org/jira/browse/LUCENE-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752286#action_12752286 ] Steven Rowe commented on LUCENE-1898: - {{changes-to-html.pl}} doesn't fully grok the new format - items are numbered, but the asterisks are left in in some cases. I'll work up a patch. > Decide if we should remove lines numbers from latest Changes > > > Key: LUCENE-1898 > URL: https://issues.apache.org/jira/browse/LUCENE-1898 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Mark Miller > Fix For: 2.9 > > Attachments: LUCENE-1898.patch > > > As Lucene dev has grown, a new issue has arisen - many times, new changes > invalidate old changes. A proper changes file should just list the changes > from the last version, not document the dev life of the issues. Keeping > changes in proper order now requires a lot of renumbering sometimes. The > numbers have no real meaning and could be added to more rich versions (such > as the html version) automatically if desired. > I think an * makes a good replacement myself. The issues already have ids > that are stable, rather than the current, decorative numbers which are > subject to change over a dev cycle. > I think we should replace the numbers with an asterisk for the 2.9 section and > going forward (ie 4. becomes *). > If we don't get consensus very quickly, this issue won't block. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
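changes2html.pl is a Perl script, but the parsing change under discussion — accepting '*' as well as a trailing-dot number as the item marker — can be sketched with a single regular expression (illustrative Java for this archive, not the actual patch; the class and method names are invented):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class BulletDemo {
    // Accept either "4." style numbered items or "*" bullets as the item marker,
    // capturing the item text that follows.
    private static final Pattern ITEM = Pattern.compile("^\\s*(?:\\d+\\.|\\*)\\s+(.*)$");

    /** Returns the item text if the line starts a CHANGES entry, else null. */
    public static String itemText(String line) {
        Matcher m = ITEM.matcher(line);
        return m.matches() ? m.group(1) : null;
    }
}
```

With such a pattern, " 4. Fixed a bug" and " * Fixed a bug" both parse to the same item text, which is the behavior the patch adds to the script.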
[jira] Commented: (LUCENE-1883) Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release
[ https://issues.apache.org/jira/browse/LUCENE-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750477#action_12750477 ] Steven Rowe commented on LUCENE-1883: - I searched just now for, but couldn't find, an email thread I recall on java-dev between Doug Cutting and the RM at that point (several years ago) about modifying past releases' CHANGES.txt entries. Doug's position, articulated in that thread (and elsewhere, IIRC), was that people depend on being able to do a diff between CHANGES.txt versions, so once a release was cut, the release notes should never change thereafter. > Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release > - > > Key: LUCENE-1883 > URL: https://issues.apache.org/jira/browse/LUCENE-1883 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Steven Rowe >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1883.patch > > > I noticed a few typos in CHANGES.txt and contrib/CHANGES.txt. (Once they > make it past a release, they're set in stone...) > Will attach a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1883) Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release
[ https://issues.apache.org/jira/browse/LUCENE-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1883: Attachment: LUCENE-1883.patch patch with typos corrected > Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release > - > > Key: LUCENE-1883 > URL: https://issues.apache.org/jira/browse/LUCENE-1883 > Project: Lucene - Java > Issue Type: Improvement > Components: Other > Reporter: Steven Rowe >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1883.patch > > > I noticed a few typos in CHANGES.txt and contrib/CHANGES.txt. (Once they > make it past a release, they're set in stone...) > Will attach a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1883) Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release
Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release - Key: LUCENE-1883 URL: https://issues.apache.org/jira/browse/LUCENE-1883 Project: Lucene - Java Issue Type: Improvement Components: Other Reporter: Steven Rowe Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1883.patch I noticed a few typos in CHANGES.txt and contrib/CHANGES.txt. (Once they make it past a release, they're set in stone...) Will attach a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1883) Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release
[ https://issues.apache.org/jira/browse/LUCENE-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1883: Lucene Fields: [New, Patch Available] (was: [New]) > Fix typos in CHANGES.txt and contrib/CHANGES.txt prior to 2.9 release > - > > Key: LUCENE-1883 > URL: https://issues.apache.org/jira/browse/LUCENE-1883 > Project: Lucene - Java > Issue Type: Improvement > Components: Other > Reporter: Steven Rowe >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1883.patch > > > I noticed a few typos in CHANGES.txt and contrib/CHANGES.txt. (Once they > make it past a release, they're set in stone...) > Will attach a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1873) Update site lucene-sandbox page
[ https://issues.apache.org/jira/browse/LUCENE-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750148#action_12750148 ] Steven Rowe commented on LUCENE-1873: - I think we should add generation of {{Contrib-Changes.html}} from {{contrib/CHANGES.txt}} to the {{changes-to-html}} target in {{build.xml}}: {code:xml} {code} and then link to it from near the top of {{lucene-sandbox/index.xml}}, something like: {code:html} See <a href="http://lucene.apache.org/java/2_9_0/changes/Contrib-Changes.html">Contrib CHANGES</a> for changes included in the current release. {code} > Update site lucene-sandbox page > --- > > Key: LUCENE-1873 > URL: https://issues.apache.org/jira/browse/LUCENE-1873 > Project: Lucene - Java > Issue Type: Bug >Reporter: Mark Miller >Assignee: Mark Miller > Fix For: 2.9 > > Attachments: LUCENE-1873.patch > > > The page has misleading/bad info. One thing I would like to do - but I won't > attempt now (prob good for the modules issue) - is commit to one word - > contrib or sandbox. I think sandbox should be purged myself. > The current page says that the sandbox is kind of a rats nest with various > early stage software that one day may make it into core - that info is > outdated I think. We should replace it, and also specify how the back compat > policy works in contrib eg each contrib can have its own policy, with the > default being no policy. > We should also drop the piece about being open to Lucene's committers and > others - a bit outdated. > We should also either include the other contribs, or change the wording to > indicate that the list is only a sampling of the many contribs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
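The {code:xml} block in the comment above arrived empty in this archive; an Ant fragment along these lines would do what the comment describes (hypothetical — the property names and the way changes2html.pl is invoked are guesses for illustration, not the original snippet):

```xml
<target name="changes-to-html">
  <!-- ... existing Changes.html generation ... -->
  <!-- hypothetical addition: also render contrib/CHANGES.txt -->
  <exec executable="perl" failonerror="true"
        output="${changes.target.dir}/Contrib-Changes.html">
    <arg value="${changes.src.dir}/changes2html.pl"/>
    <arg value="${common.dir}/contrib/CHANGES.txt"/>
  </exec>
</target>
```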
[jira] Commented: (LUCENE-1865) Add a ton of missing license headers throughout test/demo/contrib
[ https://issues.apache.org/jira/browse/LUCENE-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750128#action_12750128 ] Steven Rowe commented on LUCENE-1865: - Two minor license nits: * Mark's r808567 commit under this issue added license declarations to two files that already had them, though the original declarations are slightly differently worded (they contain copyright notices). These two files now each contain two license declarations: {{contrib/benchmark/src/java/org/apache/lucene/benchmark/package.html}} {{contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/package.html}} * I don't know if it matters, but the following three files contain license declarations that include copyright notices ("Copyright 2005 The Apache Software Foundation"), unlike all the license declarations Mark added recently: {{contrib/instantiated/src/java/org/apache/lucene/store/instantiated/package.html}} {{src/java/org/apache/lucene/search/function/package.html}} {{src/java/org/apache/lucene/search/payloads/package.html}} > Add a ton of missing license headers throughout test/demo/contrib > - > > Key: LUCENE-1865 > URL: https://issues.apache.org/jira/browse/LUCENE-1865 > Project: Lucene - Java > Issue Type: Task >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1865-part2.patch, LUCENE-1865.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1880) Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public
[ https://issues.apache.org/jira/browse/LUCENE-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1880: Attachment: LUCENE-1880.patch trivial patch adding public access to currently package private constructors > Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public > > > Key: LUCENE-1880 > URL: https://issues.apache.org/jira/browse/LUCENE-1880 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* > Reporter: Steven Rowe >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1880.patch > > > In contrib/collation, the constructors for CollationKeyAnalyzer and > ICUCollationKeyAnalyzer are package private, and so are effectively unusable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1880) Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public
[ https://issues.apache.org/jira/browse/LUCENE-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1880: Lucene Fields: [New, Patch Available] (was: [New]) > Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public > > > Key: LUCENE-1880 > URL: https://issues.apache.org/jira/browse/LUCENE-1880 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* > Reporter: Steven Rowe >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1880.patch > > > In contrib/collation, the constructors for CollationKeyAnalyzer and > ICUCollationKeyAnalyzer are package private, and so are effectively unusable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1880) Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public
Make contrib/collation/(ICU)CollationKeyAnalyzer constructors public Key: LUCENE-1880 URL: https://issues.apache.org/jira/browse/LUCENE-1880 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Steven Rowe Priority: Trivial Fix For: 2.9 In contrib/collation, the constructors for CollationKeyAnalyzer and ICUCollationKeyAnalyzer are package private, and so are effectively unusable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1876) Some contrib packages are missing a package.html
[ https://issues.apache.org/jira/browse/LUCENE-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1876: Attachment: collation-package.html Here is {{package.html}} for contrib/collation, with content mostly stolen from class comments and test cases. The Turkish collation example is mostly stolen from Robert Muir's TestTurkishCollation.java from LUCENE-1581. > Some contrib packages are missing a package.html > > > Key: LUCENE-1876 > URL: https://issues.apache.org/jira/browse/LUCENE-1876 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Reporter: Mark Miller >Priority: Trivial > Fix For: 2.9 > > Attachments: collation-package.html > > > Dunno if we will get to this one this release, but a few contribs don't have > a package.html (or a good overview that would work as a replacement) - I > don't think this is hugely important, but I think it is important - you > should be able to easily and quickly read a quick overview for each contrib I > think. > So far I have identified collation and spatial. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1873) Update site lucene-sandbox page
[ https://issues.apache.org/jira/browse/LUCENE-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749524#action_12749524 ] Steven Rowe commented on LUCENE-1873: - I'm +1 on switching away from "Sandbox" (no longer used at all) to Contrib. Before you posted your patch, I had written up a new intro for the contrib index.html - feel free to take any of this or ignore it :) : {code:html} The Lucene Java project also contains a workspace, Lucene Contrib (formerly known as the Lucene Sandbox), that is open both to all Lucene Java core committers and to developers whose commit rights are restricted to the Contrib workspace; these developers are referred to as "Contrib committers". The Lucene Contrib workspace hosts the following types of packages: <ul> <li>Various third party contributions.</li> <li>Contributions with third party dependencies - the Lucene Java core distribution has no external runtime dependencies.</li> <li>New ideas that are intended for eventual inclusion into the Lucene Java core.</li> </ul> Users are free to experiment with the components developed in the Contrib workspace, but Contrib packages will not necessarily be maintained, particularly in their current state. The Lucene Java core backwards compatibility commitments (see <a href="http://wiki.apache.org/lucene-java/BackwardsCompatibility">http://wiki.apache.org/lucene-java/BackwardsCompatibility</a>) do not necessarily extend to the packages in the Contrib workspace. See the README.txt file for each Contrib package for details. If the README.txt file does not address its backwards compatibility commitments, users should assume it does not make any compatibility commitments. {code} > Update site lucene-sandbox page > --- > > Key: LUCENE-1873 > URL: https://issues.apache.org/jira/browse/LUCENE-1873 > Project: Lucene - Java > Issue Type: Bug >Reporter: Mark Miller >Assignee: Mark Miller > Fix For: 2.9 > > Attachments: LUCENE-1873.patch > > > The page has misleading/bad info. 
One thing I would like to do - but I won't > attempt now (prob good for the modules issue) - is commit to one word - > contrib or sandbox. I think sandbox should be purged myself. > The current page says that the sandbox is kind of a rats nest with various > early stage software that one day may make it into core - that info is > outdated I think. We should replace it, and also specify how the back compat > policy works in contrib eg each contrib can have its own policy, with the > default being no policy. > We should also drop the piece about being open to Lucene's committers and > others - a bit outdated. > We should also either include the other contribs, or change the wording to > indicate that the list is only a sampling of the many contribs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match
[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060 ] Steven Rowe edited comment on LUCENE-1683 at 7/16/09 11:12 AM: --- bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? JavaUtilRegexCapabilities.match() is implemented as j.u.regex.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern. By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*". The difference in the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile". The fix is to switch JavaUtilRegexCapabilities.match to use Matcher.matches() instead of lookingAt(). was (Author: steve_rowe): bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern. By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*". The difference in the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile". The fix is to switch JavaUtilRegexCapabilities.match to use j.u.Matcher.matches() instead of lookingAt(). 
> RegexQuery matches terms the input regex doesn't actually match
> ---
>
> Key: LUCENE-1683
> URL: https://issues.apache.org/jira/browse/LUCENE-1683
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Affects Versions: 2.3.2
> Reporter: Trejkaz
>
> I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting.
> The regex "cat." will match "cats" but also anything with "cat" and 1+ following letters (e.g. "cathy", "catcher", ...). It is as if there is an implicit .* always added to the end of the regex.
> Here's a unit test for the behaviour I would expect myself:
> @Test
> public void testNecessity() throws Exception {
>     File dir = new File(new File(System.getProperty("java.io.tmpdir")), "index");
>     IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
>     try {
>         Document doc = new Document();
>         doc.add(new Field("field", "cat cats cathy", Field.Store.YES, Field.Index.TOKENIZED));
>         writer.addDocument(doc);
>     } finally {
>         writer.close();
>     }
>     IndexReader reader = IndexReader.open(dir);
>     try {
>         TermEnum terms = new RegexQuery(new Term("field", "cat.")).getEnum(reader);
>         assertEquals("Wrong term", "cats", terms.term());
>         assertFalse("Should have only been one term", terms.next());
>     } finally {
>         reader.close();
>     }
> }
> This test fails on the term check with terms.term() equal to "cathy".
> Our workaround is to mangle the query like this:
> String fixed = String.format("(?:%s)$", original);
[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match
[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060 ] Steven Rowe commented on LUCENE-1683:

bq. ... why is RegexQuery treating the trailing "." as a ".*" instead?

JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern. By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*".

The difference between the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile". The fix is to switch JavaUtilRegexCapabilities.match() to use j.u.Matcher.matches() instead of lookingAt().

> RegexQuery matches terms the input regex doesn't actually match
> ---
>
> Key: LUCENE-1683
> URL: https://issues.apache.org/jira/browse/LUCENE-1683
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Affects Versions: 2.3.2
> Reporter: Trejkaz
>
> I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting.
> The regex "cat." will match "cats" but also anything with "cat" and 1+ following letters (e.g. "cathy", "catcher", ...). It is as if there is an implicit .* always added to the end of the regex.
> Here's a unit test for the behaviour I would expect myself:
> @Test
> public void testNecessity() throws Exception {
>     File dir = new File(new File(System.getProperty("java.io.tmpdir")), "index");
>     IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
>     try {
>         Document doc = new Document();
>         doc.add(new Field("field", "cat cats cathy", Field.Store.YES, Field.Index.TOKENIZED));
>         writer.addDocument(doc);
>     } finally {
>         writer.close();
>     }
>     IndexReader reader = IndexReader.open(dir);
>     try {
>         TermEnum terms = new RegexQuery(new Term("field", "cat.")).getEnum(reader);
>         assertEquals("Wrong term", "cats", terms.term());
>         assertFalse("Should have only been one term", terms.next());
>     } finally {
>         reader.close();
>     }
> }
> This test fails on the term check with terms.term() equal to "cathy".
> Our workaround is to mangle the query like this:
> String fixed = String.format("(?:%s)$", original);
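The lookingAt()-vs-matches() behavior described in the comment above can be verified directly against java.util.regex, outside of Lucene. This is a standalone sketch using only JDK classes; the class name is made up for illustration:

```java
import java.util.regex.Pattern;

public class LookingAtVsMatches {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("cat.");
        // lookingAt() only anchors at the start of the input, so "cathy"
        // passes: the prefix "cath" matches, as if ".*" were appended.
        System.out.println(p.matcher("cathy").lookingAt()); // true
        // matches() anchors at both ends, so only whole-term matches pass.
        System.out.println(p.matcher("cathy").matches());   // false
        System.out.println(p.matcher("cats").matches());    // true
    }
}
```

Matcher.matches() corresponds to the fix proposed in the comment; with it, "cat." accepts "cats" but rejects "cathy".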
[jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter
[ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1719: Attachment: LUCENE-1719.patch Updated patch including information about ICU4J's shorter key length; adding a link to the ICU4J documentation's comparison of ICU4J and java.text.Collator key generation time and key length; and removing specific performance numbers. > Add javadoc notes about ICUCollationKeyFilter's advantages over > CollationKeyFilter > -- > > Key: LUCENE-1719 > URL: https://issues.apache.org/jira/browse/LUCENE-1719 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4.1 >Reporter: Steven Rowe >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1719.patch, LUCENE-1719.patch > > > contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is > faster than CollationKeyFilter, the JVM-provided java.text.Collator > implementation in the same package. The javadocs of these classes should be > modified to add a note to this effect. > My curiosity was piqued by [Robert Muir's > comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] > on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x > faster than CollationKeyFilter. > I timed the operation of these two classes, with Sun JVM versions > 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of > 4 languages (taken from the corresponding Debian wordlist packages and > truncated to the first 90k words after a fixed random shuffling), using > Collators at the default strength, on a Windows Vista 64-bit machine. I used > an analysis pipeline consisting of WhitespaceTokenizer chained to the > collation key filter, so to isolate the time taken by the collation key > filters, I also timed WhitespaceTokenizer operating alone for each > combination. 
The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-ICU) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|156%|
> |1.4.2_17 (32 bit)|French|716|243|14|207%|
> |1.4.2_17 (32 bit)|German|669|264|16|163%|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|102%|
> |1.5.0_15 (32 bit)|English|604|176|16|268%|
> |1.5.0_15 (32 bit)|French|817|209|17|317%|
> |1.5.0_15 (32 bit)|German|799|225|20|280%|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|145%|
> |1.5.0_15 (64 bit)|English|431|89|10|433%|
> |1.5.0_15 (64 bit)|French|562|112|11|446%|
> |1.5.0_15 (64 bit)|German|567|116|13|438%|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|174%|
> |1.6.0_13 (64 bit)|English|162|81|9|113%|
> |1.6.0_13 (64 bit)|French|192|92|10|122%|
> |1.6.0_13 (64 bit)|German|204|99|14|124%|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|39%|
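The percentage in the last column follows from the row values via the stated formula. A minimal sketch checking the first table row (1.4.2_17 / English); the variable names are illustrative, not from Lucene:

```java
public class CollationImprovement {
    public static void main(String[] args) {
        // Values from the first table row: Sun JVM 1.4.2_17 (32 bit), English.
        double jvm = 522;  // java.text.Collator pipeline time, ms
        double icu = 212;  // ICU4J pipeline time, ms
        double wst = 13;   // WhitespaceTokenizer-only time, ms
        // (JVM - ICU) / (ICU - WST), expressed as a percentage improvement:
        double improvementPct = 100 * (jvm - icu) / (icu - wst);
        System.out.printf("%.0f%%%n", improvementPct); // 156%
    }
}
```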
[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter
[ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725023#action_12725023 ] Steven Rowe commented on LUCENE-1719: - bq. [...] i searched lucene source code for java.text.Collator and found some uses of it (the incremental facility). I wonder if in the future we could find a way to allow usage of com.ibm.icu.text.Collator in these spots. +1 I guess the way to go would be to make the implementation pluggable. > Add javadoc notes about ICUCollationKeyFilter's advantages over > CollationKeyFilter > -- > > Key: LUCENE-1719 > URL: https://issues.apache.org/jira/browse/LUCENE-1719 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4.1 >Reporter: Steven Rowe >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1719.patch > > > contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is > faster than CollationKeyFilter, the JVM-provided java.text.Collator > implementation in the same package. The javadocs of these classes should be > modified to add a note to this effect. > My curiosity was piqued by [Robert Muir's > comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] > on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x > faster than CollationKeyFilter. > I timed the operation of these two classes, with Sun JVM versions > 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of > 4 languages (taken from the corresponding Debian wordlist packages and > truncated to the first 90k words after a fixed random shuffling), using > Collators at the default strength, on a Windows Vista 64-bit machine. I used > an analysis pipeline consisting of WhitespaceTokenizer chained to the > collation key filter, so to isolate the time taken by the collation key > filters, I also timed WhitespaceTokenizer operating alone for each > combination. 
The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-ICU) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|156%|
> |1.4.2_17 (32 bit)|French|716|243|14|207%|
> |1.4.2_17 (32 bit)|German|669|264|16|163%|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|102%|
> |1.5.0_15 (32 bit)|English|604|176|16|268%|
> |1.5.0_15 (32 bit)|French|817|209|17|317%|
> |1.5.0_15 (32 bit)|German|799|225|20|280%|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|145%|
> |1.5.0_15 (64 bit)|English|431|89|10|433%|
> |1.5.0_15 (64 bit)|French|562|112|11|446%|
> |1.5.0_15 (64 bit)|German|567|116|13|438%|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|174%|
> |1.6.0_13 (64 bit)|English|162|81|9|113%|
> |1.6.0_13 (64 bit)|French|192|92|10|122%|
> |1.6.0_13 (64 bit)|German|204|99|14|124%|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|39%|
[jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter
[ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1719:

Description:
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package. The javadocs of these classes should be modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine. I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination. The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-ICU) / (ICU-WST).
The best times out of 5 runs for each combination, in milliseconds, are as follows:
||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|156%|
|1.4.2_17 (32 bit)|French|716|243|14|207%|
|1.4.2_17 (32 bit)|German|669|264|16|163%|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|102%|
|1.5.0_15 (32 bit)|English|604|176|16|268%|
|1.5.0_15 (32 bit)|French|817|209|17|317%|
|1.5.0_15 (32 bit)|German|799|225|20|280%|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|145%|
|1.5.0_15 (64 bit)|English|431|89|10|433%|
|1.5.0_15 (64 bit)|French|562|112|11|446%|
|1.5.0_15 (64 bit)|German|567|116|13|438%|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|174%|
|1.6.0_13 (64 bit)|English|162|81|9|113%|
|1.6.0_13 (64 bit)|French|192|92|10|122%|
|1.6.0_13 (64 bit)|German|204|99|14|124%|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|39%|

was: contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package. The javadocs of these classes should be modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine. I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination.
The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:
||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|

Summary: Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter (was: Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter)

Edited title to reflect addition of key length concerns, and switched the performance improvement column to be percentage improvements rather than multiples.
[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
[ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724974#action_12724974 ] Steven Rowe commented on LUCENE-1719: - Cool! Thanks for the link, Robert. Key comparison under Lucene when using *CollationKeyAnalyzer will utilize neither ICU4J's nor the java.text incremental collation facilities - the base-8000h-String-encoded raw collation keys will be directly compared (and sorted) as Strings. So key generation time and, as you point out, key length are the appropriate measures here. I'll post a patch shortly that includes your ICU4J link, and mentions the key length aspect as well. I'll also remove specific numbers from the javadoc notes - people can follow the ICU4J link if they're interested. > Add javadoc notes about ICUCollationKeyFilter's speed advantage over > CollationKeyFilter > --- > > Key: LUCENE-1719 > URL: https://issues.apache.org/jira/browse/LUCENE-1719 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* > Affects Versions: 2.4.1 >Reporter: Steven Rowe >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1719.patch > > > contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is > faster than CollationKeyFilter, the JVM-provided java.text.Collator > implementation in the same package. The javadocs of these classes should be > modified to add a note to this effect. > My curiosity was piqued by [Robert Muir's > comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] > on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x > faster than CollationKeyFilter. 
> I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine. I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination. The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|2.6x|
> |1.4.2_17 (32 bit)|French|716|243|14|3.1x|
> |1.4.2_17 (32 bit)|German|669|264|16|2.6x|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
> |1.5.0_15 (32 bit)|English|604|176|16|3.7x|
> |1.5.0_15 (32 bit)|French|817|209|17|4.2x|
> |1.5.0_15 (32 bit)|German|799|225|20|3.8x|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
> |1.5.0_15 (64 bit)|English|431|89|10|5.3x|
> |1.5.0_15 (64 bit)|French|562|112|11|5.5x|
> |1.5.0_15 (64 bit)|German|567|116|13|5.4x|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
> |1.6.0_13 (64 bit)|English|162|81|9|2.1x|
> |1.6.0_13 (64 bit)|French|192|92|10|2.2x|
> |1.6.0_13 (64 bit)|German|204|99|14|2.2x|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|
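As background for the key-generation-time point in the comment above: on the JDK side of the comparison, a collation key is generated once per term and the keys are then compared directly. A standalone java.text sketch of that pattern (not Lucene's actual base-8000h encoding):

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationKeySort {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        String[] terms = {"côte", "cote", "coté"};
        // Generate each key exactly once; comparing keys is guaranteed to
        // order terms the same way as collator.compare(), but avoids
        // re-walking the collation rules on every comparison.
        CollationKey[] keys = new CollationKey[terms.length];
        for (int i = 0; i < terms.length; i++) {
            keys[i] = collator.getCollationKey(terms[i]);
        }
        Arrays.sort(keys);
        for (CollationKey key : keys) {
            System.out.println(key.getSourceString());
        }
    }
}
```

So key generation time (and key length, once the keys are stored in the index) is what matters, rather than the incremental comparison facilities.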
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724926#action_12724926 ] Steven Rowe commented on LUCENE-1581:

{quote} you could add the JDK collation key filter to core if you wanted a core fix. but the icu one is up to something like 30x faster than the jdk, so why bother :) {quote}

LUCENE-1719 contains some timings I made of the relative speeds of these two implementations. In short, for the platform/language/collator/JVM version combinations I tested, the ICU4J implementation's speed advantage ranges from 1.4x to 5.5x.

> LowerCaseFilter should be able to be configured to use a specific locale.
> -
>
> Key: LUCENE-1581
> URL: https://issues.apache.org/jira/browse/LUCENE-1581
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Digy
> Attachments: TestTurkishCollation.java
>
> // Since I am a .NET programmer, sample code will be in C#, but I don't think it will be a problem to understand.
> // Assume an input text like "İ" and an analyzer like below:
> {code}
> public class SomeAnalyzer : Analyzer
> {
>     public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
>     {
>         TokenStream t = new SomeTokenizer(reader);
>         t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>         t = new LowerCaseFilter(t);
>         return t;
>     }
> }
> {code}
> ASCIIFoldingFilter will return "I", after which LowerCaseFilter will return "i" (if the locale is "en-US") or "ı" (if the locale is "tr-TR"; that means this token should be input to another instance of ASCIIFoldingFilter).
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, but a better approach can be adding a new constructor to LowerCaseFilter and forcing it to use a specific locale.
> {code}
> public sealed class LowerCaseFilter : TokenFilter
> {
> /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>
>     public LowerCaseFilter(TokenStream in) : base(in)
>     {
>     }
>
> /* +++ */ public LowerCaseFilter(TokenStream in, System.Globalization.CultureInfo CultureInfo) : base(in)
> /* +++ */ {
> /* +++ */     this.CultureInfo = CultureInfo;
> /* +++ */ }
>
>     public override Token Next(Token result)
>     {
>         result = Input.Next(result);
>         if (result != null)
>         {
>             char[] buffer = result.TermBuffer();
>             int length = result.termLength;
>             for (int i = 0; i < length; i++)
> /* +++ */       buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
>             return result;
>         }
>         else
>             return null;
>     }
> }
> {code}
> DIGY
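The Turkish-casing problem motivating this issue can be demonstrated with the JDK alone. A standalone Java sketch of the same behavior the C# snippet above targets (class name is illustrative):

```java
import java.util.Locale;

public class LocaleLowerCase {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");
        // In Turkish, uppercase I lowercases to dotless ı (U+0131),
        // and dotted İ (U+0130) lowercases to the ordinary i.
        System.out.println("I".toLowerCase(turkish));        // ı
        System.out.println("I".toLowerCase(Locale.ENGLISH)); // i
        // Character.toLowerCase() takes no locale parameter, which is why
        // a per-character lowercasing loop needs either the String API or
        // an explicit locale-aware mapping to be correct for Turkish.
        System.out.println(Character.toLowerCase('I'));      // i
    }
}
```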
[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
[ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724923#action_12724923 ] Steven Rowe commented on LUCENE-1719:

I also tested ICU4J version 4.2 (released 6 weeks ago), and the timings were nearly identical to those from ICU4J version 4.0 (the one that's in contrib/collation/lib/).

The timings given in the table above were not produced with the "-server" option to the JVM. I separately tested all combinations using the "-server" option: there was no difference for the 32-bit JVMs, but the 64-bit JVMs were roughly 3-4% faster. I got the impression (didn't actually calculate) that although the best times of 5 runs were better for the 64-bit JVMs when using the "-server" option, the average times seemed to be slightly worse. In any case, the performance improvement of the ICU4J implementation over the java.text.Collator implementation was basically unaffected by the use of the "-server" JVM option.

> Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
> ---
>
> Key: LUCENE-1719
> URL: https://issues.apache.org/jira/browse/LUCENE-1719
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Affects Versions: 2.4.1
> Reporter: Steven Rowe
> Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1719.patch
>
> contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package. The javadocs of these classes should be modified to add a note to this effect.
> My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.
> I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine. I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination. The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|2.6x|
> |1.4.2_17 (32 bit)|French|716|243|14|3.1x|
> |1.4.2_17 (32 bit)|German|669|264|16|2.6x|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
> |1.5.0_15 (32 bit)|English|604|176|16|3.7x|
> |1.5.0_15 (32 bit)|French|817|209|17|4.2x|
> |1.5.0_15 (32 bit)|German|799|225|20|3.8x|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
> |1.5.0_15 (64 bit)|English|431|89|10|5.3x|
> |1.5.0_15 (64 bit)|French|562|112|11|5.5x|
> |1.5.0_15 (64 bit)|German|567|116|13|5.4x|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
> |1.6.0_13 (64 bit)|English|162|81|9|2.1x|
> |1.6.0_13 (64 bit)|French|192|92|10|2.2x|
> |1.6.0_13 (64 bit)|German|204|99|14|2.2x|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|