[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Rowe updated LUCENE-2167:
--------------------------------

    Attachment: LUCENE-2167.benchmark.patch

This patch contains the benchmarking implementation I've been using. I'm pretty sure we don't want this stuff in Lucene, so I'm including it here only so that others can reproduce the results. I have hardcoded absolute paths to the ICU4J jar and the contrib/icu jar in the script I use to run the benchmark ({{lucene/contrib/benchmark/scripts/compare.uax29.analyzers.sh}}), so anybody who wants to run this will first have to modify that script.

On #lucene, Robert suggested comparing the performance of the straight ICU4J RBBI against UAX29Tokenizer, so I took his ICUTokenizer and associated classes, stripped out the script-detection logic, and made something I named RBBITokenizer, which is included in this patch.

To run the benchmark, first run "ant jar" in {{lucene/}} to produce the Lucene core jar, and then run it again in {{lucene/contrib/icu/}}. Then, in {{lucene/contrib/benchmark/}}, run {{scripts/compare.uax29.analyzers.sh}}.
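For readers who do not have the patch applied, the general shape of such a comparison (tokenize a corpus with a break-iterator-based tokenizer, repeat the run, keep the best time, report records per second) might be sketched roughly as below. This is not code from the patch: the class {{WordBreakBench}} and all of its methods are hypothetical, it uses the JDK's {{java.text.BreakIterator}} rather than ICU4J's RBBI or any Lucene tokenizer, the three-sentence corpus is a placeholder, and treating each token as one "record" is only an assumption about how the rec/s column is computed. It needs Java 8 or later.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WordBreakBench {

    /** Counts segments between word boundaries that contain a letter or digit. */
    static int countTokens(String text) {
        // A new iterator per call keeps the sketch simple; a real tokenizer would reuse it.
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int count = 0;
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    /** Returns true if any code point in [start, end) is a letter or digit. */
    static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    /** Runs the task the given number of times and returns the smallest elapsed time in seconds. */
    static double bestOf(int runs, Runnable task) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - t0);
        }
        return best / 1e9;
    }

    /** Throughput in records per second (assumed here to be recsPerRun / elapsedSec). */
    static double recsPerSec(long recs, double elapsedSec) {
        return recs / elapsedSec;
    }

    public static void main(String[] args) {
        // Placeholder corpus; a real benchmark would read a much larger document set.
        final String[] docs = {
            "Lucene is a search library written in Java.",
            "UAX#29 describes default word boundaries for Unicode text.",
            "Tokenizers split text into terms before indexing."
        };
        final long[] tokens = new long[1]; // holds the per-run token count so the work is not optimized away
        double bestSec = bestOf(5, () -> {
            long sum = 0;
            for (int rep = 0; rep < 10_000; rep++) {
                for (String doc : docs) {
                    sum += countTokens(doc);
                }
            }
            tokens[0] = sum;
        });
        System.out.printf("recsPerRun=%d rec/s=%.2f elapsedSec=%.4f%n",
                tokens[0], recsPerSec(tokens[0], bestSec), bestSec);
    }
}
```

Taking the best of several runs, rather than the mean, reduces the influence of JIT warm-up and unrelated system activity, which is presumably why timings like these are usually reported as "best of five".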
Here are the results on my machine (Sun JDK 1.6.0_13; Windows Vista/Cygwin; best of five):

||Operation||recsPerRun||rec/s||elapsedSec||
|ICUTokenizer|1268451|548,638.00|2.31|
|RBBITokenizer|1268451|568,047.94|2.23|
|StandardTokenizer|1262799|644,614.06|1.96|
|UAX29Tokenizer|1268451|640,631.81|1.98|

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.benchmark.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the
> standard as much as we can with jflex. Then its name would actually make
> sense.
> Such a transition would involve renaming the old StandardTokenizer to
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff
> can stay with that EuropeanTokenizer, and it could be used by the european
> analyzers.