[
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steven Rowe updated LUCENE-2084:
--------------------------------
Attachment: TopTFWikipediaWords.tar.bz2
TopTFWikipediaWords.tar.bz2 contains a Maven2 project that parses unpacked
Wikipedia dump files and builds a Lucene index from the tokens produced by the
contrib WikipediaTokenizer. It then iterates over the indexed tokens' term docs,
accumulating term frequencies, and stores the results in a bounded priority
queue. Finally, it writes the results in contrib benchmark LineDoc format: the
title field holds the collection term frequency, the date field holds the date
the file was generated, and the body field holds the term text.
This code knows how to handle English, German, French, and Ukrainian, but could
be extended for other languages.
I used this project to generate the line-docs for the 100k most frequent terms
in each of the 4 languages; they are in the collation benchmark archive
attached to this issue.
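
For readers without the archive, the top-terms step described in this comment
can be sketched roughly as follows. This is not the code from the attachment:
it uses only java.util, takes the accumulated frequencies as a plain Map
instead of reading them from a Lucene index, and assumes the LineDoc layout is
title, date, and body separated by tabs. The names TopTermsSketch, TermFreq,
topTerms, and toLineDoc are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTermsSketch {

    /** A term and its collection term frequency (hypothetical helper type). */
    public static final class TermFreq {
        public final String term;
        public final long freq;

        public TermFreq(String term, long freq) {
            this.term = term;
            this.freq = freq;
        }
    }

    /**
     * Returns at most maxSize terms with the highest frequencies, most
     * frequent first. A min-heap bounds memory: whenever the heap grows
     * past maxSize, the least frequent entry is evicted.
     */
    public static List<TermFreq> topTerms(Map<String, Long> freqs, int maxSize) {
        PriorityQueue<TermFreq> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.freq, b.freq));
        for (Map.Entry<String, Long> e : freqs.entrySet()) {
            queue.add(new TermFreq(e.getKey(), e.getValue()));
            if (queue.size() > maxSize) {
                queue.poll(); // drop the current minimum
            }
        }
        List<TermFreq> result = new ArrayList<>(queue);
        result.sort((a, b) -> Long.compare(b.freq, a.freq));
        return result;
    }

    /**
     * Formats one entry as a tab-separated line: frequency as the title,
     * the generation date, then the term text as the body. The tab-separated
     * layout is an assumption about the LineDoc format.
     */
    public static String toLineDoc(TermFreq tf, String date) {
        return tf.freq + "\t" + date + "\t" + tf.term;
    }
}
```

The heap never holds more than maxSize + 1 entries, so memory stays bounded
even when the index contains millions of distinct terms. Terms with equal
frequencies come out in no particular order.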
> remove Byte/CharBuffer wrapping for collation key generation
> ------------------------------------------------------------
>
> Key: LUCENE-2084
> URL: https://issues.apache.org/jira/browse/LUCENE-2084
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch,
> LUCENE-2084.patch, TopTFWikipediaWords.tar.bz2
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in
> CollationKeyFilter and ICUCollationKeyFilter.
> This patch moves the logic in IndexableBinaryStringTools into char[],int,int
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.
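
The delegation pattern in the issue description can be illustrated with a
small sketch. It is not the patch itself: the class name
BufferDelegationSketch is hypothetical, and the encoding here is a toy that
widens each byte to one char, not the scheme IndexableBinaryStringTools uses.
It only shows an array-based core method with a buffer-based overload that
unwraps the backing arrays and delegates.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

public class BufferDelegationSketch {

    /**
     * Array-based core method. Toy encoding for illustration only:
     * each input byte becomes one char in the range 0..255.
     */
    public static void encode(byte[] in, int inOffset, int inLength,
                              char[] out, int outOffset, int outLength) {
        if (outLength < inLength) {
            throw new IllegalArgumentException("output region too small");
        }
        for (int i = 0; i < inLength; i++) {
            out[outOffset + i] = (char) (in[inOffset + i] & 0xFF);
        }
    }

    /**
     * Buffer-based overload kept for existing callers. It requires backing
     * arrays and forwards to the array-based method. Buffer positions are
     * left unchanged in this sketch.
     */
    public static void encode(ByteBuffer in, CharBuffer out) {
        if (!in.hasArray() || !out.hasArray()) {
            throw new IllegalArgumentException("arguments must have backing arrays");
        }
        encode(in.array(), in.arrayOffset() + in.position(), in.remaining(),
               out.array(), out.arrayOffset() + out.position(), out.remaining());
    }
}
```

Callers that already hold a byte[] and a char[] can use the array-based method
directly and skip allocating the wrapper objects, which is the overhead the
issue aims to remove.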
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.