[ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893391#action_12893391 ]

Robert Muir commented on LUCENE-1799:
-------------------------------------

By the way, to explain your results on French and German:

Since the compression encodes each character as a diff from the 'middle of the alphabet' (the previous character's Unicode block), an unaccented char, accented char, unaccented char combination causes two 2-byte diffs.
In UTF-8 that sequence is 4 bytes, but in BOCU-1 it becomes 5.
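
A quick way to see this (a minimal sketch, assuming ICU4J's charset module is on the classpath; it exposes BOCU-1 as a java.nio Charset):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import com.ibm.icu.charset.CharsetProviderICU;

    public class BocuVsUtf8 {
        public static void main(String[] args) {
            // BOCU-1 charset via ICU4J's charset provider (icu4j-charset jar)
            Charset bocu1 = new CharsetProviderICU().charsetForName("BOCU-1");
            // unaccented, accented, unaccented: the accent forces two cross-block diffs
            String s = "a\u00E9a";
            System.out.println("UTF-8 : " + s.getBytes(StandardCharsets.UTF_8).length + " bytes");
            System.out.println("BOCU-1: " + s.getBytes(bocu1).length + " bytes");
        }
    }

Per the diff behavior above, the UTF-8 count should come out as 4 and the BOCU-1 count as 5.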

The reason you saw anything measurable is, I think, because of WhitespaceAnalyzer (which I feel is a tad unrealistic):
for example, all the German stemmers do something with the umlauts (remove them or substitute ue, oe, etc.).

In general, a lot of our analysis for a lot of languages folds and normalizes characters in ways like this, which also helps the compression,
so I think if you used GermanAnalyzer on the German text instead of WhitespaceAnalyzer, you wouldn't see much of a size increase.
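
For instance, once analysis has substituted "oe" for the umlaut, the token stays in the ASCII range and BOCU-1 never needs a cross-block diff. A sketch along the same lines as above (again assuming ICU4J's BOCU-1 charset; this is not the actual GermanAnalyzer API):

    Charset bocu1 = new CharsetProviderICU().charsetForName("BOCU-1");
    for (String token : new String[] { "sch\u00F6n", "schoen" }) {
        // compare the raw UTF-8 size against the BOCU-1 size for each token
        System.out.printf("%-7s utf8=%d bocu1=%d%n",
            token,
            token.getBytes(StandardCharsets.UTF_8).length,
            token.getBytes(bocu1).length);
    }

The folded token should encode at least as compactly in BOCU-1 as in UTF-8, while the unfolded one pays extra for the block switch around the umlaut.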


> Unicode compression
> -------------------
>
>                 Key: LUCENE-1799
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1799
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>    Affects Versions: 2.4.1
>            Reporter: DM Smith
>            Priority: Minor
>         Attachments: Benchmark.java, Benchmark.java, Benchmark.java, 
> LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch
>
>
> In LUCENE-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed, an encoding such as ISO-8859-1 (or whatever covers the 
> input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

