[jira] [Commented] (LUCENE-1799) Unicode compression

2012-03-06 Thread DM Smith (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223294#comment-13223294
 ] 

DM Smith commented on LUCENE-1799:
--

Would someone be able to champion this? It appears to have been ready to go 
for the last 1.5 years. It looks like it is merely a permission problem. I'd 
like to see it get into the 3.x series.

> Unicode compression
> ---
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
> Issue Type: New Feature
> Components: core/store
> Affects Versions: 2.4.1
> Reporter: DM Smith
> Priority: Minor
> Attachments: Benchmark.java, Benchmark.java, Benchmark.java, 
> LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch
>
>
> In LUCENE-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed, an encoding such as ISO-8859-1 (or whatever covers the 
> input) could be used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890411#action_12890411
 ] 

Uwe Schindler commented on LUCENE-1799:
---

For correctness, the code should be: target.offset = buffer.arrayOffset() + 
buffer.position();
In most cases position() will be 0, but leaving it out is quite often an error. 
If you use limit() you also have to use position(), else it's inconsistent. 
arrayOffset() gives the array offset corresponding to position=0.
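
A minimal sketch of the point above, using only java.nio.ByteBuffer. The class 
and method names (BufferOffsetDemo, startInArray, length) are made up for 
illustration and are not from the patch. It shows a buffer whose arrayOffset() 
and position() are both non-zero, so dropping either term would point at the 
wrong byte.

```java
import java.nio.ByteBuffer;

class BufferOffsetDemo {
    /** Index into buffer.array() of the first unread byte. */
    static int startInArray(ByteBuffer buffer) {
        return buffer.arrayOffset() + buffer.position();
    }

    /** Number of unread bytes, consistent with startInArray(). */
    static int length(ByteBuffer buffer) {
        return buffer.limit() - buffer.position(); // same as buffer.remaining()
    }

    public static void main(String[] args) {
        byte[] backing = {10, 11, 12, 13, 14, 15, 16, 17};
        // A slice over backing[2..8): arrayOffset() is 2, position() starts at 0.
        ByteBuffer slice = ByteBuffer.wrap(backing, 2, 6).slice();
        slice.get(); // consume one byte, so position() becomes 1
        System.out.println(startInArray(slice)); // prints 3
        System.out.println(length(slice));       // prints 5
    }
}
```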




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890415#action_12890415
 ] 

Robert Muir commented on LUCENE-1799:
-

Uwe, sure. If we were to implement this I wouldn't use NIO anyway, though. Like 
I said, I don't plan on committing anything (unless something is figured out 
about the patent), but it might be useful to someone.

I tested this on some Hindi text:
||encoding||tii||tis||
|utf-8|60,205|3,740,329|
|bocu-1|28,431|2,168,407|




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890416#action_12890416
 ] 

Uwe Schindler commented on LUCENE-1799:
---

One more thing, in the non-array case:
buffer.get(target.bytes, target.offset, limit);
target's offset should be set to 0 on all write operations to the BytesRef (see 
UnicodeUtil.UTF16toUTF8WithHash()). Otherwise the preceding grow() does not 
resize correctly!
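
A rough sketch of the non-array case, under stated assumptions: Ref is a 
hypothetical stand-in for Lucene's BytesRef (its grow() here only guarantees 
capacity counted from index 0), and copyInto is an invented helper, not code 
from the patch. It illustrates why a stale offset has to be reset before the 
bulk get().

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical stand-in for Lucene's BytesRef: a slice of a byte[]. */
class Ref {
    byte[] bytes = new byte[0];
    int offset;
    int length;

    /** Ensure bytes can hold at least newLength bytes counted from index 0. */
    void grow(int newLength) {
        if (bytes.length < newLength) {
            bytes = Arrays.copyOf(bytes, newLength);
        }
    }
}

class DirectCopyDemo {
    /** Copy the unread bytes of a buffer (with or without a backing array) into target. */
    static void copyInto(ByteBuffer buffer, Ref target) {
        int count = buffer.remaining();
        target.grow(count);
        target.offset = 0;      // write from index 0, which is what grow() sized for
        target.length = count;
        buffer.get(target.bytes, target.offset, count);
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(4); // no accessible backing array
        direct.put(new byte[] {1, 2, 3, 4});
        direct.flip();
        Ref target = new Ref();
        target.offset = 3; // stale offset left over from an earlier use
        copyInto(direct, target);
        System.out.println(target.offset + " " + target.length); // prints 0 4
    }
}
```

Without the reset, the get() would try to write 4 bytes starting at index 3 of 
a 4-byte array and throw IndexOutOfBoundsException.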




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890628#action_12890628
 ] 

Michael McCandless commented on LUCENE-1799:


This is fabulous!  And a great example of what's now possible w/ the cutover to 
opaque binary terms w/ flex -- makes it easy to swap out how terms are encoded.

BOCU-1 is a much more compact encoding than UTF-8 for non-Latin languages.

This encoding would also naturally reduce the RAM required for the terms index 
and Terms/TermsIndex FieldCache (used when you sort by string field) as well, 
since Lucene just loads the [opaque] term bytes into RAM.




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890654#action_12890654
 ] 

Robert Muir commented on LUCENE-1799:
-

bq. You can use any Charset to encode your terms. The javadocs should only 
note, that the byte[] order should be correct for range queries to work

I don't think we should add support for any non-unicode character sets.

bq. If you want your complete index e.g. in ISO-8859-1

I am 100% against doing this.




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890703#action_12890703
 ] 

Michael McCandless commented on LUCENE-1799:


Is there any reason not to make BOCU-1 Lucene's default encoding?

UTF8 penalizes non-english languages, and BOCU-1 does not, and it sounds like 
we expect little to no indexing or searching perf penalty (once we have a 
faster interface to BOCU1, eg our own private impl, like UnicodeUtil).




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890706#action_12890706
 ] 

Robert Muir commented on LUCENE-1799:
-

bq. Is there any reason not to make BOCU-1 Lucene's default encoding?

In my opinion, just IBM :) But maybe we can make a strong implementation and 
they will approve it and grant us a patent license:

http://unicode.org/notes/tn6/#Intellectual_Property

bq. UTF8 penalizes non-english languages, and BOCU-1 does not, and it sounds 
like we expect little to no indexing or searching perf penalty (once we have a 
faster interface to BOCU1, eg our own private impl, like UnicodeUtil).

I'd like to play with swapping it in as the default, just to see what problems 
(if any) there are, and to make sure all queries are supported, etc. I can 
upload a new patch that does it this way and we can play.





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890709#action_12890709
 ] 

Michael McCandless commented on LUCENE-1799:


{quote}

> Is there any reason not to make BOCU-1 Lucene's default encoding?

in my opinion, just IBM :)
{quote}

But... ICU's license is compatible w/ ASL (I think), and includes a working 
impl of BOCU-1, so aren't we in the clear here?  Ie we are free to take that 
impl, tweak it, add to our sources, and include ICU's license in our 
LICENSE/NOTICE?




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890718#action_12890718
 ] 

Robert Muir commented on LUCENE-1799:
-

bq. But... ICU's license is compatible w/ ASL (I think), and includes a working 
impl of BOCU-1, so aren't we in the clear here? Ie we are free to take that 
impl, tweak it, add to our sources, and include ICU's license in our 
LICENSE/NOTICE?

I don't know... personally I wouldn't feel comfortable committing something 
without getting guidance first. But we can explore the technicals with patches 
on this JIRA issue and not check the box, and I think this is all OK for now.





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891101#action_12891101
 ] 

Robert Muir commented on LUCENE-1799:
-

By the way, that patch is huge because I just sucked in the ICU charset stuff 
to have an implementation that works for testing.

It's not intended to ever be that way, as we would just implement the stuff we 
need without this code, but it makes it easier to test since you don't need any 
external jars or have to muck with the build system at all.





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892995#action_12892995
 ] 

Robert Muir commented on LUCENE-1799:
-

I ran tests; each one of Mike's optimizations speeds up the encode.

I think I agree with not unrolling the 4-byte case: the "diff" from the 
previous character has to be > 187659 [0x2dd0b], which is like pottery writings 
and oracle bone script. But the previous ones (2x, 3x) speed up CJK and other 
scripts and are very useful.

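
To make the 187659 threshold concrete, here is a small sketch of how the size 
of a BOCU-1 difference maps to an encoded length. The reach constants are the 
ones I recall from UTN #6 and should be checked against that note; the class 
and method names are invented for illustration. This is a simplification: real 
BOCU-1 takes the difference against a state derived from the previous 
character, and encodes some ASCII controls and space directly.

```java
class Bocu1DiffLength {
    // Reach constants as recalled from UTN #6 (assumption: verify before relying on them).
    static final int REACH_POS_1 = 63,     REACH_NEG_1 = -64;
    static final int REACH_POS_2 = 10512,  REACH_NEG_2 = -10513;
    static final int REACH_POS_3 = 187659, REACH_NEG_3 = -187660;

    /** Number of bytes needed to encode a signed code point difference. */
    static int bytesForDiff(int diff) {
        if (diff >= REACH_NEG_1 && diff <= REACH_POS_1) return 1;
        if (diff >= REACH_NEG_2 && diff <= REACH_POS_2) return 2;
        if (diff >= REACH_NEG_3 && diff <= REACH_POS_3) return 3;
        return 4; // only reached for jumps larger than 187659 (0x2dd0b) code points
    }

    public static void main(String[] args) {
        System.out.println(bytesForDiff(0x2dd0b));     // prints 3
        System.out.println(bytesForDiff(0x2dd0b + 1)); // prints 4
    }
}
```

A jump of more than 0x2dd0b code points between neighboring characters is rare 
in real text, which is the argument for leaving the 4-byte path unoptimized.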




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893280#action_12893280
 ] 

Yonik Seeley commented on LUCENE-1799:
--

I took a stab at benchmarking encoding speed only with some different languages.
I encoded a word at a time (which happens at indexing time).
I used some text from wikipedia in different languages: english, german, 
french, spanish, and chinese.
I used WhitespaceAnalyzer for the first 4 and StandardAnalyzer for the chinese 
(but analysis speed is not measured.)

||encoding||english||german||french||spanish||chinese||
|UTF-8 size|1888|4550|4875|5123|4497|
|BOCU-1 size|1888|4610|4995|5249|4497|
|BOCU slowdown|29%|39%|47%|61%|80%|

I suspect that the StandardAnalyzer is spitting out individual CJK chars, and 
hence the same size of BOCU-1 and UTF-8?
I'll try and see if I can get SmartChineseAnalyzer working and re-run the 
chinese test.





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893288#action_12893288
 ] 

Robert Muir commented on LUCENE-1799:
-

Yonik, what were you benchmarking? I think you should benchmark overall 
indexing time, of which encode is just a blip (<1%).

And yes, since the start state is 0x40, the FIRST CJK char is a diff from 0x40, 
but any subsequent ones yield savings.

In general you won't get much compression for Chinese; I'd say max 25%.
For Russian, Arabic, Hebrew, and Japanese it will do a lot better: max 40%.
For Indian languages you tend to get about 50%.

I also don't know how you encoded a word at a time, because I get quite 
different results. I focused a lot on making 'single-byte diffs' fast (e.g. 
just subtraction), and I think I do a lot better for English than the 160% 
described in http://unicode.org/notes/tn6/

Furthermore, UTF-8 is a complete no-op for English, so a compression algorithm 
that is only 29% slower than (byte) char is good in my book, but I don't 
measure 29% for English.

I don't think there is any problem in encode speed at all.
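
A worked sketch of the "first char is a diff from 0x40" point, which would 
explain identical sizes when an analyzer emits single-character CJK tokens. It 
only models the first character of a term. The reach constants are as I recall 
them from UTN #6, and the class and method names are invented, so treat all of 
it as an assumption to verify rather than as the patch's encoder.

```java
import java.nio.charset.StandardCharsets;

class Bocu1FirstCharDemo {
    static final int INITIAL_STATE = 0x40; // start state mentioned in the comment above

    /** Encoded length of a signed difference (reach constants recalled from UTN #6). */
    static int bytesForDiff(int diff) {
        if (diff >= -64 && diff <= 63) return 1;
        if (diff >= -10513 && diff <= 10512) return 2;
        if (diff >= -187660 && diff <= 187659) return 3;
        return 4;
    }

    /** BOCU-1 length of a code point when it is the first character of a term. */
    static int firstCharBytes(int codePoint) {
        return bytesForDiff(codePoint - INITIAL_STATE);
    }

    /** UTF-8 length of a single code point. */
    static int utf8Bytes(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int han = 0x4E2D;        // a common CJK ideograph
        int devanagari = 0x0915; // DEVANAGARI LETTER KA
        System.out.println(firstCharBytes(han) + " vs " + utf8Bytes(han));               // prints 3 vs 3
        System.out.println(firstCharBytes(devanagari) + " vs " + utf8Bytes(devanagari)); // prints 2 vs 3
    }
}
```

A one-character CJK term therefore costs the same in both encodings under this 
model; any saving has to come from the second character onward.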




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893292#action_12893292
 ] 

Michael Busch commented on LUCENE-1799:
---

Yonik, can you give more details about how you ran your tests?

Was it an isolated string encoding test, or does BOCU slow down overall 
indexing speed by 29%-80% (which would be hard to believe)?




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893297#action_12893297
 ] 

Yonik Seeley commented on LUCENE-1799:
--

bq. Yonik can you give more details about how you ran your tests?

I'm isolating encoding speed only (not analysis, not indexing, etc.) of tokens 
in different languages.
So I took some text from Wikipedia, analyzed it to get a list of char[], then 
encoded each char[] in a loop.  Only the last step is benchmarked, to isolate 
the encode performance.  I'm certainly not claiming that indexing is n% 
slower.
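
For readers who want to reproduce this kind of measurement, here is a minimal 
sketch of an encode-only timing loop. It is not the Benchmark.java attached to 
the issue: the class name EncodeOnlyBenchmark, the whitespace tokenize helper 
(a stand-in for a real analyzer), and the sample sentence are all hypothetical. 
It times only the JDK's UTF-8 encoder, since BOCU-1 is not part of the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class EncodeOnlyBenchmark {
    // Hypothetical stand-in for analyzer output: whitespace-split char[] tokens.
    static List<char[]> tokenize(String text) {
        List<char[]> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toCharArray());
            }
        }
        return tokens;
    }

    // Encodes every token once as UTF-8 and returns the total bytes produced.
    static long encodeAll(List<char[]> tokens) {
        long bytes = 0;
        for (char[] token : tokens) {
            bytes += new String(token).getBytes(StandardCharsets.UTF_8).length;
        }
        return bytes;
    }

    public static void main(String[] args) {
        // Token creation happens here, outside the timed region.
        List<char[]> tokens = tokenize("the quick brown fox jumps over the lazy dog");
        long sink = 0;
        for (int i = 0; i < 100_000; i++) {
            sink += encodeAll(tokens); // warm-up so the JIT compiles the loop
        }
        long start = System.nanoTime();
        for (int i = 0; i < 100_000; i++) {
            sink += encodeAll(tokens); // only this loop is timed
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("encode-only ms=" + elapsedMs + " (sink=" + sink + ")");
    }
}
```

The byte total is accumulated and printed so the JIT cannot discard the 
encoding work. A hand-rolled loop like this is still only a rough guide.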




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893306#action_12893306
 ] 

Yonik Seeley commented on LUCENE-1799:
--

bq. in general you wont get much compression for chinese.. id say max 25%

Ah, OK.
I just tried Russian, using the whitespace analyzer to split tokens, and did 
get a good size savings:

UTF8_size=11056  BOCU-1_size=6810  BOCU-1_slowdown=32%
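
As a quick check on what those two sizes imply, the sketch below computes the 
relative saving. The class and method names are hypothetical; only the two 
size figures come from the run above. It works out to roughly 38% fewer bytes 
than UTF-8.

```java
import java.util.Locale;

final class SizeSavings {
    // Fraction of bytes saved relative to the baseline encoding.
    static double savings(long baselineBytes, long compressedBytes) {
        return 1.0 - (double) compressedBytes / baselineBytes;
    }

    public static void main(String[] args) {
        // Sizes reported above for the Russian sample.
        double saved = savings(11056, 6810);
        System.out.println(String.format(Locale.ROOT, "savings=%.1f%%", 100 * saved));
    }
}
```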




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893309#action_12893309
 ] 

Robert Muir commented on LUCENE-1799:
-

Yonik, please see my issue.

The fact that we can encode 100 million terms in 15 seconds means any speed 
difference is meaningless (though I still insist something is wrong: either 
your benchmark, or it runs slower on your JDK or something, which we should 
try to improve).




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893312#action_12893312
 ] 

Michael McCandless commented on LUCENE-1799:


The char[] -> byte[] encode time is a minuscule part of indexing time.  And, in 
turn, indexing time is far less important than impact on search performance.  
So... let's focus on the search performance here.

Most queries are unaffected by the term encoding; it's only AutomatonQuery (= 
fuzzy, regexp, wildcard) that does a fair amount of decoding...

Net/net BOCU1 sounds like an awesome win over UTF8.




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893317#action_12893317
 ] 

Robert Muir commented on LUCENE-1799:
-

I just insist there is no real difference between this and UTF-8 for encoding 
English...




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893324#action_12893324
 ] 

Yonik Seeley commented on LUCENE-1799:
--

OK, I just tried Robert's Benchmark.java (i.e. fake English word encoding):
UTF8=15731  BOCU-1=16961  (lowest of 5 different runs)

But looking at the benchmark, it looks like the majority of the time could be 
just making random strings.
I made a modified Benchmark.java that pulls out this string creation and only 
tests encoding performance.
Here are my results:

UTF8=2936 BOCU-1=4310
It turns out that making the random strings to encode took up 81% of the UTF8 
time.

System: Win7 64 bit, JVM=Sun 1.6.0_21 64 bit -server
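
The ratios behind those figures can be checked with a few lines of arithmetic. 
This is a sketch with hypothetical class and method names; the four timings 
are the ones reported above, in whatever unit Benchmark.java prints (the 
ratios do not depend on the unit).

```java
import java.util.Locale;

final class BenchmarkBreakdown {
    // Share of the original run spent outside encoding (here, string creation).
    static double setupShare(long totalTime, long encodeOnlyTime) {
        return (double) (totalTime - encodeOnlyTime) / totalTime;
    }

    // Relative slowdown of a candidate versus a baseline (0.10 means 10% slower).
    static double slowdown(long baselineTime, long candidateTime) {
        return (double) candidateTime / baselineTime - 1.0;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
                "string creation share of UTF8 run: %.0f%%", 100 * setupShare(15731, 2936)));
        System.out.println(String.format(Locale.ROOT,
                "BOCU-1 slowdown, original benchmark: %.1f%%", 100 * slowdown(15731, 16961)));
        System.out.println(String.format(Locale.ROOT,
                "BOCU-1 slowdown, encode only: %.1f%%", 100 * slowdown(2936, 4310)));
    }
}
```

With string creation included, the gap looks like about 8%. With it removed, 
the encode-only gap is about 47%, which is why the two benchmarks appeared to 
disagree.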




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893329#action_12893329
 ] 

Robert Muir commented on LUCENE-1799:
-

That's good news: so we can encode 100 million strings in 4.3 seconds? 
I don't think we need to discuss performance any further; this is a complete 
non-issue.
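
To put that figure in per-term units, a small sketch follows. It assumes, as 
the comment above does, that the benchmark encoded 100 million strings and 
that the reported 4310 is in milliseconds; the class and method names are 
hypothetical.

```java
import java.util.Locale;

final class Throughput {
    // Average cost of one item in nanoseconds.
    static double nanosPerItem(long items, double seconds) {
        return seconds * 1e9 / items;
    }

    // Items processed per second.
    static double itemsPerSecond(long items, double seconds) {
        return items / seconds;
    }

    public static void main(String[] args) {
        long strings = 100_000_000L;
        double seconds = 4.3;
        System.out.println(String.format(Locale.ROOT, "ns per string: %.0f",
                nanosPerItem(strings, seconds)));
        System.out.println(String.format(Locale.ROOT, "strings per second: %.0f",
                itemsPerSecond(strings, seconds)));
    }
}
```

Under those assumptions the cost is about 43 ns per string, or roughly 23 
million strings per second.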





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893339#action_12893339
 ] 

Yonik Seeley commented on LUCENE-1799:
--

bq. That's good news: so we can encode 100 million strings in 4.3 seconds? I 
don't think we need to discuss performance any further; this is a complete 
non-issue.

Well... hopefully it's not an issue.
That should really be tested with real indexing when the time comes 
(micro-benchmarks can do funny things).




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893343#action_12893343
 ] 

Robert Muir commented on LUCENE-1799:
-

{quote}
Well... hopefully it's not an issue.
That should really be tested with real indexing when the time comes 
(micro-benchmarks can do funny things).
{quote}

It's definitely not an issue: no Lucene indexer can do anything with 100 
million strings in any reasonable time where this will matter.

Instead, most non-Latin languages will be writing fewer bytes, causing less 
real I/O, using half the RAM at search time, etc., which is way more dramatic.

UTF-8 is a non-option for our internal memory encoding. I'm suggesting BOCU-1, 
but if you want to try to fight me all the way, then I'll start fighting for a 
reversion back to char[] instead... it's at least less biased.




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893353#action_12893353
 ] 

Yonik Seeley commented on LUCENE-1799:
--

Ummm, so you're against actually measuring any indexing performance decrease?  
That's all I was suggesting.




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893357#action_12893357
 ] 

Robert Muir commented on LUCENE-1799:
-

I don't think it's measurable. 100 million strings in 4.3 seconds? This has no 
effect.

Keep in mind, I fixed the analysis in 3.1 and doubled the speed of the default 
English indexing in Solr, 
so if you want to improve indexing speed, I think you will be more successful 
looking at other parts of the code.





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893362#action_12893362
 ] 

Yonik Seeley commented on LUCENE-1799:
--

bq. so if you want to improve indexing speed, I think you will be more 
successful looking at other parts of the code.

I have only been measuring performance at this point, and I haven't expressed 
an opinion about what defaults should be used.
If we convert to BOCU-1 as a default, and if UTF-8 remains an option, then I'd 
at least want to be able to document any trade-offs and when people should 
consider setting the encoding back to UTF-8.





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893368#action_12893368
 ] 

Robert Muir commented on LUCENE-1799:
-

bq. I have only been measuring performance at this point

You haven't really been measuring performance, you have just been trying to 
pick a fight.
# Any difference in encode has almost no effect on indexing speed. Like I said, 
100 million strings in 4.3 seconds?
# You aren't factoring I/O or RAM into the equation for the writing systems 
(of which there are many) where this actually cuts terms to close to half 
their size.
# Since this is a compression algorithm (and I'm still working on it), it's 
vital to include these things, and not post useless benchmarks about whether it 
takes 2.9 or 4.3 seconds to encode 100 million strings, which nothing in Lucene 
can do anything with in any short time anyway.

I have a benchmark for UTF-8, and it's that I have a lot of text that is twice 
as big on disk, causes twice as much I/O, and eats up twice as much RAM as 
it should. 
BOCU-1 fixes that, and at the same time keeps ASCII at a single-byte encoding 
(and other Latin languages are very close).
So everyone can potentially win.
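
The "twice as big" point about UTF-8 is easy to see with the JDK alone. The 
sketch below (hypothetical class name and sample words) counts UTF-8 bytes for 
an ASCII term and a Cyrillic term. BOCU-1 itself is not in the JDK, so its 
output size is not shown here; the claim that it brings such terms close to 
one byte per character is the one made in this thread.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Width {
    // Number of bytes the JDK's UTF-8 encoder produces for a string.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "index";
        // Russian word for "index", written with Unicode escapes so the
        // source file compiles regardless of its encoding.
        String cyrillic = "\u0438\u043d\u0434\u0435\u043a\u0441";
        System.out.println("ascii:    chars=" + ascii.length()
                + " utf8Bytes=" + utf8Bytes(ascii));
        System.out.println("cyrillic: chars=" + cyrillic.length()
                + " utf8Bytes=" + utf8Bytes(cyrillic));
    }
}
```

The ASCII term takes one byte per character, while the six-character Cyrillic 
term takes twelve bytes, two per character.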




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893381#action_12893381
 ] 

Yonik Seeley commented on LUCENE-1799:
--

bq. You haven't really been measuring performance, you have just been trying to 
pick a fight.

I'm sorry if it appeared that way, and apologize for anything I said to 
encourage that perception.

I was genuinely surprised when you reported "now we are faster than utf-8 on 
average for encode", so I set out to benchmark it myself and report back.  In 
addition, I wanted to see what the encoding speed diff was for some different 
languages.




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893386#action_12893386
 ] 

Robert Muir commented on LUCENE-1799:
-

bq. I was genuinely surprised when you reported "now we are faster than utf-8 
on average for encode", so I set out to benchmark it myself and report back. In 
addition, I wanted to see what the encoding speed diff was for some different 
languages.

For all of Unicode, yes. You just didn't pick a good variety of languages, or 
didn't tokenize them well (e.g. using an English tokenizer for Chinese). 
I've been measuring against many, and I already checked the bigram 
(CJKTokenizer) case to make sure that CJK was always smaller (it's not much, 
e.g. 5 bytes instead of 6, but it's better).





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893391#action_12893391
 ] 

Robert Muir commented on LUCENE-1799:
-

By the way, to explain your results on French and German:

Since the compression is a diff from the 'middle of the alphabet' (Unicode 
block), an unaccented char, accented char, unaccented char combination will 
cause two 2-byte diffs.
In UTF-8 this sequence is 4 bytes, but in BOCU-1 it becomes 5.

The reason you saw any measurable difference is, I think, 
WhitespaceAnalyzer (which I feel is a tad unrealistic).
For example, all the German stemmers do something with the umlauts (remove 
them or substitute ue, oe, etc.).

In general, a lot of our analysis for a lot of languages folds and normalizes 
characters in ways like this, which also helps the compression.
So I think if you used GermanAnalyzer on the German text instead of 
WhitespaceAnalyzer, you wouldn't see much of a size increase.
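
To make the 4-versus-5 byte count above concrete, here is a rough sketch that estimates the BOCU-1 length of a string and compares it with UTF-8. It is not the code from the attached patches. The class and method names are made up for illustration, the diff ranges are the ones I recall from the BOCU-1 specification, and it deliberately uses only the general "middle of the 128-code-point block" rule, leaving out the special previous-state handling for Hiragana, CJK ideographs and Hangul.

```java
import java.nio.charset.StandardCharsets;

public class BocuLengthSketch {
    // Hypothetical helper: estimates how many bytes BOCU-1 would need.
    // Simplification (assumption): only the generic state rule is modelled.
    static int bocuLength(String s) {
        int prev = 0x40;                       // initial state
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int c = s.codePointAt(i);
            i += Character.charCount(c);
            if (c <= 0x20) {                   // space and C0 controls: one byte each
                bytes += 1;
                if (c != 0x20) {
                    prev = 0x40;               // controls reset the state, space keeps it
                }
                continue;
            }
            int diff = c - prev;               // difference from the previous state
            if (diff >= -64 && diff <= 63) {
                bytes += 1;
            } else if (diff >= -10513 && diff <= 10512) {
                bytes += 2;
            } else if (diff >= -187660 && diff <= 187659) {
                bytes += 3;
            } else {
                bytes += 4;
            }
            prev = (c & ~0x7f) + 0x40;         // middle of the current 128-code-point block
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "a\u00e9b";                 // unaccented, accented, unaccented
        System.out.println("BOCU-1 (estimated): " + bocuLength(s));
        System.out.println("UTF-8: " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For "a", "é", "b" the state jumps from the Basic Latin block to the Latin-1 Supplement block and back, so both jumps cost two bytes. Once an analyzer folds the "é" to "e", every diff stays inside the single-byte range.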





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893398#action_12893398
 ] 

Robert Muir commented on LUCENE-1799:
-

{quote}
But looking at the benchmark, it looks like the majority of the time could be 
just making random strings.
I made a modified Benchmark.java that pulls out this string creation and only 
tests encoding performance.
Here are my results:

UTF8=2936 BOCU-1=4310
{quote}

I think your benchmark isn't very reliable (I got really different results), so 
I added an extra 0 to do 10x more terms:
char[][] terms = new char[1][];

ret=716132704 UTF-8 encode: 35081
ret=716132704 BOCU-1 encode: 36517

Like I said before, I don't see a 20% difference.
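
For anyone following along without the attachments, the shape of the measurement being argued about is roughly the following. This is only a sketch and not the attached Benchmark.java. The class name, term count, term length and seed are made up, and it times only the JDK's UTF-8 encoder because a BOCU-1 encoder is not part of the standard library. The terms are built before the timed section so that string creation is not measured, and the summed length is printed so that the JIT cannot discard the encoding work.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class EncodeTimingSketch {
    // Builds all terms up front so that term creation is outside the timed loop.
    static String[] makeTerms(int count, long seed) {
        Random random = new Random(seed);
        String[] terms = new String[count];
        for (int i = 0; i < count; i++) {
            int len = 1 + random.nextInt(10);
            StringBuilder sb = new StringBuilder(len);
            for (int j = 0; j < len; j++) {
                sb.append((char) ('a' + random.nextInt(26)));  // arbitrary ASCII letters
            }
            terms[i] = sb.toString();
        }
        return terms;
    }

    // Encodes every term and returns the total byte count as a checksum.
    static long encodeAll(String[] terms) {
        long total = 0;
        for (String term : terms) {
            total += term.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        String[] terms = makeTerms(100_000, 42L);
        for (int i = 0; i < 5; i++) {
            encodeAll(terms);                  // warm-up passes so the JIT has compiled the loop
        }
        long start = System.nanoTime();
        long ret = encodeAll(terms);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("ret=" + ret + " UTF-8 encode: " + elapsedMs + " ms");
    }
}
```

Numbers from a loop like this move around a lot between runs and between JVMs, so it is worth repeating the timed section several times before comparing ratios.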





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893403#action_12893403
 ] 

Yonik Seeley commented on LUCENE-1799:
--

bq. I think your benchmark isn't very reliable (I got really different results), 
so I added an extra 0 to do 10x more terms:

Did that change the ratio for you?  I just tried 10x more terms, and I got the 
exact same ratio:

ret=708532704 UTF-8 encode: 30524
ret=708532704 BOCU-1 encode: 44635




[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893413#action_12893413
 ] 

Robert Muir commented on LUCENE-1799:
-

Yeah, it did (it didn't seem 'stable', but the first run was much different from 
yours, e.g. 3300 vs 3500 or so).

I just ran with -server also [using my same 64-bit 1.6.0_19 as before]. 
There is more of a difference, though not as much as yours:
ret=704032704 UTF-8 encode: 32134
ret=704032704 BOCU-1 encode: 36391

But go figure, if I run with my 32-bit JVM [same JDK: 1.6.0_19], I get horrible 
numbers!
Here is -client:
ret=684832704 UTF-8 encode: 26237
ret=684832704 BOCU-1 encode: 54662

Here is -server:
ret=697132704 UTF-8 encode: 30062
ret=697132704 BOCU-1 encode: 46293

So there is definitely an issue with the 32-bit JVM. Are you sure yours is 64-bit?





[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893456#action_12893456
 ] 

Yonik Seeley commented on LUCENE-1799:
--

Hmmm, interesting.  I'm sure my JVM is 64 bit, and I just double-checked that 
the IDE is using that to launch the benchmark.  The differences we see might be 
down to CPU?

Here's my 64 bit JVM I'm using:
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)

And I just tried a 32 bit one (also -server) I had lying around:
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) Server VM (build 16.0-b13, mixed mode)

32 bit results:
ret=713832704 UTF-8 encode: 35895
ret=713832704 BOCU-1 encode: 55855
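
Since the question of which JVM actually ran the benchmark keeps coming up, one option is to have the benchmark print it alongside the timings. A small sketch follows, with a made-up class name. Note that "sun.arch.data.model" is a HotSpot-specific property and may be missing on other JVMs, which is why it falls back to "unknown".

```java
public class JvmInfoSketch {
    // Collects the properties that identify the running JVM in one line.
    static String describeJvm() {
        // HotSpot-specific property; not guaranteed on other JVMs (assumption).
        String dataModel = System.getProperty("sun.arch.data.model", "unknown");
        return System.getProperty("java.version")
            + " | " + System.getProperty("java.vm.name")
            + " | os.arch=" + System.getProperty("os.arch")
            + " | data model=" + dataModel + "-bit";
    }

    public static void main(String[] args) {
        System.out.println(describeJvm());
    }
}
```

Printing this from inside the benchmark itself avoids any doubt about which JVM an IDE picked to launch it.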




[jira] Commented: (LUCENE-1799) Unicode compression

2010-11-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932207#action_12932207
 ] 

Earwin Burrfoot commented on LUCENE-1799:
-

.. and not the Codec, as was suggested in the beginning.




[jira] Commented: (LUCENE-1799) Unicode compression

2010-11-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932206#action_12932206
 ] 

Earwin Burrfoot commented on LUCENE-1799:
-

Returning to this issue, right now the best place for this functionality seems 
to be a variant of CharTermAttribute?




[jira] Commented: (LUCENE-1799) Unicode compression

2011-02-06 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991170#comment-12991170
 ] 

DM Smith commented on LUCENE-1799:
--

Any idea as to when this will be released?


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org