[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1799:


Attachment: LUCENE-1799.patch

I optimized the surrogate case here, moving it into the 'prev' calculation.
Now we are faster than UTF-8 on average for encoding.

||encoding||time to encode 20 million strings (ms)||number of encoded bytes||
|UTF-8|1,756|596,516,000|
|BOCU-1|1,724|250,202,000|
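
The figures in the table can be turned into per-string averages and ratios. The following is a small arithmetic sketch using only the numbers quoted above; the class and method names are made up for illustration.

```java
import java.util.Locale;

// Derives averages and ratios from the table above.
// All constants are copied from the table; nothing here is measured.
final class EncodingTableMath {
    static final long STRINGS = 20_000_000L;
    static final long UTF8_BYTES = 596_516_000L;
    static final long BOCU1_BYTES = 250_202_000L;
    static final long UTF8_MS = 1_756L;
    static final long BOCU1_MS = 1_724L;

    /** Average encoded size of one string, in bytes. */
    static double bytesPerString(long totalBytes) {
        return (double) totalBytes / STRINGS;
    }

    /** BOCU-1 output size as a fraction of UTF-8 output size. */
    static double sizeRatio() {
        return (double) BOCU1_BYTES / UTF8_BYTES;
    }

    /** BOCU-1 encode time as a fraction of UTF-8 encode time. */
    static double timeRatio() {
        return (double) BOCU1_MS / UTF8_MS;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "UTF-8:  %.2f bytes/string%n", bytesPerString(UTF8_BYTES));
        System.out.printf(Locale.ROOT, "BOCU-1: %.2f bytes/string%n", bytesPerString(BOCU1_BYTES));
        System.out.printf(Locale.ROOT, "size ratio: %.3f, time ratio: %.3f%n", sizeRatio(), timeRatio());
    }
}
```

That works out to roughly 29.8 versus 12.5 bytes per string, so the BOCU-1 output is about 42% of the UTF-8 size, while the encode times differ by less than 2%.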

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch


 In lucene-1793, there is the off-topic suggestion to provide compression of 
 Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
 original supposition was that it provided a more compact index.
 This led to the comment that a different or compressed encoding would be a 
 generally useful feature. 
 BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
 with an implementation in ICU. If Lucene provides its own implementation, a 
 freely available, royalty-free license would need to be obtained.
 SCSU is another Unicode compression algorithm that could be used. 
 An advantage of these methods is that they work on the whole of Unicode. If 
 that is not needed, an encoding such as ISO-8859-1 (or whatever covers the 
 input) could be used.
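
To illustrate the last point, here is a minimal sketch that compares encoded sizes using only charsets from the JDK's `StandardCharsets`. The class name and the sample strings are assumptions chosen for the example, not anything from the issue.

```java
import java.nio.charset.StandardCharsets;

// Compares UTF-8 and ISO-8859-1 sizes for text that fits in Latin-1.
final class EncodedSizeDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int latin1Length(String s) {
        // Note: getBytes silently substitutes '?' for unmappable characters,
        // so call fitsLatin1 first if losing data is not acceptable.
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    /** True if every character of s can be represented in ISO-8859-1. */
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String german = "Gr\u00f6\u00dfe";           // two non-ASCII letters
        String russian = "\u043c\u0438\u0440";       // Cyrillic, outside Latin-1
        System.out.println(utf8Length(german) + " vs " + latin1Length(german));
        System.out.println("Cyrillic fits Latin-1? " + fitsLatin1(russian));
    }
}
```

A single-byte charset only helps when it covers the whole input; the Cyrillic sample shows why a Russian analyzer would need a different single-byte charset than ISO-8859-1.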

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1799:


Attachment: LUCENE-1799.patch

oops, forgot a check in the surrogate case.
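
The comment does not say which check was missing. As a rough illustration of the kind of check a surrogate path needs, the sketch below walks a UTF-16 `char[]`, combines a high surrogate with the following low surrogate only when one is actually present and in bounds, and records the difference from the previous code point. This is not the patch code: the names are hypothetical, and it uses the raw previous code point rather than BOCU-1's adjusted 'prev' state.

```java
import java.util.Arrays;

// Simplified difference calculation over code points (not real BOCU-1).
final class CodePointDeltas {
    /** Returns cp[i] - cp[i-1] for each code point; the first delta is relative to 0. */
    static int[] deltas(char[] text, int offset, int length) {
        int[] out = new int[length];
        int count = 0;
        int prev = 0;
        int end = offset + length;
        for (int i = offset; i < end; i++) {
            int cp = text[i];
            // The bounds test and the low-surrogate test are the easy ones to forget.
            if (Character.isHighSurrogate(text[i])
                    && i + 1 < end
                    && Character.isLowSurrogate(text[i + 1])) {
                cp = Character.toCodePoint(text[i], text[i + 1]);
                i++; // consume the low surrogate as well
            }
            out[count++] = cp - prev;
            prev = cp;
        }
        return Arrays.copyOf(out, count);
    }

    public static void main(String[] args) {
        char[] text = "ab\uD835\uDD0A".toCharArray();
        System.out.println(Arrays.toString(deltas(text, 0, text.length)));
    }
}
```

An unpaired high surrogate at the end of the range is passed through as its own value instead of reading past the end of the slice.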

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch



[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1799:


Attachment: LUCENE-1799.patch

Here it is with a first stab at a decoder (it's correct against random ICU 
strings, but I haven't benchmarked it yet).
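
A randomized check of this kind might look roughly like the sketch below. Comparing against ICU's output would need ICU4J on the classpath, so this version only does a round-trip test (encode, decode, compare) and uses the JDK's UTF-8 as a stand-in codec. The class name, the 20-code-point limit and the functional-interface shape are all assumptions for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.function.Function;

// Round-trip test over random Unicode strings for any encode/decode pair.
final class RoundTripCheck {
    /** Builds a string of up to maxCodePoints random code points, skipping surrogates. */
    static String randomUnicodeString(Random random, int maxCodePoints) {
        int remaining = random.nextInt(maxCodePoints + 1);
        StringBuilder sb = new StringBuilder();
        while (remaining > 0) {
            int cp = random.nextInt(Character.MAX_CODE_POINT + 1);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // lone surrogates are not valid scalar values
            }
            sb.appendCodePoint(cp);
            remaining--;
        }
        return sb.toString();
    }

    /** Returns how many random strings failed to survive encode followed by decode. */
    static int countFailures(Function<String, byte[]> encode,
                             Function<byte[], String> decode,
                             long seed, int iterations) {
        Random random = new Random(seed);
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            String original = randomUnicodeString(random, 20);
            if (!original.equals(decode.apply(encode.apply(original)))) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int failures = countFailures(
                s -> s.getBytes(StandardCharsets.UTF_8),
                b -> new String(b, StandardCharsets.UTF_8),
                42L, 10_000);
        System.out.println("round-trip failures: " + failures);
    }
}
```

Passing a fixed seed keeps a failing run reproducible, which matters when a bug only shows up for a rare input such as a supplementary character.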

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799_big.patch



[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1799:


Attachment: Benchmark.java

Attached is my benchmark for English text.

UTF-8: 15530ms
BOCU-1: 15687ms

Note: I use a Sun JVM 1.6.0_19 (64-bit).

Yonik, if you run this benchmark and find a problem with it, or it is slower on 
your machine, let me know your configuration, because I don't see the results 
you do.
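
The attached Benchmark.java is not reproduced in this thread. For readers who want to try a comparable measurement, the following is a minimal timing sketch, assuming a warm-up pass followed by a timed pass over a fixed set of strings. It only uses charsets available in the JDK; plugging in BOCU-1 would require an encoder from the patch or from ICU4J. All names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Rough encode-timing harness; not the Benchmark.java attached to the issue.
final class EncodeTimingSketch {
    /** Encodes every document 'rounds' times and returns the total bytes produced. */
    static long encodeAll(String[] docs, Charset charset, int rounds) {
        long totalBytes = 0;
        for (int r = 0; r < rounds; r++) {
            for (String doc : docs) {
                totalBytes += doc.getBytes(charset).length;
            }
        }
        return totalBytes;
    }

    /** Runs a warm-up pass so the JIT can compile the loop, then times a second pass. */
    static long timeMillis(String[] docs, Charset charset, int warmupRounds, int timedRounds) {
        encodeAll(docs, charset, warmupRounds);
        long start = System.nanoTime();
        long bytes = encodeAll(docs, charset, timedRounds);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        if (bytes < 0) {
            throw new IllegalStateException("unreachable; keeps the result in use");
        }
        return elapsedMillis;
    }

    public static void main(String[] args) {
        String[] docs = {"the quick brown fox", "jumps over", "the lazy dog"};
        long ms = timeMillis(docs, StandardCharsets.UTF_8, 100_000, 1_000_000);
        System.out.println("UTF-8: " + ms + "ms");
    }
}
```

Results from a harness like this depend heavily on the JVM version, flags and hardware, which is presumably why the comment above asks for the other machine's configuration.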

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: Benchmark.java, LUCENE-1779.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799_big.patch



[jira] Updated: (LUCENE-1799) Unicode compression

2010-07-28 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated LUCENE-1799:
-

Attachment: Benchmark.java

OK, hopefully the right Benchmark.java this time ;-)

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: Benchmark.java, Benchmark.java, Benchmark.java, 
 LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch



[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1799:
---

Attachment: LUCENE-1779.patch

Slightly more optimized version of BOCU-1 encode (but it's missing the hash 
variant).

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799_big.patch



[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1799:
---

Attachment: LUCENE-1799.patch

Duh -- that was some ancient wrong patch.  This one should be right!

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch



[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1799:
---

Attachment: LUCENE-1799.patch

Just inlines the 2-byte diff case.

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch



[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1799:
---

Attachment: LUCENE-1799.patch

Inlines/unwinds the 3-byte cases.  I think we can leave the 4 byte case as a 
for loop...
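
As a generic illustration of what inlining or unwinding a fixed-length case means, the sketch below writes the low bytes of an `int` either with a loop or with straight-line code for the 2- and 3-byte cases. It is not taken from the patch: the real encoder computes its lead and trail bytes differently, and every name here is hypothetical.

```java
import java.util.Arrays;

// Loop versus unrolled byte writes; illustrative only, not BOCU-1 output.
final class UnrollSketch {
    /** General case: writes the low 'count' bytes of value, most significant first. */
    static int writeLoop(int value, int count, byte[] dest, int pos) {
        for (int shift = (count - 1) * 8; shift >= 0; shift -= 8) {
            dest[pos++] = (byte) (value >>> shift);
        }
        return pos;
    }

    /** Unrolled 2-byte case: same output as writeLoop(value, 2, ...), no loop overhead. */
    static int write2(int value, byte[] dest, int pos) {
        dest[pos] = (byte) (value >>> 8);
        dest[pos + 1] = (byte) value;
        return pos + 2;
    }

    /** Unrolled 3-byte case: same output as writeLoop(value, 3, ...). */
    static int write3(int value, byte[] dest, int pos) {
        dest[pos] = (byte) (value >>> 16);
        dest[pos + 1] = (byte) (value >>> 8);
        dest[pos + 2] = (byte) value;
        return pos + 3;
    }

    public static void main(String[] args) {
        byte[] looped = new byte[3];
        byte[] unrolled = new byte[3];
        writeLoop(0x123456, 3, looped, 0);
        write3(0x123456, unrolled, 0);
        System.out.println(Arrays.equals(looped, unrolled));
    }
}
```

Keeping the rare longest case as a loop, as suggested above, trades a little speed on uncommon input for less code in the hot path.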

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799_big.patch



[jira] Updated: (LUCENE-1799) Unicode compression

2010-07-27 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1799:


Attachment: LUCENE-1799.patch

removed some ifs for the positive unrolled cases.

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799_big.patch



[jira] Updated: (LUCENE-1799) Unicode compression

2010-07-22 Thread Robert Muir (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Robert Muir updated LUCENE-1799:

Attachment: LUCENE-1799_big.patch

Attached is a really, really rough patch that sets BOCU-1 as the default
encoding.

Beware: it's a work in progress, and a lot of the patch is auto-generated
(Eclipse), so some things need to be reverted.

Most tests pass. The idea is to find bugs in tests etc. that abuse
BytesRef or assume UTF-8 encoding, things like that.
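
The kind of bug this is meant to flush out can be shown without Lucene at all. The sketch below, using plain JDK classes and hypothetical names, decodes term bytes as UTF-8 even though they were written in another encoding (ISO-8859-1 stands in for BOCU-1 here).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Shows what goes wrong when code hard-codes UTF-8 for term bytes.
final class Utf8AssumptionDemo {
    /** Buggy pattern: assumes the bytes are always UTF-8. */
    static String decodeAssumingUtf8(byte[] termBytes) {
        return new String(termBytes, StandardCharsets.UTF_8);
    }

    /** Safer pattern: decodes with whatever encoding actually produced the bytes. */
    static String decodeWith(byte[] termBytes, Charset actual) {
        return new String(termBytes, actual);
    }

    public static void main(String[] args) {
        String term = "Gr\u00f6\u00dfe";
        byte[] stored = term.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(term.equals(decodeAssumingUtf8(stored)));                       // false
        System.out.println(term.equals(decodeWith(stored, StandardCharsets.ISO_8859_1)));  // true
    }
}
```

Switching the default encoding makes such hidden assumptions fail loudly in tests instead of passing by coincidence.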

Unicode compression
---

Key: LUCENE-1799
URL: https://issues.apache.org/jira/browse/LUCENE-1799
Project: Lucene - Java
Issue Type: New Feature
Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch,
LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch


[jira] Updated: (LUCENE-1799) Unicode compression

[
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Uwe Schindler updated LUCENE-1799:
--

Attachment: LUCENE-1799.patch

Here the policed one :-)

In my opinion something is better than nothing. The patents are not violated
here, as we only use an abstract API and the string BOCU-1. You can use the
same code to encode in ISO-8859-1.
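
The patch itself is not shown here, but the idea of going through an abstract API with only a charset name can be sketched with the JDK's `java.nio.charset.Charset`. The class below is hypothetical. Whether the name "BOCU-1" resolves depends on a charset provider being installed; the assumption here is that ICU4J's charset module supplies one, and a stock JDK is expected to report it as unsupported.

```java
import java.nio.charset.Charset;

// Encodes text by charset name only; the codec comes from whatever provider is installed.
final class NamedCharsetEncoder {
    /** Throws UnsupportedCharsetException if no installed provider knows the name. */
    static byte[] encode(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "Gr\u00f6\u00dfe";
        System.out.println("ISO-8859-1: " + encode(text, "ISO-8859-1").length + " bytes");
        System.out.println("UTF-8: " + encode(text, "UTF-8").length + " bytes");
        // Assumption: only true when a provider such as ICU4J's charset jar is on the classpath.
        System.out.println("BOCU-1 available: " + Charset.isSupported("BOCU-1"));
    }
}
```

The calling code never references a specific algorithm, which is the property the comment above relies on.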

Unicode compression
---


[jira] Updated: (LUCENE-1799) Unicode compression

[
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Uwe Schindler updated LUCENE-1799:
--

Attachment: (was: LUCENE-1799.patch)

Unicode compression
---


[jira] Updated: (LUCENE-1799) Unicode compression

[
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Uwe Schindler updated LUCENE-1799:
--

Attachment: LUCENE-1799.patch

One more violation. Now it's correct!

Unicode compression
---


[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1799:
--

Attachment: LUCENE-1799.patch

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch



[jira] Updated: (LUCENE-1799) Unicode compression

[
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Uwe Schindler updated LUCENE-1799:
--

Attachment: LUCENE-1799.patch

The last one that could be used with any charset

Unicode compression
---


[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1799:
--

Attachment: (was: LUCENE-1799.patch)

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch



[jira] Updated: (LUCENE-1799) Unicode compression


 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1799:
--

Attachment: (was: LUCENE-1799.patch)

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch



[jira] Updated: (LUCENE-1799) Unicode compression