Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Ken Krugler
ore processing when calculating the character count, but that's a one-liner, right? -- Ken -- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

Re: Lucene does NOT use UTF-8.

2005-08-28 Thread Ken Krugler
ssing most list readers aren't too interested in the on-going discussion. If anybody else would like to be copied, send me an email. -- Ken -- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 530-470-9200 --

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler
ta. It's only the above two edge cases that create an interoperability problem. -- Ken -- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 530-470-9200

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler
..U+DFFF is defined as the range for the low (least significant) surrogate. -- Ken -- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 530-470-9200
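For readers following the surrogate discussion: a code point above U+FFFF is held in a Java String as a high surrogate (U+D800..U+DBFF) followed by a low surrogate (U+DC00..U+DFFF). The arithmetic can be sketched as below. The class and method names are illustrative only and are not taken from the thread or from Lucene.

```java
public class SurrogateSketch {
    // Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair.
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits: high surrogate
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits: low surrogate
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1D11E);       // MUSICAL SYMBOL G CLEF
        // Prints "D834 DD1E"
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1]));
        // The second unit falls in the low-surrogate range U+DC00..U+DFFF.
        System.out.println(Character.isLowSurrogate(pair[1]));
    }
}
```

The JDK offers the same split through Character.highSurrogate and Character.lowSurrogate (Java 7 and later); the manual version is shown only to make the ranges visible.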

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote: I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue. Anyone raising a hand? I could, but recent posts make me think this is heading towa

Re: Lucene and UTF-8

2005-08-29 Thread Ken Krugler
't think this test data exists, unfortunately. But it shouldn't be too hard to generate. -- Ken -- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 530-470-9200

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler
Ken Krugler wrote: The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for

Re: Lucene does NOT use UTF-8

2005-08-30 Thread Ken Krugler
On Monday 29 August 2005 19:56, Ken Krugler wrote: "Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data." But wouldn't UTF-16 mean 2 bytes per character? Yes, UTF-16 means two bytes p
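To make the quoted description concrete, here is a rough Java sketch of a string written as a VInt count of Java chars followed by the character data. It is not Lucene source code. It assumes the usual VInt layout (seven data bits per byte, low-order group first, high bit set when more bytes follow) and assumes each UTF-16 code unit is encoded on its own in one to three bytes, which is the behaviour this thread is discussing. All names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

public class LuceneStringSketch {
    // Variable-length int: 7 data bits per byte, low-order group first;
    // the high bit of a byte signals that another byte follows.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode each UTF-16 code unit independently (assumption, see lead-in).
    // U+0000 takes two bytes, and surrogates take three bytes each.
    static void writeChars(ByteArrayOutputStream out, String s) {
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c >= 0x01 && c <= 0x7F) {
                out.write(c);
            } else if (c <= 0x7FF) {               // also covers c == 0
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
    }

    // Length prefix counts Java chars, not bytes and not code points.
    static byte[] writeString(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, s.length());
        writeChars(out, s);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // One non-BMP code point: prefix says 2 (chars), payload is 6 bytes.
        byte[] clef = writeString("\uD834\uDD1E");
        System.out.println("prefix=" + clef[0] + " total bytes=" + clef.length);
    }
}
```

Because the prefix counts chars rather than bytes, a reader cannot skip a string without decoding it, which is the extra processing cost raised elsewhere in the thread.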

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler
D] Sent: Monday, August 29, 2005 4:24 PM To: java-dev@lucene.apache.org Subject: Re: Lucene does NOT use UTF-8. Ken Krugler wrote: The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read be

Re: Lucene does NOT use UTF-8

2005-08-30 Thread Ken Krugler
Daniel Naber wrote: On Monday 29 August 2005 19:56, Ken Krugler wrote: "Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data." But wouldn't UTF-16 mean 2 bytes per character? That doesn'

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler
BMP characters. Thanks, -- Ken -- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 530-470-9200

Re: UTF-8 and unit test failure for org.apache.analysis.ru.RussianStem in build with Kaffe

2005-09-22 Thread Ken Krugler
emason.org/lucene_reports_2005092001.tar.gz [3] - http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200509.mbox/[EMAIL PROTECTED] -- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 530-470-9200

Re: Lucene and UTF-8

2005-09-27 Thread Ken Krugler
ing written contains an embedded null or an extended (not in the BMP) Unicode code point. c. Old code is then used to read the index. It may still make sense to defer this change to 2.0, but it's not at the level of changing the format of an index file. -- Ken -
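The two cases named here, an embedded null and a code point outside the BMP, can be seen with the JDK alone. The sketch below does not use Lucene. It assumes that Java's own modified UTF-8 (as written by DataOutputStream.writeUTF) is a fair stand-in, since it differs from standard UTF-8 in the same two places. Class and method names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8EdgeCases {
    // Byte length under standard UTF-8.
    static int standardLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte length under Java's modified UTF-8, excluding the
    // 2-byte length prefix that writeUTF emits first.
    static int modifiedLength(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2;
    }

    public static void main(String[] args) {
        String[] samples = { "abc", "\0", "\uD834\uDD1E" };
        for (String s : samples) {
            // Plain ASCII agrees; the null and the non-BMP code point do not.
            System.out.println(standardLength(s) + " vs " + modifiedLength(s));
        }
    }
}
```

Everything else in the BMP encodes to the same bytes under both schemes, which is why only these two inputs matter when old code reads a newly written index.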

Re: Lucene and Latent Semantic Indexing

2005-11-26 Thread Ken Krugler
ut I if we could find a small team (and of course, a lead), I would love to contribute ... > Sebastian. -- Ken Krugler Krugle, Inc. +1 530-470-9200

Re: Analyzers, perfect hash, ICU

2006-01-11 Thread Ken Krugler
ucene, I found some posts to the mailing list a while back, but nothing definitive. FWIW, my experience w/Eclipse 3.1 was that trying to auto-create Eclipse projects using the Ant build file didn't work very well. So we wound up manually creating the project, setting up the classpath,

Re: TestIndexInput test failures on jdk 1.6/linux after r641303

2009-01-05 Thread Ken Krugler
-- Ken Krugler Krugle, Inc. +1 530-210-6378 "If you can

Use of Unicode data in Lucene

2009-02-25 Thread Ken Krugler
code data license and its compatibility with Apache 2.0. Does anybody know whether http://www.unicode.org/copyright.html creates an issue? What's the process for vetting a license? Or is this something I should be posting to a different list? Thanks, -- Ken

Re: I wanna contribute a Chinese analyzer to lucene

2009-04-16 Thread Ken Krugler
an then respond to your specific question. -- Ken -- Ken Krugler +1 530-210-6378

Re: How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"

2007-11-12 Thread Ken Krugler
OI: http://www.krugle.org/kse/files/svn/svn.apache.org/poi/src/java/org/apache/poi/poifs/storage/HeaderBlockReader.java On line 83. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 "If you can't find it, you can't fix it" --

Potential bug in SloppyPhraseScorer

2008-06-24 Thread Ken Krugler
ked at the code, and the bug isn't obvious. Plus I worry about the probability of introducing a new bug with any modification. If anybody who's touched this code has time to look at the issue and comment, that would be great! Thanks, -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378

Re: Hadoop RPC for distributed Lucene

2008-07-11 Thread Ken Krugler
-- Ken Krugler Krugle, Inc. +1 530-210-6378 "If you can't find it, you can't fix it"

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-13 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622432#action_12622432 ] Ken Krugler commented on LUCENE-1343: - Hi Robert, FWIW, the issues being discu

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622746#action_12622746 ] Ken Krugler commented on LUCENE-1343: - Hi Robert, So given that you and the Uni

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786712#action_12786712 ] Ken Krugler commented on LUCENE-1343: - Just to make sure this point doesn't

[jira] Commented: (LUCENE-826) Language detector

2010-01-24 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804285#action_12804285 ] Ken Krugler commented on LUCENE-826: I think Nutch (and eventually Mahout) plan to