Re: Lucene and UTF-8

2005-08-29 Thread Ken Krugler
Hi Marvin, I'm guessing that since I'm the one that cares most about interoperability, I'll have to volunteer to do the heavy lifting. Tomorrow I'll go through and survey how many and which things would need to change to achieve full UTF-8 compliance. One concern is that I think in order to m

Re: Lucene and UTF-8

2005-09-20 Thread Marvin Humphrey
Hello again, I've prepared a patch for IndexInput.java, and an accompanying patch for TestIndexInput.java. I figured I would submit them for discussion here before filing them via Jira. The patches are attached to this email; if I find that they get stripped by the listserv, I'll post t

Re: Lucene and UTF-8

2005-09-20 Thread Marvin Humphrey
I wrote:
> The patches are attached to this email; if I find that they get stripped by the listserv, I'll post them on a website.
They got stripped, so here are the links:
http://www.rectangular.com/downloads/IndexInput.patch
http://www.rectangular.com/downloads/TestIndexInput.patch
Marvin Hu

Re: Lucene and UTF-8

2005-09-20 Thread Marvin Humphrey
Greets, I don't see any junit tests which address IndexOutput directly. I'm going to create one unless someone points out a file or portion thereof that I've overlooked. Marvin Humphrey Rectangular Research http://www.rectangular.com/ -

Re: Lucene and UTF-8

2005-09-20 Thread Marvin Humphrey
I wrote... I don't see any junit tests which address IndexOutput directly. I'm going to create one unless someone points out a file or portion thereof that I've overlooked. It would be a lot easier to check the output of IndexOutput if JUnit could compare byte arrays. :| I can loop thr

Re: Lucene and UTF-8

2005-09-20 Thread Chris Lamprecht
import java.util.Arrays;
...
Arrays.equals(array1, array2);
On 9/21/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> I wrote...
> > I don't see any junit tests which address IndexOutput directly.
> > I'm going to create one unless someone points out a file or portion
> > thereof that I've ove
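
The suggestion above can be sketched as a small standalone program. This is only an illustration: the class name `ByteArrayCompare`, the helper `sameBytes`, and the sample bytes are hypothetical and do not come from the patches discussed in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of Lucene or its tests.
public class ByteArrayCompare {

    // True when both arrays have the same length and the same byte at every index.
    static boolean sameBytes(byte[] expected, byte[] actual) {
        return Arrays.equals(expected, actual);
    }

    public static void main(String[] args) {
        // U+00E9 is encoded in UTF-8 as the two bytes C3 A9.
        byte[] expected = {(byte) 0xC3, (byte) 0xA9};
        byte[] actual = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(sameBytes(expected, actual)); // prints "true"
    }
}
```

In a JUnit test the same check would typically be wrapped in `assertTrue(...)`, which avoids looping over the arrays by hand.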

Re: Lucene and UTF-8

2005-09-21 Thread Marvin Humphrey
On Sep 20, 2005, at 11:53 PM, Chris Lamprecht wrote:
> import java.util.Arrays;
> ...
> Arrays.equals(array1, array2);
Great, thank you, Chris. The patch for IndexOutput.java is done. It will now write valid UTF-8. Older versions of Lucene will not be able to read indexes written using this

Re: Lucene and UTF-8

2005-09-21 Thread Yonik Seeley
How does this patch work w.r.t. the length vint? It looks like the length is still the number of 16 bit java chars, but the encoding is now correct UTF-8?
-Yonik
Now hiring -- http://tinyurl.com/7m67g
On 9/21/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> > On Sep 20, 2005, at 11:53 PM, Chris

Re: Lucene and UTF-8

2005-09-21 Thread Marvin Humphrey
On Sep 21, 2005, at 12:25 PM, Yonik Seeley wrote:
> How does this patch work w.r.t. the length vint? It looks like the length is still the number of 16 bit java chars, but the encoding is now correct UTF-8?
Yes. As Ken Krugler pointed out to me, the issues can be separated. The length VInt
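
To make the distinction in this exchange concrete, here is a rough sketch of how the same string yields different "lengths" depending on the unit counted. The class `LengthUnits` and its method names are hypothetical; the sketch shows only standard Java string behaviour and makes no claim about what the patch itself does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not taken from the IndexOutput patch.
public class LengthUnits {

    // Number of 16-bit Java chars (UTF-16 code units).
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of bytes in the standard UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', U+00E9, and U+1D11E (outside the BMP, so a surrogate pair in Java).
        String s = "a\u00e9\uD834\uDD1E";
        System.out.println(utf16Units(s)); // 4
        System.out.println(utf8Bytes(s));  // 7
        System.out.println(codePoints(s)); // 3
    }
}
```

A reader given a length prefix therefore has to know which of these units it counts before it can tell how many bytes to consume.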

Re: Lucene and UTF-8

2005-09-25 Thread Otis Gospodnetic
Hello, > Perl development is going very well, by the way. On the indexing > side, I've got a new app going which solves both the index > compatibility issue and the speed issue, about which I'll make a > presentation in this forum after I flesh it out and clean it up. > > Well, I'm lying a

Re: Lucene and UTF-8

2005-09-27 Thread Ken Krugler
> Perl development is going very well, by the way. On the indexing side, I've got a new app going which solves both the index compatibility issue and the speed issue, about which I'll make a presentation in this forum after I flesh it out and clean it up. > Well, I'm lying a little. Th

Re: Lucene and UTF-8

2005-09-27 Thread Marvin Humphrey
On Sep 27, 2005, at 7:01 AM, Ken Krugler wrote:
> Just to clarify, an incompatibility will occur if:
> a. The new code is used to write the index.
> b. The text being written contains an embedded null or an extended (not in the BMP) Unicode code point.
> c. Old code is then used to read the index.
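
The conditions quoted above can be illustrated with a short sketch. It assumes that Java's "modified UTF-8", as written by `DataOutputStream.writeUTF`, is a reasonable stand-in for the older encoding; that is an assumption made for illustration, not something confirmed by the patches. The class `EncodingDiff` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Lucene.
public class EncodingDiff {

    // Java's modified UTF-8, used here as an ASSUMED stand-in for the older format.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // writeUTF prepends a two-byte length; drop it and keep the payload.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Standard UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True when the two encodings produce different bytes for the same text.
    static boolean encodingsDiffer(String s) {
        return !Arrays.equals(modifiedUtf8(s), standardUtf8(s));
    }

    public static void main(String[] args) {
        System.out.println(encodingsDiffer("plain ascii"));  // false
        System.out.println(encodingsDiffer("caf\u00e9"));    // false: BMP text matches
        System.out.println(encodingsDiffer("a\u0000b"));     // true: embedded null
        System.out.println(encodingsDiffer("\uD834\uDD1E")); // true: non-BMP code point
    }
}
```

Under that assumption, text without embedded nulls or non-BMP code points produces identical bytes either way, which is consistent with the incompatibility being limited to the cases listed in the quoted message.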