On 5/16/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
> On May 15, 2006, at 12:08 PM, steven shingler wrote:
> > Am I right in thinking Ferret should be able to read a Lucene
> > generated index no problem?
>
> That would be nice, but it is not currently the case because of
> Java's wacky "modified" UTF-8 serialization.  I've seen that plain
> ol' ASCII text indexes will be compatible, but once you put in some
> higher order characters things go askew.
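
To make the "modified" UTF-8 point concrete, here is a rough Ruby sketch of how Java's encoding differs from standard UTF-8, as far as I understand it. The helper name modified_utf8_bytes is made up for illustration; it is not part of Ferret or Lucene. It walks the string as UTF-16 code units, the way Java sees a string, so U+0000 becomes the two bytes C0 80 and any character outside the Basic Multilingual Plane becomes two 3-byte surrogate sequences instead of one 4-byte sequence.

```ruby
# Illustrative only: encode a string the way Java's "modified" UTF-8 does.
# modified_utf8_bytes is a hypothetical helper, not a Ferret or Lucene API.
def modified_utf8_bytes(str)
  bytes = []
  # Java strings are sequences of 16-bit chars, so walk UTF-16 code units.
  str.encode("UTF-16BE").unpack("n*").each do |unit|
    if unit >= 0x0001 && unit <= 0x007F
      # Plain ASCII (except NUL) is a single byte, same as standard UTF-8.
      bytes << unit
    elsif unit <= 0x07FF
      # Two-byte form; this branch also turns U+0000 into C0 80.
      bytes << (0xC0 | (unit >> 6)) << (0x80 | (unit & 0x3F))
    else
      # Three-byte form; surrogate halves land here, one at a time.
      bytes << (0xE0 | (unit >> 12)) \
            << (0x80 | ((unit >> 6) & 0x3F)) \
            << (0x80 | (unit & 0x3F))
    end
  end
  bytes
end

# Compare against Ruby's standard UTF-8 bytes for a few sample strings.
["abc", "\u00E9", "\u0000", "\u{1D11E}"].each do |s|
  modified = modified_utf8_bytes(s).map { |b| format("%02X", b) }.join(" ")
  standard = s.bytes.map { |b| format("%02X", b) }.join(" ")
  puts "modified: #{modified}  standard: #{standard}"
end
```

Note that the bytes only differ for U+0000 and for characters above U+FFFF. As far as I know, the Lucene index format also prefixes each string with its length counted in Java chars rather than in bytes, which would explain why a reader that expects byte counts goes wrong on any non-ASCII text; treat that part as my understanding rather than a spec quote.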

Hey guys,

What Erik said is exactly correct. Marvin Humphrey (author of
KinoSearch, a Perl port of Lucene) has submitted a patch to Lucene so
that non-Java ports of Lucene will be able to read Lucene indexes. At
the moment the patch slows Lucene down by about 25% (I think), so I'm
going to be working with him to improve its performance so that it can
one day be included in Lucene. Don't hold your breath though; it's
going to take us a while to get it in there. For now, I'd recommend
using pdftotext, as Jan already mentioned. I'm not sure what is
available on Windows, but I'm sure it would be trivial to write your
own pdftotext using Java's PDFBox and then call it from Ruby.
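
If it helps, calling pdftotext from Ruby could look roughly like the sketch below. It assumes a pdftotext binary (from xpdf or poppler) is on your PATH, and the method names pdftotext_command and pdf_to_text are just ones I picked for the example. Passing "-" as the output file asks pdftotext to write the text to stdout, and "-enc UTF-8" asks for UTF-8 output.

```ruby
require "open3"

# Build the argument list for pdftotext. "-" as the output file means stdout.
# Both method names here are hypothetical helpers for this example.
def pdftotext_command(pdf_path, binary = "pdftotext")
  [binary, "-enc", "UTF-8", pdf_path, "-"]
end

# Run pdftotext and return the extracted text, raising if the tool fails.
# Assumes the binary is installed; Open3 raises Errno::ENOENT if it is not.
def pdf_to_text(pdf_path, binary = "pdftotext")
  out, err, status = Open3.capture3(*pdftotext_command(pdf_path, binary))
  raise "pdftotext failed: #{err}" unless status.success?
  out
end
```

The string returned by pdf_to_text("some.pdf") can then be added to a Ferret index as an ordinary text field. A home-grown converter built on PDFBox could be swapped in through the binary argument, provided it accepts the same arguments, which is an assumption on my part.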

Cheers,
Dave

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
