On 5/16/06, Erik Hatcher wrote:
>
> On May 15, 2006, at 12:08 PM, steven shingler wrote:
> > Am I right in thinking Ferret should be able to read a Lucene
> > generated index no problem?
>
> That would be nice, but it is not currently the case because of
> Java's wacky "modified" UTF-8 serialization. I've seen that plain
> ol' ASCII text indexes will be compatible, but once you put in some
> higher order characters things go askew.

Hey guys,

What Erik said is exactly correct. Marvin Humphrey (author of KinoSearch, a Perl port of Lucene) has submitted a patch to Lucene so that non-Java ports of Lucene will be able to read Lucene indexes. The patch currently slows Lucene down by about 25% (I think??), so I'm going to be working with him to improve its performance so that it can one day be included in Lucene. Don't hold your breath though. It's going to take us a while to get it in there.

For now, I'd recommend using pdftotext as Jan already mentioned. I'm not sure what is available on Windows, but I'm sure it would be trivial to write your own pdftotext using Java's PDFBox and then call it from Ruby.

Cheers,
Dave
_______________________________________________
Ferret-talk mailing list
http://rubyforge.org/mailman/listinfo/ferret-talk
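
For anyone curious what the "modified" UTF-8 difference looks like at the byte level, here is a minimal Python sketch. It follows the character encoding documented for Java's DataOutput.writeUTF (leaving out the length prefix): strings are treated as UTF-16 code units, U+0000 is written as the two bytes C0 80, and characters outside the Basic Multilingual Plane are written as two 3-byte surrogates instead of one 4-byte sequence. The function name encode_modified_utf8 is made up for this illustration, and this is not Lucene's or Ferret's actual index code, just a way to see where the two encodings agree and where they part ways.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text per Java's "modified UTF-8" rules (no length prefix).

    Hypothetical helper for illustration only; not taken from Lucene or Ferret.
    """
    out = bytearray()
    # Java strings are sequences of UTF-16 code units, so split into those first.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        if 0x0001 <= unit <= 0x007F:
            # Plain ASCII (except NUL): one byte, same as standard UTF-8.
            out.append(unit)
        elif unit <= 0x07FF:
            # Two bytes. This branch also covers U+0000, which becomes C0 80.
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Three bytes. Surrogate halves land here and are encoded one by one.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


if __name__ == "__main__":
    samples = ["ferret", "\u00e9", "\x00", "\U0001D11E"]
    for s in samples:
        standard = s.encode("utf-8")
        modified = encode_modified_utf8(s)
        status = "same" if standard == modified else "DIFFERENT"
        print(repr(s), standard.hex(), modified.hex(), status)
```

ASCII text comes out byte-for-byte identical under both encodings, which fits Erik's observation that plain ASCII indexes are compatible. In this sketch the byte sequences only differ for NUL and for characters above U+FFFF. Separately, the number of UTF-16 code units stops matching the number of bytes as soon as any non-ASCII character appears, so a reader that assumes one kind of length when the writer recorded the other would also get out of step (whether that applies to a given index format is an assumption here, not something stated in this thread).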

