Hi Chris, I do not have a stats but I think the performance is reasonable. I use xpdf for PDF & wvWare for DOC. The size of my index is ~2GB (this is not limited to only pdf & doc). For avoiding memory problems, I have set an upperbound to the size of the documents that can be indexed. For example in my case I do not index documents if the size is more that 4MB. You could try something like that.
Thanks & Regards, George --- Chris Fraschetti <[EMAIL PROTECTED]> wrote: > Some of the tools listed use cmd line execs to > output a doc of some > sort to text and then I grab the text and add it to > a lucene doc, etc > etc... > > Any stats on the scalability of that? In large scale > applications, I'm > assuming this will cause some serious issues... > anyone have any input > on this? > > -Chris Fraschetti > > > On Thu, 09 Sep 2004 09:54:43 -0700, David Spencer > <[EMAIL PROTECTED]> wrote: > > Honey George wrote: > > > > > Hi, > > > I know some of them. > > > 1. PDF > > > + http://www.pdfbox.org/ > > > + http://www.foolabs.com/xpdf/download.html > > > - I am using this and found good. It even > supports > > > > My dated experience from 2 years ago was that (the > evil, native code) > > foolabs pdf parser was the best, but obviously > things could have changed. > > > > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html > > > > > various languages. > > > 2. word > > > + http://sourceforge.net/projects/wvware > > > 3. excel > > > + > http://www.jguru.com/faq/view.jsp?EID=1074230 > > > > > > -George > > > --- [EMAIL PROTECTED] wrote: > > > > > >>Anyone know of any reliable parsers out there > for > > >>pdf word > > >>excel or powerpoint? > > > > For powerpoint it's not easy. I've been using this > and it has worked > > fine util recently and seems to sometimes go into > an infinite loop now > > on some recent PPTs. Native code and a package > that seems to be dormant > > but to some extent it does the job. The file > "ppthtml" does the work. > > > > http://chicago.sourceforge.net/xlhtml > > > > > > > > >> > > >> > > > > > > > --------------------------------------------------------------------- > > > > > >>To unsubscribe, e-mail: > > >>[EMAIL PROTECTED] > > >>For additional commands, e-mail: > > >>[EMAIL PROTECTED] > > >> > > >> > > > > > > > > > > > > > > > > > > > > > > ___________________________________________________________ALL-NEW > Yahoo! Messenger - all new features - even more fun! > http://uk.messenger.yahoo.com > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: > [EMAIL PROTECTED] > > > For additional commands, e-mail: > [EMAIL PROTECTED] > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: > [EMAIL PROTECTED] > > For additional commands, e-mail: > [EMAIL PROTECTED] > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: > [EMAIL PROTECTED] > For additional commands, e-mail: > [EMAIL PROTECTED] > > ___________________________________________________________ALL-NEW Yahoo! Messenger - all new features - even more fun! http://uk.messenger.yahoo.com --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]