Hi all,

It seems that I need to be able to search PDFs.
I'm trying to hook in LucenePDFDocument.getDocument(pdfurl), where the LucenePDFDocument class comes from pdfbox.org. LucenePDFDocument.getDocument(pdfurl) returns a Lucene Document.

Where would be the best place in the Nutch classes to do this? I think I'm looking for the bit where the HTML is brought in, parsed, and the text written to the various Document fields.

I was going to add something like:

- if it's HTML, do what you were doing before
- if it's a PDF, call LucenePDFDocument.getDocument(pdfurl), translate any field names where necessary to look like what Nutch wants, then hand that document back to Nutch for processing

Any ideas?

Thanks,
Skip

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
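
To make the HTML/PDF branching idea concrete, here is a minimal sketch of the dispatch and the field-name translation. It is plain Java with no Nutch or PDFBox dependency: the class name PdfDispatchSketch, the methods chooseParser and translateFields, and the field names in the rename table are all hypothetical placeholders, not names taken from either library. In a real hook, the "pdf" branch is where LucenePDFDocument.getDocument(pdfurl) would be called.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PdfDispatchSketch {

    // Hypothetical rename table: PDF-side field name -> name the indexer expects.
    // These names are illustrative guesses, not verified against PDFBox or Nutch.
    static final Map<String, String> FIELD_RENAMES = new HashMap<>();
    static {
        FIELD_RENAMES.put("contents", "content");
        FIELD_RENAMES.put("Title", "title");
    }

    // Returns a copy of the fields with known names translated; unknown names pass through.
    static Map<String, String> translateFields(Map<String, String> fields) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String name = FIELD_RENAMES.getOrDefault(e.getKey(), e.getKey());
            out.put(name, e.getValue());
        }
        return out;
    }

    // Picks a processing path from the content type, falling back to the URL suffix.
    static String chooseParser(String contentType, String url) {
        boolean pdfType = contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("application/pdf");
        boolean pdfUrl = url != null
                && url.toLowerCase(Locale.ROOT).endsWith(".pdf");
        // "pdf": the real hook would call LucenePDFDocument.getDocument(pdfurl) here.
        // "html": leave the existing HTML handling untouched.
        return (pdfType || pdfUrl) ? "pdf" : "html";
    }

    public static void main(String[] args) {
        System.out.println(chooseParser("text/html", "http://example.com/index.html"));
        System.out.println(chooseParser(null, "http://example.com/report.PDF"));

        Map<String, String> pdfFields = new HashMap<>();
        pdfFields.put("contents", "extracted text");
        pdfFields.put("url", "http://example.com/report.pdf");
        System.out.println(translateFields(pdfFields).get("content"));
    }
}
```

Checking the content type first and the URL suffix second is just one possible choice; a server can serve a PDF from a URL that does not end in .pdf, so the suffix alone would miss some documents.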
