Hi All,

It seems that I need to be able to search PDFs.

I'm trying to hook in:
LucenePDFDocument.getDocument(pdfurl)
where the LucenePDFDocument class comes from pdfbox.org.

LucenePDFDocument.getDocument(pdfurl) returns a Lucene Document.


Where would be the best place in the Nutch classes to do this?

I think I'm looking for the bit where the HTML is fetched, parsed, and
the text is written to the various Document fields.


I was going to add something like:

if it's HTML, do what you were doing before

if it's a PDF, call LucenePDFDocument.getDocument(pdfurl),
rename any fields where necessary to match what Nutch expects,
then hand that Document back to Nutch for processing.
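
Roughly what I have in mind, as a minimal sketch in plain Java so it runs without the PDFBox or Lucene jars. Every name here (ParseDispatchSketch, FieldExtractor, FIELD_RENAMES) and every field name in the rename map is a placeholder I made up for illustration, not something taken from the Nutch or PDFBox sources. In a real hook the PDF extractor would wrap LucenePDFDocument.getDocument(pdfurl) and copy the fields out of the Lucene Document it returns.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class ParseDispatchSketch {

    /** Hypothetical stand-in for "turn the content at this URL into named fields". */
    interface FieldExtractor {
        Map<String, String> extract(String url);
    }

    // Placeholder mapping from PDF-side field names to the names the indexer
    // expects. Both sides are assumptions; check the real field names used by
    // LucenePDFDocument and by Nutch before relying on any of them.
    static final Map<String, String> FIELD_RENAMES = new LinkedHashMap<>();
    static {
        FIELD_RENAMES.put("contents", "content");
        FIELD_RENAMES.put("Title", "title");
    }

    /** Returns a copy of the fields with known names translated; others pass through. */
    static Map<String, String> renameFields(Map<String, String> fields) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String name = FIELD_RENAMES.getOrDefault(e.getKey(), e.getKey());
            out.put(name, e.getValue());
        }
        return out;
    }

    /** Treats the content as PDF if the content type says so, else falls back to the URL suffix. */
    static boolean isPdf(String contentType, String url) {
        if (contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("application/pdf")) {
            return true;
        }
        return url != null && url.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** If it's a PDF, extract and rename; otherwise do what the HTML path did before. */
    static Map<String, String> parse(String contentType, String url,
                                     FieldExtractor html, FieldExtractor pdf) {
        if (isPdf(contentType, url)) {
            return renameFields(pdf.extract(url));
        }
        return html.extract(url);
    }
}
```

The HTML path is left untouched, and the PDF path only differs in the extractor it calls and the rename step, which is the part I'd expect to need adjusting once the real field names are known.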


Any ideas?

Thanks,

Skip