Re: PDF documents with "MoreLikeThis" class

mark harwood Thu, 20 Jul 2006 04:33:55 -0700

>>Do I have to extract text from PDF file and then pass an InputStream with the 
>>text inside? 
Yes. 
Although you could technically pass the content unparsed, it would contain a lot 
of unintelligible garbage in the form of markup and image data.
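
To illustrate the point about unparsed content, here is a small sketch that
uses only the JDK. It does not call Lucene or PDFBox, and the "PDF object" it
builds is a toy stand-in for a real file; the class and method names
(RawVersusExtracted, terms, fakePdfObject) are made up for this example. It
collects the words a simple tokenizer would see when reading the raw bytes
versus the already-extracted text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.Deflater;

// Hypothetical demo class: not part of Lucene or PDFBox.
class RawVersusExtracted {

    // Collects lower-cased runs of three or more letters from a Reader,
    // a rough stand-in for what a simple analyzer would tokenize.
    static Set<String> terms(Reader reader) throws IOException {
        Set<String> out = new TreeSet<>();
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isLetter(c)) {
                word.append((char) Character.toLowerCase(c));
            } else {
                if (word.length() >= 3) out.add(word.toString());
                word.setLength(0);
            }
        }
        if (word.length() >= 3) out.add(word.toString());
        return out;
    }

    // Builds a toy, PDF-like object: markup around a deflate-compressed
    // content stream. This is an assumption for illustration, not a valid PDF.
    static byte[] fakePdfObject(String pageText) {
        byte[] plain = ("BT (" + pageText + ") Tj ET")
                .getBytes(StandardCharsets.ISO_8859_1);
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        byte[] buf = new byte[plain.length + 64];
        int n = deflater.deflate(buf);
        deflater.end();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] head = ("4 0 obj\n<< /Filter /FlateDecode /Length " + n
                + " >>\nstream\n").getBytes(StandardCharsets.ISO_8859_1);
        byte[] tail = "\nendstream\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        out.write(head, 0, head.length);
        out.write(buf, 0, n);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("Lucene indexing and searching. ");
        }
        String pageText = sb.toString();

        // What a tokenizer sees if the file bytes are passed through as-is.
        byte[] raw = fakePdfObject(pageText);
        Set<String> fromRaw = terms(new InputStreamReader(
                new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1));

        // What it sees if the text has been extracted first.
        Set<String> fromText = terms(new StringReader(pageText));

        System.out.println("raw bytes      -> " + fromRaw);
        System.out.println("extracted text -> " + fromText);
    }
}
```

The raw version yields markup keywords and compression noise, while the
extracted version yields the words of the page. In real use the clean text
would come from an extractor such as PDFBox and then be handed to like() as
plain text.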


All Lucene classes deliberately try to avoid the mucky business of parsing 
specific document types.
This keeps the core engine tightly focused on indexing and searching 
without having to deal with the ever-changing range of document formats.



----- Original Message ----
From: Davide <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 20 July, 2006 10:41:03 AM
Subject: PDF documents with "MoreLikeThis" class

Hi,
I'm using the MoreLikeThis class to find similar documents, but I'm not
sure whether it is correct to pass a PDF file as the argument to the
*MoreLikeThis.like()* method.

To be clearer:

1) I add some PDF files to my Lucene index (I use PDFBox to extract the text
and add fields to the index)
2) Now I want to search for documents similar to a specific PDF file, and I
have the PDF file name (C:\\Example.pdf)


*My question is: What is the correct way to call the like() method when I
have to find similar PDF files?*

I use:
-------------------------------------------------------
MoreLikeThis mlt = new MoreLikeThis(indexReader);

Query query = mlt.like(new File("C:\\Example.pdf"));
-------------------------------------------------------

I'm not sure this is the correct way, because I think that if I pass a file to
the like() method, it expects to receive a text file and not a PDF
file, where the text is not visible...

Do I have to extract text from PDF file and then pass an InputStream
with the text inside? Or is my way OK?

Thanks for any suggestion,
Davide.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




