>> Do I have to extract text from PDF file and then pass an InputStream with the
>> text inside?

Yes. Although technically you could pass the content unparsed, it will contain a lot of unintelligible garbage in the form of markup and images.
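
As a rough illustration of the "extract first, then pass a stream" idea, here is a minimal sketch that uses only the JDK. The class name ExtractedTextStream and the helper asStream are hypothetical names of mine, the sample string stands in for whatever text your PDFBox step produces, and the UTF-8 encoding is an assumption (use whichever charset the consumer of the stream decodes with). The like() call is left as a comment because it depends on your own MoreLikeThis instance.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class ExtractedTextStream {

    // Hypothetical helper: wraps already-extracted plain text in a stream.
    // A ByteArrayInputStream is an InputStream, so it can be passed
    // wherever an InputStream is expected.
    static ByteArrayInputStream asStream(String extractedText) {
        // Assumption: UTF-8; match the charset the reader of the stream uses.
        return new ByteArrayInputStream(extractedText.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Stand-in for the text your PDF extraction step produced.
        String extractedText = "Plain text pulled out of the PDF by the extraction step";

        ByteArrayInputStream in = asStream(extractedText);

        // Assumption: 'mlt' is the MoreLikeThis instance from the original message.
        // Query query = mlt.like(in);

        System.out.println("Bytes ready for like(): " + in.available());
    }
}
```

The point is only that like() should see the same kind of plain text that went into the index, not the raw PDF bytes.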

All Lucene classes deliberately try to avoid the mucky business of parsing specific document types. This keeps the core engine tightly focused on indexing and searching, without having to deal with the ever-changing range of document formats.

----- Original Message ----
From: Davide <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, 20 July, 2006 10:41:03 AM
Subject: PDF documents with "MoreLikeThis" class

Hi,

I'm using the MoreLikeThis class to find similar documents, but I'm not sure whether it is correct to pass a PDF file as the argument to the *MoreLikeThis.like()* method.

To be clearer:
1) In my Lucene index I add some PDF files (I use PDFBox to extract the text and add fields to the index)
2) Now I want to search for documents similar to a specific PDF file, and I have the PDF file name (C:\\Example.pdf)

*My question is: what is the correct way to call the like() method when I have to find similar PDF files?*

I use:
-------------------------------------------------------
MoreLikeThis mlt = new MoreLikeThis(IndexReader);
Query query = mlt.like(*new File("C:\\Example.pdf")*);
-------------------------------------------------------

I'm not sure this is the correct way, because I think that if I pass a file to the like() method, it expects to receive a text file and not a PDF file, where the text is not visible...

Do I have to extract the text from the PDF file and then pass an InputStream with the text inside? Or is my way OK?

Thanks for any suggestion,
Davide.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
