Re: Pdf in Lucene?

Kalani Ruwanpathirana Thu, 04 Dec 2008 05:46:24 -0800

Hi Tiziano,

What error did you get? I think you can extract the text easily using the
code shown below.



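// Open the PDF and parse it with PDFBox.
FileInputStream fi = new FileInputStream(new File("sample.pdf"));

PDFParser parser = new PDFParser(fi);
parser.parse();
COSDocument cd = parser.getDocument();

// Extract the plain text from the parsed document.
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(new PDDocument(cd));

// Release the document and the underlying stream.
cd.close();
fi.close();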

Once you have the value of text, you can simply create the Lucene document.

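Document doc = new Document();
// "id" is stored so it can be read back from search results.
doc.add(new Field("id", "2", Field.Store.YES, Field.Index.TOKENIZED));
// "content" is indexed for searching but not stored; pass the extracted text.
doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));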
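
In case the Store/Index flags are the confusing part, here is a rough, self-contained sketch of what they mean. This is NOT Lucene code: ToyIndex and all of its methods are hypothetical names I made up for illustration, and it uses only the JDK. It assumes a tokenized field is split into lowercase words for searching, and that only fields added with store = true can be read back afterwards (which is how Field.Store.YES and Field.Store.NO behave).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for an index. Hypothetical class, not part of Lucene or PDFBox.
class ToyIndex {
    // Per document: field name -> original value, only for fields added with store = true.
    private final List<Map<String, String>> storedFields = new ArrayList<Map<String, String>>();
    // "field:term" -> numbers of the documents that contain the term in that field.
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // Starts a new, empty document and returns its number.
    int newDocument() {
        storedFields.add(new HashMap<String, String>());
        return storedFields.size() - 1;
    }

    // Splits the value into lowercase words and indexes them (roughly the TOKENIZED case).
    // The original value is kept only when store is true (roughly Store.YES vs Store.NO).
    void addField(int doc, String name, String value, boolean store) {
        for (String term : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            String key = name + ":" + term;
            Set<Integer> docs = postings.get(key);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(key, docs);
            }
            docs.add(doc);
        }
        if (store) {
            storedFields.get(doc).put(name, value);
        }
    }

    // Returns the numbers of the documents whose field contains the given word.
    Set<Integer> search(String field, String term) {
        Set<Integer> docs = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    // Returns the stored value of a field, or null if the field was not stored.
    String stored(int doc, String field) {
        return storedFields.get(doc).get(field);
    }
}
```

With this toy, adding "id" with store = true and "content" with store = false gives a document you can find by searching for a word in "content", but you can only read back "id" from the hit. That is why the id field above is stored: you use it to find the original PDF again.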


On Thu, Dec 4, 2008 at 6:20 PM, tiziano bernardi <[EMAIL PROTECTED]> wrote:

>
> Thanks, very kind ...
> I've tried that code, but it does not work for me ...
> Could you please send me a simple working class that uses it?
> Thanks
>
> > Date: Thu, 4 Dec 2008 15:19:26 +0530
> > From: [EMAIL PROTECTED]
> > To: [email protected]
> > Subject: Re: Pdf in Lucene?
> >
> > Hi,
> >
> > In my case I used PDFBox, just to extract the text from PDF document and
> > then I created the Lucene document giving the extracted text. (I didn't use
> > the PDFBox built in Lucene search engine). So I didn't get any
> > incompatibility problems.
> >
> > This blog post shows the way.
> > http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html
> >
> > It worked perfect for me.
> >
> > Thanks.



-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa
