Hi,
I use the following code to extract the text from my PDF files before adding
it to the Lucene index:
public Reader extractText(InputStream stream,
                          String type,
                          String encoding) throws IOException {
    try {
        PDFParser parser = new PDFParser(new BufferedInputStream(stream));
        try {
            parser.parse();
            PDDocument document = parser.getPDDocument();
            // Write the text of the whole document into an in-memory buffer.
            CharArrayWriter writer = new CharArrayWriter();
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setLineSeparator("\n");
            stripper.writeText(document, writer);
            return new CharArrayReader(writer.toCharArray());
        } finally {
            // Always release the parsed document, even if extraction failed.
            try {
                PDDocument doc = parser.getPDDocument();
                if (doc != null) {
                    doc.close();
                }
            } catch (IOException e) {
                // ignore failures while closing
            }
        }
    } catch (Throwable e) {
        // On any failure, log it and return empty content instead.
        logger.log(Level.WARNING, "Failed to extract PDF text content", e);
        return new StringReader("");
    } finally {
        stream.close();
    }
}
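
If you want to look at what was extracted before it goes into the index, a
small helper along these lines can drain the returned Reader into a String.
This is only a sketch: ReaderText and readAll are names I made up for
illustration, and the StringReader in main is a stand-in for the Reader that
extractText returns. Since extractText returns an empty reader on failure, an
empty result means either the extraction failed or the PDF has no text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of PDFBox or Lucene.
class ReaderText {

    /** Reads everything from the given Reader into a String, then closes it. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: Reader extracted = extractText(stream, type, encoding);
        Reader extracted = new StringReader("First page\nSecond page");
        String text = readAll(extracted);
        System.out.println(text.isEmpty() ? "no text extracted" : text);
    }
}
```

Keep in mind that a Reader can only be consumed once, so if you drain it for
debugging you need to call extractText again (or wrap the String in a new
StringReader) before handing the content to the index.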
2008/12/10 NiTiN <[EMAIL PROTECTED]>
> Hi,
>
> I don't know how to extract all the content of a given PDF file using PDFBox.
> Please point me in the right direction.
>
>
> Thank you,
> NiTiN
>
--
Kind regards
Daniel Manzke