extracting Polish characters

Piotr Rychlik Thu, 08 Apr 2010 13:07:15 -0700

Hi,

I have a problem extracting plain text from PDF documents that contain 
Polish characters.
I am using the following approach to extract the text:
 ......
 File f = new File(fileName);

 // Parse the PDF file into a low-level COS document
 PDFParser parser = new PDFParser(new FileInputStream(f));
 parser.parse();

 // Wrap the COS document and extract its text
 COSDocument cosDoc = parser.getDocument();
 PDFTextStripper pdfStripper = new PDFTextStripper();
 PDDocument pdDoc = new PDDocument(cosDoc);
 String parsedText = pdfStripper.getText(pdDoc);
 ......

parsedText is then written to a file using UTF-8 encoding.
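
A minimal sketch of such a write step, using only the Java standard library. 
The class name Utf8TextWriter, the method writeUtf8, and the sample string are 
illustrative assumptions, not the exact code from my application:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8TextWriter {
    // Hypothetical helper: encodes the text as UTF-8 and writes it to the target file.
    public static void writeUtf8(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the extracted text; contains Polish letters such as U+0142 (crossed l).
        String parsedText = "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144";
        Path out = Files.createTempFile("extracted", ".txt");
        writeUtf8(parsedText, out);

        // Read the file back as UTF-8 and check that the round trip preserves the text.
        String readBack = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(parsedText));
    }
}
```

Naming the charset explicitly avoids depending on the platform default 
encoding, which may not be able to represent Polish letters.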

The above code works fine in most cases: text containing Polish characters is 
extracted correctly.
There are, however, some PDF files for which the above method does not work, 
and the Polish characters are replaced by other characters. For example, the 
Polish crossed l (ł) is replaced by %. Is there any way to fix this problem?
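
To rule out the file-writing step as the cause, the code points of the 
extracted string can be inspected before anything is written. The following is 
only a sketch; the class name CodePointDump and the method describe are 
hypothetical and not part of PDFBox:

```java
public class CodePointDump {
    // Hypothetical helper: renders each code point of the text as U+XXXX.
    public static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A correctly extracted crossed l shows up as U+0142.
        System.out.println(describe("z\u0142y")); // prints "U+007A U+0142 U+0079"
        // A percent sign in its place shows up as U+0025.
        System.out.println(describe("z%y"));      // prints "U+007A U+0025 U+0079"
    }
}
```

If U+0025 already appears in the dump of parsedText where U+0142 is expected, 
the substitution happens during extraction and not when the file is encoded.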

Regards,
Piotr Rychlik
