Re: extracting polish characters

Villu Ruusmann Thu, 08 Apr 2010 13:27:13 -0700

Hello there,

>
> I have a problem with extracting plain text from PDF documents that contain
> Polish characters.
> I am using the following approach to extract text:
>  ......
>
> The above code works fine in most cases. Text containing Polish characters is
> extracted correctly.
> There are, however, some PDF files for which the above method does not work.
> Polish characters are replaced, e.g. the Polish crossed l (ł) is replaced by %.
> Is there any way to fix this problem?
>


Your code looks fine to me, so that shouldn't be the problem. I
suspect that PDFBox is unable to decode these characters (i.e. the
problematic Polish characters fall outside the US-ASCII character
set), but we would need a sample PDF document to conduct a more
thorough investigation.

Could you open a JIRA issue and attach a sample PDF document there?
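
In the meantime, it may help to check whether the replacement already
happens in the extracted string itself, or only when the text is printed
or written out with a different encoding. The following is a minimal
sketch in plain Java, not part of PDFBox. The class and method names are
made up for illustration, and it assumes you already have the extracted
text in a String. It lists every character above US-ASCII as a Unicode
code point, so a correctly decoded ł would show up as U+0142, while a
literal % in the string would not be listed at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting extracted text; not a PDFBox class.
public class CodePointDump {

    /** Returns "U+XXXX" labels for every code point above US-ASCII (0x7F), in order. */
    public static List<String> nonAsciiCodePoints(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // Assumption: pass the extracted text as the first argument;
        // the default sample "z\u0142oty" contains one Polish crossed l.
        String extracted = args.length > 0 ? args[0] : "z\u0142oty";
        System.out.println(nonAsciiCodePoints(extracted));
    }
}
```

If the list is empty for a document that visibly contains Polish
characters, the characters were most likely lost during decoding rather
than on output, which would be worth mentioning in the JIRA issue.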


VR
