> Hello there,
>
> I have a problem with extracting plain text from PDF documents that contain
> Polish characters.
> I am using the following approach to extract text:
> ......
>
> The above code works fine in most cases. Text containing Polish characters is
> extracted correctly.
> There are, however, .pdf files for which the above method does not work.
> Polish characters are replaced. E.g. the Polish crossed l (ł) is replaced by %.
> Is there any way to fix this problem?
>
Your code looks fine to me, so that shouldn't be the problem. I suspect that PDFBox is unable to decode some characters in those files (i.e. the problematic Polish characters are outside the common US-ASCII character set), but we would need a sample PDF document to investigate this more thoroughly. Could you open a JIRA issue and attach a sample PDF document there?

VR

