[jira] Commented: (PDFBOX-212) PDF Document cut German Umlauts

JIRA Wed, 21 Apr 2010 02:07:18 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859261#action_12859261
 ]


Bernd Köster commented on PDFBOX-212:
-------------------------------------

The cause of this problem is the call to String.getBytes() without arguments, 
which converts the String using the platform's default Charset.

If you have a PDF encoded in Cp1252 and file.encoding is UTF8, you get ugly 
results when inserting äöüßÄÖÜ into the fields.
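
To illustrate the mismatch, here is a minimal sketch in plain Java. It does not 
use any PDFBox code. The class and method names (UmlautBytesDemo, encode, 
roundTrip) are made up for this example, and "windows-1252" is assumed to be 
the charset name your JRE uses for Cp1252. The umlauts are written as Unicode 
escapes so the source compiles the same way regardless of the compiler's 
encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
final class UmlautBytesDemo {
    // Assumption: the JRE provides Cp1252 under the name "windows-1252".
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encodes with an explicit charset instead of relying on file.encoding.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Simulates the mismatch: bytes are produced in one charset
    // and interpreted in another.
    static String roundTrip(String text, Charset written, Charset assumed) {
        return new String(text.getBytes(written), assumed);
    }

    public static void main(String[] args) {
        String umlauts = "\u00e4\u00f6\u00fc\u00df"; // a-umlaut, o-umlaut, u-umlaut, sharp s

        // getBytes() without arguments depends on the platform default.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default bytes:   " + umlauts.getBytes().length);

        // Explicit charsets give the same result on every platform.
        System.out.println("Cp1252 bytes:    " + encode(umlauts, CP1252).length);
        System.out.println("UTF-8 bytes:     "
                + encode(umlauts, StandardCharsets.UTF_8).length);

        // UTF-8 bytes read as Cp1252 no longer match the original text.
        String garbled = roundTrip(umlauts, StandardCharsets.UTF_8, CP1252);
        System.out.println("intact after mismatch: " + garbled.equals(umlauts));
    }
}
```

Passing the charset explicitly to getBytes(Charset) makes the result 
independent of file.encoding. Which charset is the right one for a given PDF 
field is a separate question that this sketch does not answer.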

The encoding of the PDF file should be stored somewhere. I have not found 
where yet.

Perhaps these hints help.



> PDF Document cut German Umlauts
> -------------------------------
>
>                 Key: PDFBOX-212
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-212
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Writing
>            Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1587745
> Originally submitted by kajiro on 2006-10-31 01:05.
> I use the class TextToPDF to create a PDF document
> from a text file. That works correctly with simple
> text. But when I use German umlauts in the text, like
> ä, ö, ü or ß, the PDF document cuts these letters.
> Attached is a sample document containing four words with
> incorrect umlauts!
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1587745&file_id=200742
> bsp.pdf (application/pdf), 958 bytes
> Umlauts are incorrect
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> To the anonymous poster, did you mean for both PDF links to be the same?
> Ben
> [comment on SourceForge]
> Originally sent by nobody.
> Logged In: NO 
> For PDF file, which contains accented Latin1
> characters:
>     http://acl.ldc.upenn.edu//P/P06/P06-2052.pdf
> I get a u with umlauts converted into "currency1u"
> (look at the first name on the first page).
> For the following file containing Japanese characters:
>      http://acl.ldc.upenn.edu//P/P06/P06-2052.pdf
> I get error:
>      java.io.IOException: Unknown encoding for 'H'
> I also can't seem to cut and paste the form.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
