[
https://issues.apache.org/jira/browse/PDFBOX-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259196#comment-13259196
]
Eric Leleu commented on PDFBOX-1279:
------------------------------------
Hi,
The PDF Reference says:
"... PDF can be entirely represented using byte values corresponding to the
visible printable subset of the ASCII character set, plus white space
characters such as space, tab, carriage return, and line feed characters. ASCII
is the American Standard Code for Information Interchange, a widely used
convention for encoding a specific set of 128 characters as binary numbers.
However, a PDF file is not restricted to the ASCII character set; it can
contain arbitrary 8-bit bytes,..."
So there is no recommended charset. However, instead of UTF-8, the default
one should be US-ASCII or ISO-8859-1.
The problem comes from the comment line just after the header line, which
contains at least 4 binary characters (code >= 128). As far as I remember, to
match binary characters in JavaCC we must describe them using the Unicode
notation (\uxxxx). With the charset Cp1252, the byte <9F> can't match the
token BINARY([\u0080-\u00FF]), because it is decoded to the Unicode
character \u0178, which lies outside that range. (See
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT)
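To make the mapping concrete, here is a minimal sketch using only the JDK. It
decodes the single byte 0x9F with Cp1252 and with ISO-8859-1. The class and
method names are made up for illustration and are not part of PDFBox or
Preflight.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252MappingDemo {
    /** Decodes one byte with the given charset and returns the resulting UTF-16 code unit. */
    static int decodeSingleByte(int byteValue, Charset charset) {
        String decoded = new String(new byte[] {(byte) byteValue}, charset);
        return decoded.charAt(0);
    }

    public static void main(String[] args) {
        // "windows-1252" is the canonical Java name for Cp1252.
        Charset cp1252 = Charset.forName("windows-1252");
        // Cp1252 maps 0x9F outside the 0080-00FF range; ISO-8859-1 keeps it inside.
        System.out.printf("Cp1252:     U+%04X%n", decodeSingleByte(0x9F, cp1252));
        System.out.printf("ISO-8859-1: U+%04X%n",
                decodeSingleByte(0x9F, StandardCharsets.ISO_8859_1));
    }
}
```

This prints U+0178 for Cp1252 and U+009F for ISO-8859-1, which is why the same
byte matches the BINARY token under one charset and not under the other.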
So we have 3 possibilities:
[1] - Find a way to specify binary characters without Unicode notation in JavaCC
[2] - Add all Unicode exceptions for Cp1252 to the BINARY token description
[3] - Update the BINARY token to [\u0080-\uFFFF] to avoid other
charset-specific mappings.
I prefer the first one, but if we can't do it, the third one is probably the
best way to avoid further issues.
With the third option, my whole test set ran successfully with the following
encodings:
- US-ASCII
- Cp1252
- ISO-8859-1
- utf8
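
As a rough illustration of why the widened range is charset-independent, the
sketch below decodes four sample bytes under each of these encodings and
checks every decoded character against both ranges. A plain range check
stands in for the JavaCC token here, the sample bytes are illustrative (not
taken from the attached PDF), and all names are hypothetical.

```java
import java.nio.charset.Charset;

final class BinaryTokenRangeDemo {
    // Illustrative header-comment bytes, all >= 128; an assumption, not the real file content.
    static final byte[] SAMPLE = {(byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0x9F};

    /** Returns true if every char decoded from the bytes lies in [lo, hi]. */
    static boolean allInRange(byte[] bytes, String charsetName, char lo, char hi) {
        // new String(...) replaces undecodable input with U+FFFD instead of throwing.
        String decoded = new String(bytes, Charset.forName(charsetName));
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (c < lo || c > hi) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"US-ASCII", "Cp1252", "ISO-8859-1", "UTF-8"}) {
            System.out.printf("%-10s current=%b widened=%b%n", name,
                    allInRange(SAMPLE, name, '\u0080', '\u00FF'),
                    allInRange(SAMPLE, name, '\u0080', '\uFFFF'));
        }
    }
}
```

For this sample, the current range holds under ISO-8859-1 but fails under
Cp1252 because of U+0178, while the widened range accepts the decoded
characters under all four encodings.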
BR,
Eric
> Preflight reports "1.1 : Body Syntax error"
> -------------------------------------------
>
> Key: PDFBOX-1279
> URL: https://issues.apache.org/jira/browse/PDFBOX-1279
> Project: PDFBox
> Issue Type: Bug
> Components: Preflight
> Affects Versions: 1.7.0
> Environment: Win 7 64Bit, jre 1.6.31
> Reporter: beat weisskopf
> Priority: Minor
> Fix For: 1.7.0
>
> Attachments: input_pdf_a_lvl_a_libreoffice_352.pdf,
> pdfbox_1279_cs.patch
>
>
> Just tried the PDF/A Validation. It fails on the attached pdf with "1.1 :
> Body Syntax error". Adobe Preflight reports success for both pdf/a level a
> and pdf/a level b validation. PDF was created with plain LibreOffice 3.5.2
> (export as pdf, using pdf/a level a).