Non-ASCII characters messed up in AcroForm (PDFBox 1.8.4)

Pasi.Koski Fri, 11 Apr 2014 01:10:13 -0700

Hi.

I'm working on a Java server-side application that produces PDF forms
pre-filled by the application. These documents are delivered to the end
user via a browser interface, after which the end user continues to edit the
forms. Usually the forms are then printed by the end user or just saved
electronically. No additional processing of the user input by the application
is needed, although this may be a future scenario.


The problem is with displaying non-ASCII characters in editable fields. When
the data entered by the application in a form field contains non-ASCII
characters, they do not show up correctly once the document is opened in a PDF
viewer. However, when the field is selected, the content is displayed
correctly. If the data is changed, it continues to display correctly after
another field is selected, but if it is left unchanged, the non-ASCII
characters return to the messed-up state when the user moves out of the field.

I'm using PDFBox 1.8.4, but I had the same problem with the previous version 
(1.8.3). I have not tried earlier versions.

Can anyone tell me if non-ASCII characters are supposed to work properly in an
AcroForm field? What requirements does this place on the PDF template? Do I
need to encode the data before setting it as the value of the PDField? If so,
which encoding should I use?
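
To make the encoding question concrete, here is a small stand-alone sketch
(plain JDK, no PDFBox) that dumps the test string in the two byte forms that,
as far as I understand, a PDF text string can take: one byte per character
(PDFDocEncoding, which to my knowledge coincides with ISO-8859-1 for these
particular letters) and UTF-16BE with a leading FE FF byte-order mark. The
class name EncodingDump and its helper methods are hypothetical, and nothing
here claims to show what PDFBox does internally when setValue() is called.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDump {

    // Hypothetical helper: render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // One byte per character; assumes every character is in the Latin-1 range.
    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // UTF-16BE bytes prefixed with the byte-order mark FE FF.
    static byte[] utf16beWithBom(String s) {
        byte[] body = s.getBytes(StandardCharsets.UTF_16BE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFE;
        out[1] = (byte) 0xFF;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    public static void main(String[] args) {
        // Same characters as the test value: ÄÅÖöäå (escaped so the source
        // file encoding cannot interfere).
        String value = "\u00C4\u00C5\u00D6\u00F6\u00E4\u00E5";
        System.out.println(hex(latin1(value)));
        // prints: C4 C5 D6 F6 E4 E5
        System.out.println(hex(utf16beWithBom(value)));
        // prints: FE FF 00 C4 00 C5 00 D6 00 F6 00 E4 00 E5
    }
}
```

Comparing these byte sequences with the /V entry of the field in the saved PDF
might help tell whether the stored value itself is wrong or only the way it is
rendered.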

Below is a simplified code sample of what I'm doing, end to end. I've
tried various alternatives for setting the encoding of the field value,
and I've tried to control the font setting via the DA dictionary
parameter, but with no success. In most cases the read-only value turned out
invisible, while selecting the field would display the data correctly.

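//MyPdfCreator:
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

// Load the form template from the classpath.
String TEMPLATE_NAME = "Form_13349A.pdf";
InputStream is = this.getClass().getClassLoader().getResourceAsStream(TEMPLATE_NAME);
PDDocument pdfTemplate = PDDocument.load(is);

// Look up the text field and pre-fill it with a non-ASCII value.
PDDocumentCatalog docCatalog = pdfTemplate.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField("Field1");
String valueWithNonAsciiChars = "ÄÅÖöäå";
field.setValue(valueWithNonAsciiChars);

// Serialize the filled form into a byte array.
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
pdfTemplate.save(byteArrayOutputStream);
pdfTemplate.close();
byte[] pdf = byteArrayOutputStream.toByteArray();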

//MyHttpRequestHandler:
ByteArrayOutputStream baos = new ByteArrayOutputStream(pdf.length);
baos.write(pdf, 0, pdf.length);
resourceResponse.setContentType("application/pdf");
resourceResponse.addProperty(HttpHeaders.CONTENT_DISPOSITION,
        "attachment; filename=Form_13349A.pdf");
resourceResponse.setContentLength(baos.size());
OutputStream out = resourceResponse.getPortletOutputStream();
baos.writeTo(out);
out.flush();
out.close();

Every hint I've found on the Internet suggests that it's a font-related problem.
But frankly, it seems like PDFBox is messing up the text field properties while
setting the value. I found a couple of descriptions matching my problem, but no
solution. Issue PDFBOX-283 seems to describe the same problem, and there is
even a patch attached, but apparently the fix has other unwanted side
effects; otherwise, why was it not added to the latest version? I have not
tested the patch yet, but I probably will shortly.
https://issues.apache.org/jira/browse/PDFBOX-283

As a temporary fix, I was able to produce a successful result by editing the 
template PDF, by setting the Custom Format Script (that's what Adobe XI calls 
it) of the field like so:

var txtField = event.target;
txtField.textFont = font.Helv;
txtField.textColor = color.black;

HOWEVER, this only works with Adobe Reader, not with the built-in viewers in
Chrome or Firefox. Plus, this is not a very nice fix, since it requires the PDF
template designer to remember to copy the script into the Custom Format Script
entry for each and every field in each and every PDF template. Most importantly
though, the solution should support every major PDF viewer.


Help would be very much appreciated!

Pasi Koski
