[jira] [Comment Edited] (PDFBOX-3255) Reasonable way to handle missing characters in font

Christian Brandt (JIRA) Wed, 02 Mar 2016 09:39:22 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176044#comment-15176044 ]


Christian Brandt edited comment on PDFBOX-3255 at 3/2/16 5:37 PM:
------------------------------------------------------------------

Hi!

I ended up with the following routine:

{code}
private void setStringValue(PDField field, String input) throws Exception
{
        /* Extract font name */
        String  da      = field.getCOSObject().getString(COSName.DA.getName());
        Matcher m       = Pattern.compile("/?(.*) [\\d]+ Tf.*", Pattern.CASE_INSENSITIVE).matcher(da);
        String  name    = m.find() ? m.group(1) : null;
        PDFont  font    = field.getAcroForm().getDefaultResources().getFont(COSName.getPDFName(name));

        if (font instanceof PDSimpleFont)
        {
                /* Walk through the used characters and replace those that cannot be represented by the font with a space */
                StringBuilder value = new StringBuilder();

                Encoding encoding = ((PDSimpleFont) font).getEncoding();

                for (int i = 0; i < input.length(); i++)
                {
                        char c = input.charAt(i);

                        if (!".notdef".equals(encoding.getName(c)))
                                value.append(c);
                        else
                                value.append(' ');
                }

                field.setValue(value.toString());
        }
        else
                field.setValue(input);
}
{code}
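
A minimal sketch of how the per-character check in the loop above could be remembered for recurring characters. The class name CachedReplacer and the IntPredicate handed to it are hypothetical and not part of PDFBox; the predicate only stands in for whatever test decides that a character can be encoded (in the routine above, the ".notdef" comparison), so that the sketch runs without the library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

/* Hypothetical helper, not PDFBox API: replaces characters that fail the
   supplied check with a space and remembers the verdict per character. */
final class CachedReplacer
{
        /* Stand-in for the real encoding check (an assumption of this sketch) */
        private final IntPredicate encodable;

        /* Verdicts for characters that have already been checked */
        private final Map<Character, Boolean> cache = new HashMap<>();

        CachedReplacer(IntPredicate encodable)
        {
                this.encodable = encodable;
        }

        String replaceMissing(String input)
        {
                StringBuilder value = new StringBuilder(input.length());

                for (int i = 0; i < input.length(); i++)
                {
                        char c = input.charAt(i);

                        /* The check runs at most once per distinct character */
                        boolean ok = cache.computeIfAbsent(c, k -> encodable.test(k));

                        value.append(ok ? c : ' ');
                }

                return value.toString();
        }
}
```

One instance per font would be needed, because the cached verdicts are only valid for the encoding they were computed against.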

Despite the obvious performance issues, this seems to work, at least with the 
test cases I tried. However:

1. It would be nice to use 
PDVariableText.getDefaultAppearanceString().getFont() to get the associated 
font instead of parsing the name manually and then fetching it from the 
resources, but the method is not accessible. As it stands, I am not sure that 
my regex covers all possible cases.
2. Because Encoding.contains('\u00AD') may return true (the value ".notdef" 
seems to be stored for it), a string comparison is required, which is not 
nice. The caller can of course optimize this a bit with a lookup for recurring 
characters, but it would make life easier if we could get rid of the string 
comparison altogether. Or at least it would be nice to be able to refer to 
Encoding.NOTDEF or something similar instead of hardcoding the value in the 
caller code, so that the code does not break should someone decide to change 
the value of the constant in the future.
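
On point 1, a default appearance string may carry other operators before the font, for example "0 g /Helv 12 Tf". With the pattern in the routine, group 1 for that input comes out as "0 g /Helv" rather than "Helv", and a fractional size such as "10.5" is not matched at all. The following is only a sketch of a stricter pattern, not the library's own parser: DaFontName is a hypothetical name, and the sketch assumes the font name contains no #xx escape sequences, which it does not decode.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/* Hypothetical helper, not PDFBox API: pulls the font resource name out of
   a default appearance (DA) string. */
final class DaFontName
{
        /* '/' + name token (no whitespace or PDF delimiters), then a numeric
           size operand (integer or decimal), then the case-sensitive Tf operator */
        private static final Pattern TF = Pattern.compile(
                "/([^\\s/\\[\\]()<>{}%]+)\\s+[+-]?(?:\\d+\\.?\\d*|\\.\\d+)\\s+Tf\\b");

        /* Returns the name without the leading slash, or null if no Tf operator is found */
        static String extract(String da)
        {
                if (da == null)
                        return null;

                Matcher m = TF.matcher(da);
                return m.find() ? m.group(1) : null;
        }
}
```

Returning null when nothing matches leaves it to the caller to decide whether to fall back to setting the value unchanged.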


> Reasonable way to handle missing characters in font
> ---------------------------------------------------
>
>                 Key: PDFBOX-3255
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3255
>             Project: PDFBox
>          Issue Type: Wish
>          Components: AcroForm
>    Affects Versions: 2.0.0
>            Reporter: Christian Brandt
>              Labels: newbie
>         Attachments: TEST.pdf
>
>
> Hello,
> We have an issue with setting form field values if the input contains 
> characters that cannot be rendered with the associated font. The system 
> throws an exception similar to:
> java.lang.IllegalArgumentException: U+0308 ('dieresiscmb') is not available 
> in this font's encoding: MacRomanEncoding with differences
> This is currently hard to handle outside the framework because, as far as I 
> understand (please correct me if I'm wrong), the caller has no way to figure 
> out which font will eventually be used and therefore which characters are 
> not renderable.
> What we would ultimately like is for the library to optionally replace 
> unrenderable characters with some other existing character (e.g. space) 
> instead of failing the call, or to provide a way to recover from this error 
> so that the user can call the method again with altered input.



