Re: [discuss] "No glyph for U+... in font ...."

John Hewson Thu, 03 Nov 2016 13:10:17 -0700

> On 3 Nov 2016, at 12:16, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> 
>> 
>> Am 03.11.2016 um 17:33 schrieb John Hewson <j...@jahewson.com>:
>> 
>> 
>>> On 3 Nov 2016, at 02:11, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>> 
>>> Hi,
>>> 
>>> a very common issue on the mailing list as well as on SO is the 
>>> IllegalArgumentException people get when they try to use a character with a 
>>> font that has no glyph for it. Could we relax the requirement here and, 
>>> instead of throwing an exception, use a replacement char and log a warning? 
>>> Other ideas?
>> 
>> The existing design of PDFBox assumes that a unique Unicode char -> glyph 
>> mapping exists with respect to font embedding. This is because we don’t do 
>> any complex text layout, so it’s a safe assumption, for now. Note that the 
>> mapping of “missing Unicode char” -> .notdef is not unique if there’s more 
>> than one missing char, so there’s no simple fix for supporting missing 
>> characters, even with .notdef.
>> 
>> I don’t think we should be silently generating broken PDFs - nobody benefits 
>> from such a PDF. Dumping the problem on some unfortunate end user further 
>> down the line is not a solution :)
>> 
>> But we 100% should do something to improve this when we implement complex 
>> text layout, because that already requires that we abandon the concept of a 
>> unique char -> glyph mapping and replace it with something more 
>> sophisticated. One thing this will allow us to solve is the issue of missing 
>> whitespace glyphs such as tab and nbsp, because we’ll finally be able to map 
>> them to a space.
>> 
>> In hindsight, we really should have thrown a checked exception for missing 
>> glyphs, because IllegalArgumentException causes people to think that PDFBox 
>> is broken, when in fact the problem is with their code.
> 
> Can't we go for an IOException, as that's already part of the public API (at 
> least for most of the PDFont-related classes)?
> 
> PDFont has protected abstract byte[] encode(int unicode) throws IOException;
> PDType0Font has protected byte[] encode(int unicode) throws IOException;
> PDCIDFont has protected abstract byte[] encode(int unicode) throws 
> IOException;
> ...
> 
> PDCIDFontType2 has public byte[] encode(int unicode)
> 
> WDYT?


This feels like a misuse of the “IO” part of IOException. We do use IOException 
a lot, but to mean that parsing or writing failed. For example, encode throws an 
IOException because the font and encoding files it needs to read from might be 
bad or inaccessible, not because encoding might fail due to an unsupported 
input.

For me the most important thing is that we need users to be prepared to handle 
missing glyphs in their code, and a checked exception would certainly do that, 
but an IOException used to indicate bad input is going to get lost among all 
the other IOExceptions, which are used to indicate actual I/O or parsing errors.
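
To make the distinction concrete, here is a minimal sketch of what a dedicated 
checked exception could look like. Everything in it is hypothetical: 
MissingGlyphException, ToyFont and MissingGlyphDemo are made-up names for 
illustration, not PDFBox classes, and the two-byte encoding is arbitrary. The 
only point is that the caller is forced to handle a missing glyph and can tell 
it apart from a real I/O failure.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class MissingGlyphDemo {
    // Encodes text and reports which kind of failure occurred, if any.
    static String tryEncode(ToyFont font, String text) {
        try {
            int n = 0;
            for (int cp : text.codePoints().toArray()) {
                n += font.encode(cp).length;
            }
            return "encoded " + n + " bytes";
        } catch (MissingGlyphException e) {
            // Bad input: the caller's text does not fit the font.
            return "caller must handle: " + e.getMessage();
        } catch (IOException e) {
            // Actual I/O or parsing problem with the font data.
            return "I/O failure: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        ToyFont font = new ToyFont("ToySans", "abc ");
        System.out.println(tryEncode(font, "a cab"));
        System.out.println(tryEncode(font, "a\tb"));
    }
}

// Hypothetical checked exception; not part of the PDFBox API.
class MissingGlyphException extends Exception {
    final int codePoint;

    MissingGlyphException(int codePoint, String fontName) {
        super(String.format("No glyph for U+%04X in font %s", codePoint, fontName));
        this.codePoint = codePoint;
    }
}

// Toy stand-in for a font; it only illustrates the exception contract.
class ToyFont {
    private final String name;
    private final Map<Integer, Integer> glyphIds = new HashMap<>();

    ToyFont(String name, String supportedChars) {
        this.name = name;
        // Assign an arbitrary glyph id to every supported code point.
        supportedChars.codePoints().forEach(cp -> glyphIds.put(cp, glyphIds.size() + 1));
    }

    // IOException stays reserved for unreadable font data (never thrown in this toy);
    // unsupported input gets its own checked type.
    byte[] encode(int codePoint) throws IOException, MissingGlyphException {
        Integer gid = glyphIds.get(codePoint);
        if (gid == null) {
            throw new MissingGlyphException(codePoint, name);
        }
        return new byte[] { (byte) (gid >> 8), (byte) (gid & 0xFF) };
    }
}
```

With a separate type, the compiler forces a decision about missing glyphs at 
the call site, and that catch block cannot be confused with the one for broken 
font files.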

— John

> Maruan
> 
> 
>> 
>> — John
>> 
>>> I know that the current behaviour has been introduced to ensure a 
>>> consistent PDF but it might be a little difficult to handle. 
>>> 
>>> BR
>>> Maruan
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org 
>>> <mailto:dev-unsubscr...@pdfbox.apache.org>
>>> For additional commands, e-mail: dev-h...@pdfbox.apache.org 
>>> <mailto:dev-h...@pdfbox.apache.org>
>>> 
