Hi,

When I execute PDFStreamEngine.processStream(PDPage, PDResources, COSStream), I see very odd behaviour in the TextPosition objects. Every TextPosition that should be a single space instead consists of multiple characters (as returned by TextPosition.getCharacter), with the character codes 9, 13, 32, 160.
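One possible way to handle such values, sketched under assumptions: treat a string as a single space when every character in it is one of the four codes listed above (tab, carriage return, space, and 160, which is the no-break space U+00A0). The class and method names below are hypothetical, the code uses only the JDK and does not call PDFBox, and whether this rule is safe for other documents is not established here.

```java
public class SpaceRunNormalizer {

    // The four codes reported in the post: TAB (9), CR (13), SPACE (32), NO-BREAK SPACE (160).
    static boolean isSpaceLike(char c) {
        return c == 9 || c == 13 || c == 32 || c == 160;
    }

    // True when the string is non-empty and consists only of space-like characters.
    static boolean isSpaceRun(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!isSpaceLike(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Collapses a run of space-like characters to one space; leaves everything else untouched.
    static String normalize(String s) {
        return isSpaceRun(s) ? " " : s;
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by TextPosition.getCharacter in the problem case.
        String reported = new String(new char[] {9, 13, 32, 160});
        System.out.println("[" + normalize(reported) + "]");
        System.out.println("[" + normalize("a") + "]");
    }
}
```

The check is deliberately strict: a string that mixes a visible character with whitespace is returned unchanged, so real text is never dropped.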
When I step through the code that fills the font's cmap in the debugger, I see the byte array [0, 9, 0, 13, 0, 32, 0, -96], which is interpreted as a String with UTF-16BE encoding. Huh? -96?

Copying the text on Windows from Adobe Reader and pasting it into Notepad adds a newline at every space.

To reproduce: create a simple document in Word for Mac (newest version) using the font Cambria. The document contains only 'a a'. Save the document as PDF (via Save As).

With the font Verdana instead of Cambria, the problem does not occur. Doing the same in Word for Windows, the problem does not occur either. So my conclusion is that this is an issue with Word for Mac and the Cambria font. Can anyone confirm that?

Either way, my PDFBox code has to handle it correctly. What is a safe assumption? Can I safely assume that when TextPosition.getCharacter returns multiple characters, this can be ignored? Or should I look for the specific byte sequence ending in -96?

Kind regards,
Cornelis Hoeflake
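
For reference, the byte array quoted above can be decoded in isolation. Java bytes are signed, so -96 is the same bit pattern as 0xA0 (160), and decoding the array as UTF-16BE gives exactly the four character codes 9, 13, 32, 160. A minimal sketch using only the JDK; the class name CmapBytesDemo is made up for illustration and nothing here calls PDFBox:

```java
import java.nio.charset.StandardCharsets;

public class CmapBytesDemo {

    // Decodes raw bytes as big-endian UTF-16, two bytes per character.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // The byte array as shown by the debugger in the post.
        byte[] raw = {0, 9, 0, 13, 0, 32, 0, -96};
        String decoded = decode(raw);

        // Prints 9, 13, 32, 160 on separate lines.
        for (char c : decoded.toCharArray()) {
            System.out.println((int) c);
        }

        // The signed byte -96 and the unsigned value 160 are the same eight bits.
        System.out.println(-96 & 0xFF); // prints 160
    }
}
```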

