Hi,
Yushuang Hao <[email protected]> wrote on 11 July 2012 at 12:08:
> Dear Sir/Madam,
>
> I experienced two issues when I was using the PDFBOX 1.7.0 to convert the
> PDF to Text:
>
> Firstly, the PDF is purely in English but after conversion I get random CJK
> characters in it. I have figured out this as under UTF-8 the Latin
> character takes 1 bit ranging from 0x0000 to 0x00FF in Unicode, somehow the
> conversion randomly compressed two Latin characters together as a 2 bits
> CJK character. For example, I got "?" (0x5365) rather than getting
> "S"(0x0053) and "e"(0x0065). I don't know how this happened but I managed
> to convert this to the right ones.
>
> My second issue is in the same document the "?" was produced for where it
> should be 3,4,6,7,8,9,),* or %, see below example. Can you give me some
> hints how to solve this? Many thanks.

Hmm, that's not easy to say without having the PDF at hand. If you can share
the document in question with us, please create an issue on JIRA [1] and
attach the PDF to it.

>
> In PDF:
> TERM C1 EUR 591736DB6 LX038684 07-Jun-2016 Shadow Shadow 450.0 0.00 0.404
> 4.9040 0.00 0.00 462,025.59 462,025.59
>
> Conversion:
> 07-Jun-201?TERM C1 EUR 462,025.5?Shadow Shadow 0.00 0.40? 450.0
> 0.00591736DB? 4.9040 0.00 462,025.5?LX03868?

It looks like you are not using the sort option, are you?

>
> Kind regards,
> Yushuang
>
> --
>
> *Yushuang Hao*
> Codean
> King's Gate
> 1 Bravingtons Walk
> London, N1 9AE, UK
> [email protected]
>
> tel. +44 (0)20 3475 3548
> mob. +44 (0)7973 816 879
>
> www.codean.com

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX
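For reference, the workaround described for the first issue (one CJK character such as 0x5365 standing in for "S" 0x0053 and "e" 0x0065) could look roughly like the sketch below. This is not part of PDFBox and not necessarily what the original poster did. The class name `PackedLatin`, the method `unpack`, and the choice of the range 0x4E00 to 0x9FFF are assumptions made only for illustration, and the approach presumes the document really contains no genuine CJK text.

```java
// Hypothetical helper, not a PDFBox API: splits a CJK character back into
// the two Latin characters that were apparently packed into it.
final class PackedLatin {

    // Assumption: the text is purely English, so a CJK ideograph whose high
    // and low bytes are both printable ASCII is treated as two packed chars.
    static String unpack(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int hi = c >> 8;     // e.g. 0x53 for 0x5365
            int lo = c & 0xFF;   // e.g. 0x65 for 0x5365
            boolean cjk = c >= 0x4E00 && c <= 0x9FFF;
            if (cjk && isPrintableAscii(hi) && isPrintableAscii(lo)) {
                out.append((char) hi).append((char) lo);
            } else {
                out.append(c);   // leave everything else untouched
            }
        }
        return out.toString();
    }

    private static boolean isPrintableAscii(int b) {
        return b >= 0x20 && b <= 0x7E;
    }

    public static void main(String[] args) {
        System.out.println(unpack("\u5365arch")); // prints "Search"
    }
}
```

Note that this only hides the symptom after extraction. It would also mangle any real CJK character whose two bytes happen to be printable ASCII, so it is no substitute for finding the cause in the PDF itself.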

