[jira] [Updated] (PDFBOX-3499) PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF

Kaleb Akalework (JIRA) Thu, 15 Sep 2016 12:59:32 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-3499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kaleb Akalework updated PDFBOX-3499:
------------------------------------
    Attachment: UnicodeTest.pdf

> PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF
> ---------------------------------------------------------------------------
>
>                 Key: PDFBOX-3499
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3499
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.2
>            Reporter: Kaleb Akalework
>            Priority: Critical
>         Attachments: UnicodeTest.pdf, nihao2.pdf
>
>
> I'm trying to use PDFBox 2.0.2 to parse PDF files that contain Japanese and
> Chinese characters, but it does not parse them correctly. Every character
> that is extracted is replaced by the first character in the line. For
> example, if the document contains 早上好, the extracted text correctly has 3
> characters, but all 3 are 早 (早早早): the last two characters are replaced
> by the first. The same string is parsed correctly from a Word document. I
> was using this with Tika-13, which uses PDFBox 2.0.2. On the advice of Tim
> Allison (from Tika), I tried PDFBox 2.0.3 and still see the same problem.
> The following is the code I used.
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.io.RandomAccessFile;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> public class PDFBoxTesting {
>     private static PDFParser parser;
>     private static PDFTextStripper pdfStripper;
>     private static PDDocument pdDoc;
>     private static COSDocument cosDoc;
>     private static String Text;
>     private static String filePath;
>     private static File file;
>
>     public static String ToText() throws IOException {
>         pdfStripper = null;
>         pdDoc = null;
>         cosDoc = null;
>         filePath = "C:\\Users\\kaleba\\Desktop\\nihao2.pdf";
>         file = new File(filePath);
>         parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0
>         parser.parse();
>         cosDoc = parser.getDocument();
>         pdfStripper = new PDFTextStripper();
>         pdDoc = new PDDocument(cosDoc);
>         pdDoc.getNumberOfPages();
>         pdfStripper.setStartPage(1);
>         pdfStripper.setEndPage(10); // reading text from page 1 to 10
>         // if you want to get text from full pdf file use this code
>         // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
>         Text = pdfStripper.getText(pdDoc); // put breakpoint after executing getText.
>         pdDoc.close(); // release the underlying file once the text is extracted
>         return Text;
>     }
>
>     public static void main(String[] args) {
>         try {
>             ToText();
>         } catch (Exception e) {
>             e.printStackTrace(); // report the failure instead of silently swallowing it
>         }
>     }
> }
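
The symptom described above (every character in a line comes back as a copy of the first one, so 早上好 is extracted as 早早早) can be checked for mechanically when testing many files. The following is only a minimal sketch of such a check, not part of PDFBox or Tika: the class name RepeatedGlyphCheck and the method looksCollapsed are hypothetical, and the check is a heuristic, since a line that legitimately repeats one character would also be flagged. It uses Unicode escapes for the three characters from the report so that it compiles regardless of source encoding.

```java
// Hypothetical helper for spotting the reported symptom in extracted text.
// It is not part of PDFBox or Tika; the names are illustrative only.
final class RepeatedGlyphCheck {

    // Returns true when the line has at least two code points and
    // every code point is identical to the first one.
    static boolean looksCollapsed(String line) {
        int[] codePoints = line.codePoints().toArray();
        if (codePoints.length < 2) {
            return false; // a single character cannot show the symptom
        }
        for (int cp : codePoints) {
            if (cp != codePoints[0]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // U+65E9 U+4E0A U+597D is the three-character string from the report.
        String expected = "\u65e9\u4e0a\u597d";
        // What the report says the extraction returned: the first character, three times.
        String extracted = "\u65e9\u65e9\u65e9";

        System.out.println(looksCollapsed(extracted)); // prints true
        System.out.println(looksCollapsed(expected));  // prints false
    }
}
```

In practice the string passed to looksCollapsed would be each line of the value returned by pdfStripper.getText(pdDoc) in the code above, which makes it easier to tell which attached PDFs reproduce the problem.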



