[jira] [Created] (PDFBOX-3499) PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF

Kaleb Akalework (JIRA) Thu, 15 Sep 2016 12:22:40 -0700

Kaleb Akalework created PDFBOX-3499:
---------------------------------------


             Summary: PDFBox 2.0.2 not parsing Japanese and Chinese Characters 
correctly from PDF
                 Key: PDFBOX-3499
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3499
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 2.0.2
            Reporter: Kaleb Akalework


I'm trying to use PDFBox 2.0.2 to parse PDF files that contain Japanese and 
Chinese characters, but for some reason it does not parse them correctly. 
Every character that is extracted is changed to the first character in the 
line. For example, if the document contains 早上好, the extracted text correctly 
has 3 characters, but all 3 are 早早早: the last two characters are replaced by 
the first character. The same string is parsed correctly from a Word 
document. I was trying to use this with Tika-13, which uses PDFBox 2.0.2. On 
the advice of Tim Allison (from Tika) I tried it with PDFBox 2.0.3, and I 
still see the same problem. The following is the code I used.

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFBoxTesting {
    private static PDFParser parser;
    private static PDFTextStripper pdfStripper;
    private static PDDocument pdDoc;
    private static COSDocument cosDoc;
    private static String Text;
    private static String filePath;
    private static File file;

    public static String ToText() throws IOException {
        pdfStripper = null;
        pdDoc = null;
        cosDoc = null;
        filePath = "C:\\Users\\kaleba\\Desktop\\nihao2.pdf";
        file = new File(filePath);
        parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdDoc.getNumberOfPages();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(10);
        // reading text from page 1 to 10
        // if you want to get text from full pdf file use this code
        // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
        Text = pdfStripper.getText(pdDoc); // put breakpoint after executing getText.
        return Text;
    }

    public static void main(String[] args) {
        // TODO Auto-generated method stub
        try {
            ToText();
        } catch (Exception e) {
            int i = 1;
        }
    }
}
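
The symptom can also be checked without a breakpoint by dumping the 
extracted string as Unicode code points. The following is only a minimal 
sketch, assuming the extracted text is already available as a String. The 
class CodePointDump and its two methods are hypothetical helpers that use 
only the Java standard library; they are not part of PDFBox or of the test 
class above.

```java
import java.util.StringJoiner;

public class CodePointDump {

    // Formats every Unicode code point of the text as U+XXXX, separated by spaces.
    public static String describe(String text) {
        StringJoiner joiner = new StringJoiner(" ");
        text.codePoints().forEach(cp -> joiner.add(String.format("U+%04X", cp)));
        return joiner.toString();
    }

    // Returns true when the text has at least two code points and every one of
    // them equals the first, which is the pattern described in this report.
    public static boolean allSameAsFirst(String text) {
        if (text.codePointCount(0, text.length()) < 2) {
            return false;
        }
        int first = text.codePointAt(0);
        return text.codePoints().allMatch(cp -> cp == first);
    }

    public static void main(String[] args) {
        // Escapes are used so the example does not depend on the source file encoding.
        String expected = "\u65E9\u4E0A\u597D"; // the string in the document
        String observed = "\u65E9\u65E9\u65E9"; // the string reported as extracted

        System.out.println("expected: " + describe(expected));
        System.out.println("observed: " + describe(observed));
        System.out.println("collapsed to first character: " + allSameAsFirst(observed));
    }
}
```

In ToText(), the value returned by pdfStripper.getText(pdDoc) could be split 
into lines and each line passed to these helpers. The check is only a 
heuristic: a line that really consists of one repeated character would also 
be flagged.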



