Kaleb Akalework created PDFBOX-3499:
---------------------------------------
Summary: PDFBox 2.0.2 not parsing Japanese and Chinese Characters
correctly from PDF
Key: PDFBOX-3499
URL: https://issues.apache.org/jira/browse/PDFBOX-3499
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.2
Reporter: Kaleb Akalework
I'm trying to use PDFBox 2.0.2 to parse PDF files that contain Japanese and
Chinese characters, but for some reason it does not parse them correctly. Every
character that is extracted is changed to the first character in the line. For
example, if the document contains 早上好, the extracted text correctly has
3 characters, but all 3 characters are 早早早: the last two characters are
replaced by the first character. The same string is parsed correctly from a
Word document. I was trying to use this with Tika-13, which uses PDFBox 2.0.2.
On the advice of Tim Allison (from Tika), I tried it with PDFBox 2.0.3, and I
still see the same problem. The following is the code I used.
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class PDFBoxTesting {
private static PDFParser parser;
private static PDFTextStripper pdfStripper;
private static PDDocument pdDoc ;
private static COSDocument cosDoc ;
private static String Text ;
private static String filePath;
private static File file;
public static String ToText() throws IOException
{
    pdfStripper = null;
    pdDoc = null;
    cosDoc = null;
    filePath = "C:\\Users\\kaleba\\Desktop\\nihao2.pdf";
    file = new File(filePath);
    parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0
    parser.parse();
    cosDoc = parser.getDocument();
    pdfStripper = new PDFTextStripper();
    pdDoc = new PDDocument(cosDoc);
    pdDoc.getNumberOfPages();
    pdfStripper.setStartPage(1);
    pdfStripper.setEndPage(10); // reading text from page 1 to 10
    // if you want to get text from full pdf file use this code
    // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
    Text = pdfStripper.getText(pdDoc); // put breakpoint after executing getText.
    return Text;
}
public static void main(String[] args) {
    // TODO Auto-generated method stub
    try {
        ToText();
    } catch (Exception e) {
        int i = 1;
    }
}
}
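
One way to confirm the symptom described above (three extracted characters that are all the first character) is to dump the code points of the extracted string instead of relying on how a debugger or console renders it. The following is only a minimal sketch using the Java standard library; the class name CodePointDump and its two helper methods are hypothetical and are not part of PDFBox or Tika. In practice the string passed in would be the value returned by pdfStripper.getText(pdDoc).

```java
import java.util.stream.Collectors;

// Hypothetical diagnostic helper (not part of PDFBox): shows which Unicode
// code points an extracted string actually contains.
public class CodePointDump {

    // Formats every code point of s as U+XXXX, separated by single spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // Returns true if s has at least two code points and they are all identical,
    // which is the pattern reported in this issue.
    static boolean allSameCodePoint(String s) {
        return s.codePoints().count() > 1
                && s.codePoints().distinct().count() == 1;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the example does not depend on source encoding.
        String expected = "\u65E9\u4E0A\u597D";  // the text in the document
        String extracted = "\u65E9\u65E9\u65E9"; // the text reported as extracted

        System.out.println("expected:  " + describe(expected));
        System.out.println("extracted: " + describe(extracted));
        System.out.println("all same:  " + allSameCodePoint(extracted));
    }
}
```

If the dump of the extracted text shows one repeated code point, the repetition is in the extracted string itself rather than in how the text is displayed.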