[
https://issues.apache.org/jira/browse/PDFBOX-3499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kaleb Akalework updated PDFBOX-3499:
------------------------------------
Attachment: UnicodeTest.pdf
> PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF
> ---------------------------------------------------------------------------
>
> Key: PDFBOX-3499
> URL: https://issues.apache.org/jira/browse/PDFBOX-3499
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.2
> Reporter: Kaleb Akalework
> Priority: Critical
> Attachments: UnicodeTest.pdf, nihao2.pdf
>
>
> I'm trying to use PDFBox 2.0.2 to parse PDF files that contain Japanese and
> Chinese characters, but it does not parse them correctly. Every character
> that is extracted is replaced by the first character in the line. For
> example, if the document contains 早上好, the extracted text correctly has
> 3 characters, but all 3 come out as 早早早: the last two characters are
> replaced by the first. The same string is parsed correctly from a Word
> document. I was trying to use this with Tika-13, which uses PDFBox 2.0.2.
> On the advice of Tim Allison (from Tika), I tried it with PDFBox 2.0.3 and
> still see the same problem. The following is the code I used.
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.io.RandomAccessFile;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
> public class PDFBoxTesting {
> private static PDFParser parser;
> private static PDFTextStripper pdfStripper;
> private static PDDocument pdDoc ;
> private static COSDocument cosDoc ;
> private static String Text ;
> private static String filePath;
> private static File file;
> public static String ToText() throws IOException
> {
> pdfStripper = null;
> pdDoc = null;
> cosDoc = null;
> filePath = "C:\\Users\\kaleba\\Desktop\\nihao2.pdf";
> file = new File(filePath);
> parser = new PDFParser(new RandomAccessFile(file,"r")); // update for PDFBox V 2.0
> parser.parse();
> cosDoc = parser.getDocument();
> pdfStripper = new PDFTextStripper();
> pdDoc = new PDDocument(cosDoc);
> pdDoc.getNumberOfPages();
> pdfStripper.setStartPage(1);
> pdfStripper.setEndPage(10); // reading text from page 1 to 10
> // if you want to get text from full pdf file use this code
> // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
> Text = pdfStripper.getText(pdDoc); // put breakpoint after executing getText.
> return Text;
> }
> public static void main(String[] args) {
> // TODO Auto-generated method stub
> try
> { ToText(); }
> catch (Exception e)
> { int i=1; }
> }
> }
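
The symptom described above (every extracted character replaced by the first one, so 早上好 comes out as 早早早) can be checked for mechanically. The following is only a minimal sketch using the Java standard library; the class name RepeatedGlyphCheck and the method isSingleRepeatedCodePoint are hypothetical names chosen for illustration and are not part of PDFBox or Tika. The sample strings are written as Unicode escapes so the file compiles regardless of source encoding.

```java
public class RepeatedGlyphCheck {

    // Hypothetical helper (not a PDFBox API): returns true when the text
    // has at least two code points and every one equals the first.
    static boolean isSingleRepeatedCodePoint(String text) {
        int[] codePoints = text.codePoints().toArray();
        if (codePoints.length < 2) {
            return false; // too short to show the symptom
        }
        for (int cp : codePoints) {
            if (cp != codePoints[0]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // U+65E9 U+4E0A U+597D: the three characters as they appear in the PDF
        String expected = "\u65e9\u4e0a\u597d";
        // U+65E9 three times: the extraction result reported in this issue
        String extracted = "\u65e9\u65e9\u65e9";

        System.out.println(isSingleRepeatedCodePoint(expected));  // prints false
        System.out.println(isSingleRepeatedCodePoint(extracted)); // prints true
    }
}
```

In a reproduction, the string returned by pdfStripper.getText(pdDoc) could be split into lines and each line passed to this check, on the assumption that no line of the test document legitimately consists of one repeated character.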
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)