[jira] [Updated] (PDFBOX-3499) PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF

Tilman Hausherr (JIRA) Thu, 15 Sep 2016 13:24:54 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-3499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tilman Hausherr updated PDFBOX-3499:
------------------------------------
    Description: 
I'm trying to use PDFBox 2.0.2 to parse PDF files that contain Japanese and 
Chinese characters, but for some reason it does not parse them correctly. 
Every extracted character is replaced by the first character in the line. For 
example, if the document contains 早上好, the extracted text correctly has 3 
characters, but all 3 are 早早早: the last two characters are replaced by the 
first one. The same string is parsed correctly from a Word document. I was 
trying to use this with Tika-13, which uses PDFBox 2.0.2. On the advice of 
Tim Allison (from Tika), I tried it with PDFBox 2.0.3 and still see the same 
problem. The following is the code I used.
{code}
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFBoxTesting
{
    private static PDFParser parser;
    private static PDFTextStripper pdfStripper;
    private static PDDocument pdDoc;
    private static COSDocument cosDoc;
    private static String Text;
    private static String filePath;
    private static File file;

    public static String ToText() throws IOException
    {
        pdfStripper = null;
        pdDoc = null;
        cosDoc = null;
        filePath = "C:\\Users\\kaleba\\Desktop\\nihao2.pdf";
        file = new File(filePath);
        parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdDoc.getNumberOfPages();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(10); // reading text from page 1 to 10 
        // if you want to get text from full pdf file use this code
        // pdfStripper.setEndPage(pdDoc.getNumberOfPages()); 
        Text = pdfStripper.getText(pdDoc); // put breakpoint after executing getText.
        return Text;
    }

    public static void main(String[] args)
    {
        try
        {
            ToText();
        }
        catch (Exception e)
        {
            int i = 1;
        }
    }
}
{code}
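
The symptom described above (every character in a line replaced by the first 
one) can be checked mechanically on the extracted string. The following is a 
minimal sketch, not part of PDFBox or Tika: the class RepeatedGlyphCheck and 
its methods are hypothetical names introduced here for illustration, and it 
uses only the Java standard library. The Unicode escapes spell 早上好. The 
check is only a heuristic: a legitimate line such as "----" is flagged too.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of PDFBox or Tika.
final class RepeatedGlyphCheck {

    // True when the line has at least two code points and every one of
    // them equals the first (the symptom reported in this issue).
    static boolean isFirstGlyphRepeated(String line) {
        int[] cps = line.codePoints().toArray();
        if (cps.length < 2) {
            return false; // a single character cannot show the symptom
        }
        return Arrays.stream(cps).allMatch(cp -> cp == cps[0]);
    }

    // Counts the lines of an extracted text that show the symptom.
    static long countSuspectLines(String text) {
        return Arrays.stream(text.split("\\R"))
                .map(String::trim)
                .filter(RepeatedGlyphCheck::isFirstGlyphRepeated)
                .count();
    }

    public static void main(String[] args) {
        String expected = "\u65E9\u4E0A\u597D";  // the text in the PDF
        String extracted = "\u65E9\u65E9\u65E9"; // what the report describes
        System.out.println(isFirstGlyphRepeated(expected));  // false
        System.out.println(isFirstGlyphRepeated(extracted)); // true
    }
}
```

In practice the string returned by pdfStripper.getText(pdDoc) could be passed 
to countSuspectLines to see how many lines of a document are affected.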


  was:
I'm trying to use PDFBox 2.0.2 to parse PDF files that contain Japanese and 
Chinese characters, but for some reason it does parse it correctly. Every 
character that is extracted is changed to the first letter in the line. For 
example if the document contains 早上好, this, the extracted text will correctly 
know that it has 3 characters but all 3 characters will be 早早早, the last two 
characters are replaced by the first character. This same string is correctly 
parsed, in a word document.  I was trying to use this with Tika-13, which was 
is PDFBOX 2.0.2. Under Tim Allisons (From Tika) advice i tried it with PDFBOX 
2.0.3. And I still see the same problem. The follwoing is the code I used.

mport java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class PDFBoxTesting {
private static PDFParser parser;
private static PDFTextStripper pdfStripper;
private static PDDocument pdDoc ;
private static COSDocument cosDoc ;
private static String Text ;
private static String filePath;
private static File file;
public static String ToText() throws IOException
{ pdfStripper = null; pdDoc = null; cosDoc = null; filePath = 
"C:\\Users\\kaleba\\Desktop\\nihao2.pdf"; file = new File(filePath); parser = 
new PDFParser(new RandomAccessFile(file,"r")); // update for PDFBox V 2.0 
parser.parse(); cosDoc = parser.getDocument(); pdfStripper = new 
PDFTextStripper(); pdDoc = new PDDocument(cosDoc); pdDoc.getNumberOfPages(); 
pdfStripper.setStartPage(1); pdfStripper.setEndPage(10); // reading text from 
page 1 to 10 // if you want to get text from full pdf file use this code // 
pdfStripper.setEndPage(pdDoc.getNumberOfPages()); Text = 
pdfStripper.getText(pdDoc); // put breakpoint after executing getTtext. return 
Text; }
public static void main(String[] args) {
// TODO Auto-generated method stub
try
{ ToText(); }
catch (Exception e)
{ int i=1; }
}
}


> PDFBox 2.0.2 not parsing Japanese and Chinese Characters correctly from PDF
> ---------------------------------------------------------------------------
>
>                 Key: PDFBOX-3499
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3499
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.2
>            Reporter: Kaleb Akalework
>            Priority: Critical
>         Attachments: AppBody-Sample-Chinese.pdf, UnicodeTest.pdf, nihao2.pdf


