[jira] [Comment Edited] (PDFBOX-4275) Can't extract slanted text through the parsers of the PDFBox

Soocheon Kim (JIRA) Tue, 31 Jul 2018 08:07:29 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16563782#comment-16563782
 ]


Soocheon Kim edited comment on PDFBOX-4275 at 7/31/18 3:06 PM:
---------------------------------------------------------------

(I'm sorry that I couldn't do anything for a few days)

PDFBox can't extract diagonal text. I ran another test today.

 

1. The content of my PDF file looks like this:

[embedded image not rendered in this email]

 

2. My test program is as follows:

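  import org.apache.pdfbox.pdmodel.PDDocument;
  import org.apache.pdfbox.text.PDFTextStripper;

  // "file" is the java.io.File pointing at the attached rotation.pdf
  PDDocument doc = null;
  try {
    doc = PDDocument.load(file);
    PDFTextStripper parser = new PDFTextStripper();
    // Extract the text of the whole document and print it
    String text = parser.getText(doc);
    System.out.println(text);
  } finally {
    if (doc != null) {
      doc.close();
    }
  }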

 

3. The results are as follows:

1111
 5
 5
 5
 5
 7777
 9
 9
 9
 9

 

PDFTextStripper extracts only text rotated by 0, 90, 180 or 270 degrees.

Overriding PDFStreamEngine.showGlyph(...) gives the same result.

 

(The attached PDF file was created with MS PowerPoint)
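
To make the angles in question concrete, here is a minimal sketch in plain Java of how a glyph's rotation can be derived from the first row (a, b) of its text matrix, and how to tell whether that rotation is a multiple of 90 degrees. The class and method names (GlyphAngle, rotationDegrees, isAxisAligned) are hypothetical and not part of PDFBox. It assumes the two matrix entries have already been read, for example from TextPosition.getTextMatrix() inside an overridden method, and it does not show what PDFBox does internally.

```java
// Hypothetical helper, not a PDFBox class: classifies a glyph's rotation.
final class GlyphAngle {

    // Rotation in degrees, normalized to [0, 360), for a matrix whose
    // first row is (a, b) = (cos(theta) * scale, sin(theta) * scale).
    static double rotationDegrees(double a, double b) {
        double deg = Math.toDegrees(Math.atan2(b, a));
        return deg < 0 ? deg + 360.0 : deg;
    }

    // True when the angle lies within the tolerance of 0, 90, 180 or 270.
    static boolean isAxisAligned(double degrees, double toleranceDegrees) {
        double r = degrees % 90.0;
        return r <= toleranceDegrees || 90.0 - r <= toleranceDegrees;
    }

    public static void main(String[] args) {
        // Matrix entries for text rotated by 45 degrees at unit scale.
        double a = Math.cos(Math.toRadians(45));
        double b = Math.sin(Math.toRadians(45));
        double angle = rotationDegrees(a, b);
        System.out.println("angle=" + Math.round(angle)
                + " axisAligned=" + isAxisAligned(angle, 0.5));
    }
}
```

A check like this could be called on each glyph to log which angles actually reach the overridden method, which would show whether the 45 or 60 degree glyphs are delivered at all.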


was (Author: ksc0524):
(I'm sorry that I couldn't do anything for a few days)

PDFBox can't extract diagonal text. I ran another test today.

 

1. The content of my pdf file is like this:

!image-2018-07-31-23-38-32-829.png!

2. My test program is as follows:

  PDDocument doc = null;
  try {
    doc = PDDocument.load(file);
    PDFTextStripper parser = new PDFTextStripper();
    String text = parser.getText(doc);
    System.out.println(text);
  } finally {
    if (doc != null) {
      doc.close();
    }
  }

 

3. The results are as follows:

1111
 5
 5
 5
 5
 7777
 9
 9
 9
 9

 

PDFTextStripper extracts only text rotated by 0, 90, 180 or 270 degrees.

Overriding PDFStreamEngine.showGlyph(...) gives the same result.

 

(The attached PDF file was created with MS PowerPoint)

> Can't extract slanted text through the parsers of the PDFBox
> ------------------------------------------------------------
>
>                 Key: PDFBOX-4275
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4275
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Text extraction
>    Affects Versions: 2.0.10
>         Environment: I tested this in the overridden showGlyph() method of my 
> class extending PDFStreamEngine, PDFGraphicsStreamEngine or PDFTextStripper.
>            Reporter: Soocheon Kim
>            Priority: Major
>         Attachments: image-2018-08-01-00-06-52-824.png, rotation.pdf
>
>
> PDFBox (the stream engine) extracts only text that is rotated by 0, 90, 180 
> or -90 degrees.
> For example, it can't extract text rotated by 45 or 60 degrees.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
