[jira] [Comment Edited] (PDFBOX-4275) Can't extract slanted text through the parsers of the PDFBox

Soocheon Kim (JIRA) Tue, 31 Jul 2018 08:04:32 -0700


    [ https://issues.apache.org/jira/browse/PDFBOX-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16563782#comment-16563782 ]


Soocheon Kim edited comment on PDFBOX-4275 at 7/31/18 3:03 PM:
---------------------------------------------------------------

(Sorry that I couldn't work on this for a few days.)

PDFBox can't extract diagonal text. I ran another test today.

 

1. The content of my PDF file looks like this:

!image-2018-07-31-23-38-32-829.png!

2. My test program is as follows:

    PDDocument doc = null;

    try {
        doc = PDDocument.load(file);
        PDFTextStripper parser = new PDFTextStripper();
        String text = parser.getText(doc);

        System.out.println(text);
    } finally {
        if (doc != null)
            doc.close();
    }

 

3. The results are as follows:

1111
5
5
5
5
7777
9
9
9
9

 

PDFTextStripper extracts only text rotated by 0, 90, 180 or 270 degrees.

PDFStreamEngine.showGlyph(...) behaves the same way.
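
For reference, the rotation of a run of text can be read from the first two entries of its text matrix [a b c d e f]: for a pure rotation by an angle t, a = cos(t) and b = sin(t), so t = atan2(b, a). The sketch below is plain Java and does not call PDFBox; the class and method names are hypothetical. It only illustrates which angles count as multiples of 90 degrees and which (such as 45 degrees) do not. It makes no claim about how PDFBox itself decides what to extract.

```java
// Hypothetical helper, not part of PDFBox.
final class TextRotation {

    /**
     * Rotation angle in degrees, normalized to [0, 360), for a text matrix
     * [a b c d e f] whose first row is (a, b).
     */
    public static double angleDegrees(double a, double b) {
        double deg = Math.toDegrees(Math.atan2(b, a));
        return (deg % 360 + 360) % 360;
    }

    /** True when the angle is within 0.01 degrees of a multiple of 90. */
    public static boolean isAxisAligned(double a, double b) {
        double rem = angleDegrees(a, b) % 90;
        return Math.min(rem, 90 - rem) < 0.01;
    }

    public static void main(String[] args) {
        double c45 = Math.cos(Math.toRadians(45));
        double s45 = Math.sin(Math.toRadians(45));

        System.out.println(isAxisAligned(1, 0));     // horizontal text
        System.out.println(isAxisAligned(0, 1));     // text rotated by 90 degrees
        System.out.println(isAxisAligned(c45, s45)); // text rotated by 45 degrees
    }
}
```

A check like isAxisAligned could be applied to the matrix values seen in an overridden showGlyph() to log which glyphs are diagonal; that wiring is left out here because it depends on the PDFBox version in use.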

 

(The attached PDF file was created with MS PowerPoint.)



> Can't extract slanted text through the parsers of the PDFBox
> ------------------------------------------------------------
>
>                 Key: PDFBOX-4275
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4275
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Text extraction
>    Affects Versions: 2.0.10
>         Environment: Tested in the overridden showGlyph() method of my 
> class extending PDFStreamEngine, PDFGraphicsStreamEngine or PDFTextStripper.
>            Reporter: Soocheon Kim
>            Priority: Major
>         Attachments: rotation.pdf
>
>
> PDFBox (StreamEngine) extracts only text that is rotated by 0, 90, 180 
> or -90 degrees.
> For example, it can't extract text rotated by 45 or 60 degrees.


