[jira] [Reopened] (PDFBOX-4275) Can't extract slanted text through the parsers of the PDFBox

Soocheon Kim (JIRA) Tue, 31 Jul 2018 07:49:24 -0700


     [ https://issues.apache.org/jira/browse/PDFBOX-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Soocheon Kim reopened PDFBOX-4275:
----------------------------------

(Sorry that I couldn't respond for a few days.)

PDFBox still can't extract diagonal text. I ran another test today.

 

1. The content of my pdf file is like this:

!image-2018-07-31-23-38-32-829.png!

2. My test program is as follows:

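  import org.apache.pdfbox.pdmodel.PDDocument;
  import org.apache.pdfbox.text.PDFTextStripper;

  // "file" refers to the test PDF shown above.
  // try-with-resources closes the document even if extraction fails.
  try (PDDocument doc = PDDocument.load(file)) {
   PDFTextStripper parser = new PDFTextStripper();
   String text = parser.getText(doc);

   System.out.println(text);
  }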

 

3. The results are as follows:

1111
5
5
5
5
7777
9
9
9
9

 

PDFTextStripper extracts only text rotated by 0, 90, 180 or 270 degrees.

An overridden PDFStreamEngine.showGlyph(...) shows the same behavior.
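
To narrow down which glyphs are affected, the rotation angle of each glyph can be derived from its text matrix. The following is only a minimal sketch: the class GlyphRotation and its methods are hypothetical helpers, not part of PDFBox, and it assumes that the two inputs are the a and b components of the text matrix (for example, what Matrix.getScaleX() and Matrix.getShearY() return for TextPosition.getTextMatrix()).

```java
// Hypothetical helper, not part of PDFBox: classifies a glyph's rotation
// from the first row (a, b) of its text matrix [a b 0; c d 0; e f 1].
final class GlyphRotation {
    private GlyphRotation() {}

    /** Counter-clockwise rotation in degrees, roughly -180 to 180. */
    static double degrees(double a, double b) {
        return Math.toDegrees(Math.atan2(b, a));
    }

    /** True if the angle is within the tolerance of a multiple of 90 degrees. */
    static boolean isAxisAligned(double degrees, double toleranceDegrees) {
        double remainder = Math.abs(degrees) % 90.0;
        return Math.min(remainder, 90.0 - remainder) <= toleranceDegrees;
    }
}
```

Logging this angle inside an overridden showGlyph(...) or writeString(...) should show whether glyphs at 45 or 60 degrees reach the callback at all, or are dropped earlier.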

> Can't extract slanted text through the parsers of the PDFBox
> ------------------------------------------------------------
>
>                 Key: PDFBOX-4275
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4275
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Text extraction
>    Affects Versions: 2.0.10
>         Environment: Tested in the overridden showGlyph() method of my 
> class extending PDFStreamEngine, PDFGraphicsStreamEngine or PDFTextStripper.
>            Reporter: Soocheon Kim
>            Priority: Major
>
> PDFBox (StreamEngine) extracts only text that is rotated by 0, 90, 180 
> or -90 degrees.
> For example, it can't extract text rotated by 45 or 60 degrees.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
