[ 
https://issues.apache.org/jira/browse/PDFBOX-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir updated PDFBOX-5213:
-----------------------------
    Description: 
Since version 2.0.22, PDFTextStripper inserts a line break after superscript (sup) values.

Output in 2.0.21 and earlier:

"Other (12) 1,505 832"

Output in 2.0.22 and later:

"Other (12)
 1,505 832"

You can see this by comparing the attached files GS-2010-q4-earnings.pdf_expected.html 
(2.0.21 and earlier) and GS-2010-q4-earnings.pdf_result.html (2.0.22 and later).

You can reproduce the problem with the following simple code, where pageBytes holds the 
contents of the file GS-2010-q4-earnings.pdf:

List<String> pages = new ArrayList<>();

// limit, log and Logger.DEBUG_MAIN are defined elsewhere in the calling code
PDDocument pdDocument = null;
try {
    String pass = "";
    PDFParser parser = new PDFParser(new RandomAccessReadBuffer(pageBytes), pass);
    pdDocument = parser.parse();

    // Process at most "limit" pages
    int numberOfPages = pdDocument.getNumberOfPages();
    if (limit < numberOfPages) {
        numberOfPages = limit;
    }

    // Extract the text of each page separately (page numbers are 1-based)
    for (int i = 0; i < numberOfPages; i++) {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(i + 1);
        stripper.setEndPage(i + 1);

        pages.add(stripper.getText(pdDocument));
    }
} catch (Exception e) {
    log.error(e.getMessage(), e);
    Logger.DEBUG_MAIN.logError(e);
} finally {
    if (pdDocument != null) {
        try {
            pdDocument.close();
        } catch (IOException e) {
            log.error(e.getMessage(), e);
            Logger.DEBUG_MAIN.logError(e);
        }
    }
}
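
To check the extracted page text for this regression automatically, a small helper along the following lines might be useful. This is only a sketch and not part of PDFBox: the class name LineBreakRegressionCheck and its methods are hypothetical, and it uses plain string handling on the text returned by stripper.getText(...), with the sample row taken from the output quoted above.

```java
public class LineBreakRegressionCheck {

    /** Returns true if expectedRow occurs within a single line of pageText. */
    public static boolean appearsOnSingleLine(String pageText, String expectedRow) {
        String wanted = normalize(expectedRow);
        // \R matches any line break sequence (\n, \r\n, \r, ...)
        for (String line : pageText.split("\\R")) {
            if (normalize(line).contains(wanted)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Returns true if expectedRow is not found on any single line, but is found
     * once line breaks are treated like ordinary spaces.
     */
    public static boolean isSplitAcrossLines(String pageText, String expectedRow) {
        return !appearsOnSingleLine(pageText, expectedRow)
                && normalize(pageText).contains(normalize(expectedRow));
    }

    /** Trims the string and collapses every whitespace run (including line breaks) to one space. */
    private static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String row = "Other (12) 1,505 832";
        String textBefore = "Other (12) 1,505 832\n";    // as reported for 2.0.21 and earlier
        String textAfter = "Other (12)\n 1,505 832\n";   // as reported for 2.0.22 and later

        System.out.println(isSplitAcrossLines(textBefore, row)); // prints "false"
        System.out.println(isSplitAcrossLines(textAfter, row));  // prints "true"
    }
}
```

In the reproduction code above, each entry of pages could be passed as pageText; a result of true for the row "Other (12) 1,505 832" would indicate the unwanted line break.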

  was:
Since version 2.0.22, PDFTextStripper inserts a line break after superscript (sup) values.

Output in 2.0.21 and earlier:

"Other (12) 1,505 832"

Output in 2.0.22 and later:

"Other (12)
1,505 832"

You can see this by comparing the attached files GS-2010-q4-earnings.pdf_expected.html 
(2.0.21 and earlier) and GS-2010-q4-earnings.pdf_result.html (2.0.22 and later).

> PDFTextStripper adds next line symbol after sup values (regression) 
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-5213
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5213
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.22, 2.0.23, 2.0.24
>            Reporter: Vladimir
>            Priority: Minor
>             Fix For: 2.0.21
>
>         Attachments: GS-2010-q4-earnings.pdf, 
> GS-2010-q4-earnings.pdf_expected.html, GS-2010-q4-earnings.pdf_result.html
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
