[jira] [Assigned] (PDFBOX-1874) PDFTextStripper.isParagraphSeparation(...)

JIRA Tue, 09 Dec 2014 03:17:44 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andreas Lehmkühler reassigned PDFBOX-1874:
------------------------------------------

    Assignee: Andreas Lehmkühler

> PDFTextStripper.isParagraphSeparation(...)
> ------------------------------------------
>
>                 Key: PDFBOX-1874
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1874
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>         Environment: Eclipse
>            Reporter: Yuri Burrows
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>              Labels: patch
>
> PDFTextStripper.isParagraphSeparation(...) seems to have an issue with how it 
> finds Y text indentation.
> PROBLEM:
> I believe the issue is due to precision in the the following logic:
>             float yGap = Math.abs(position.getTextPosition().getYDirAdj()-
>                     lastPosition.getTextPosition().getYDirAdj());
>             float xGap = (position.getTextPosition().getXDirAdj()-
>                     lastLineStartPosition.getTextPosition().getXDirAdj());
>             if(yGap > (getDropThreshold()*maxHeightForLine))
>             {
>                         result = true;
> yGap has a precision to 1000th+, while (getDropThreshold()*maxHeightForLine) 
> has a precision to 100,000th. Resulting in the following comparison (example):
> 16.018 > 16.018005
> which evaluates to "True". However 16.018 > 16.018 would evaluate to "False".
> EFFECT OF THE PROBLEM:
> every line in the output is marked as "isParagraphStart = true" and 
> "writeParagraphEnd() ... = true".
> I.E. 
> |||NEW_LINE|||
> |||PARAGRAPH_START|||PDFBox has been designed to represent PDF documents 
> using familiar object-oriented paradigms. The data|||NEW_LINE|||
> contained in a PDF document is a collection of basic object types: arrays, 
> booleans, dictionaries, numbers,|||NEW_LINE|||
> |||PARAGRAPH_END||||||NEW_LINE|||
> |||PARAGRAPH_START|||strings and binary streams. PDFBox captures these basic 
> object types in the org.pdfbox.cos package (the|||NEW_LINE|||
> COS Model). While it's possible to create any desired interactions with a PDF 
> document using only these|||NEW_LINE|||
> |||PARAGRAPH_END||||||NEW_LINE|||
> In the source PDF these lines appear as such:
> "PDFBox has been designed to represent PDF documents using familiar 
> object-oriented paradigms. The data
> contained in a PDF document is a collection of basic object types: arrays, 
> booleans, dictionaries, numbers,
> strings and binary streams. PDFBox captures these basic object types in the 
> org.pdfbox.cos package (the
> COS Model). While it's possible to create any desired interactions with a PDF 
> document using only these"
> MY WORKAROUND:
> NOTE: there is a small performance hit with this workaround.
>        float yGap = Math.abs(position.getTextPosition().getYDirAdj()
>        - lastPosition.getTextPosition().getYDirAdj());
>       
>        DecimalFormat df = new DecimalFormat("#.00");
>        float yGapTruncated = Float.valueOf(df.format(yGap));
>       
>        float newYVal = Float.valueOf(df.format(getDropThreshold()
>        * maxHeightForLine));



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (PDFBOX-1874) PDFTextStripper.isParagraphSeparation(...)

Reply via email to