[ https://issues.apache.org/jira/browse/PDFBOX-4054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-4054.
-----------------------------------
Resolution: Won't Do
> allow to access positions of text extracted by PDFTextStripper
> --------------------------------------------------------------
>
> Key: PDFBOX-4054
> URL: https://issues.apache.org/jira/browse/PDFBOX-4054
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 1.8.13
> Environment: any
> Reporter: Wolfgang Fahl
> Priority: Critical
>
> https://stackoverflow.com/questions/25109969/how-to-extract-a-paragraph-from-a-pdf-file-and-store-its-position/48119163?noredirect=1#comment83218312_48119163
> describes a need that pdftotext -bbox-layout fulfills by supplying structural
> information for the text extraction.
> There has been no PDFBox answer for a while, so I assume such a feature is
> missing.
> A similar approach would be a useful improvement to PDFBox and much wanted
> for certain applications, e.g. when the position of a text on a page is
> important for its meaning.
> The poppler xhtml approach supplies for example:
> <flow>
>   <block xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
>     <line xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
>       <word xMin="333.000000" yMin="270.150000" xMax="342.896500" yMax="275.150000">Your</word>
>       <word xMin="347.047500" yMin="270.150000" xMax="360.004000" yMax="275.150000">Bank</word>
>     </line>
>   </block>
> </flow>
> flow/block/line/word is a hierarchy, and you get position information for
> blocks and lines as well as words.
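
To make the structure of that output concrete, here is a minimal sketch that reads the word boxes out of a fragment shaped like the one quoted above, using only the JDK's DOM parser. The class and method names (BBoxWords, Word, parse) are made up for this illustration. It assumes a bare <flow> fragment; a complete pdftotext output file wraps this in an XHTML document and may need extra parser configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class BBoxWords {
    /** One <word> element: its text plus its bounding box (hypothetical type). */
    static final class Word {
        final String text;
        final double xMin, yMin, xMax, yMax;

        Word(String text, double xMin, double yMin, double xMax, double yMax) {
            this.text = text;
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }
    }

    /** Collects every <word> element in document order, regardless of nesting depth. */
    static List<Word> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("word");
            List<Word> words = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                words.add(new Word(
                        e.getTextContent(),
                        Double.parseDouble(e.getAttribute("xMin")),
                        Double.parseDouble(e.getAttribute("yMin")),
                        Double.parseDouble(e.getAttribute("xMax")),
                        Double.parseDouble(e.getAttribute("yMax"))));
            }
            return words;
        } catch (Exception ex) {
            // Assumption for this sketch: malformed input is reported as a bad argument.
            throw new IllegalArgumentException("could not parse bbox fragment", ex);
        }
    }

    public static void main(String[] args) {
        // Fragment from the issue; single-quoted attributes keep the Java string readable.
        String xml =
                "<flow><block xMin='333.000000' yMin='270.150000' xMax='360.004000' yMax='275.150000'>"
              + "<line xMin='333.000000' yMin='270.150000' xMax='360.004000' yMax='275.150000'>"
              + "<word xMin='333.000000' yMin='270.150000' xMax='342.896500' yMax='275.150000'>Your</word>"
              + "<word xMin='347.047500' yMin='270.150000' xMax='360.004000' yMax='275.150000'>Bank</word>"
              + "</line></block></flow>";
        for (Word w : parse(xml)) {
            System.out.println(w.text + " x=[" + w.xMin + ", " + w.xMax + "] y=[" + w.yMin + ", " + w.yMax + "]");
        }
    }
}
```
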
> PDFBox could supply similar information via callbacks.
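
For illustration only, a callback-style interface of the kind requested might look roughly like the sketch below. Everything in it (LayoutCallbacks, Box, LayoutListener, emitLine) is hypothetical and not part of PDFBox; it only shows the idea of reporting each word with its box and then the enclosing line box as the union of the word boxes. PDFTextStripper itself has an overridable writeString(String, List<TextPosition>) method that receives per-glyph positions, which this sketch does not use so that it stays self-contained.

```java
import java.util.Arrays;
import java.util.List;

class LayoutCallbacks {
    /** Axis-aligned bounding box in page coordinates (hypothetical type). */
    static final class Box {
        final double xMin, yMin, xMax, yMax;

        Box(double xMin, double yMin, double xMax, double yMax) {
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }

        /** Smallest box that contains both this box and the other one. */
        Box union(Box o) {
            return new Box(Math.min(xMin, o.xMin), Math.min(yMin, o.yMin),
                           Math.max(xMax, o.xMax), Math.max(yMax, o.yMax));
        }

        @Override
        public String toString() {
            return "[" + xMin + ", " + yMin + ", " + xMax + ", " + yMax + "]";
        }
    }

    /** Hypothetical listener; a real extractor would call it while walking the page. */
    interface LayoutListener {
        void word(String text, Box box);
        void line(Box box);
    }

    /** Reports each word of one line, then the line box computed from the word boxes. */
    static void emitLine(List<String> texts, List<Box> boxes, LayoutListener listener) {
        Box lineBox = null;
        for (int i = 0; i < texts.size(); i++) {
            Box b = boxes.get(i);
            listener.word(texts.get(i), b);
            lineBox = (lineBox == null) ? b : lineBox.union(b);
        }
        if (lineBox != null) {
            listener.line(lineBox);
        }
    }

    public static void main(String[] args) {
        // Word boxes taken from the example in the issue description.
        emitLine(
                Arrays.asList("Your", "Bank"),
                Arrays.asList(new Box(333.0, 270.15, 342.8965, 275.15),
                              new Box(347.0475, 270.15, 360.004, 275.15)),
                new LayoutListener() {
                    @Override
                    public void word(String text, Box box) {
                        System.out.println("word " + text + " " + box);
                    }

                    @Override
                    public void line(Box box) {
                        System.out.println("line " + box);
                    }
                });
    }
}
```

The same union step could be repeated one level up to derive block boxes from line boxes.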