[
https://issues.apache.org/jira/browse/PDFBOX-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Hewson closed PDFBOX-486.
------------------------------
Resolution: Won't Fix
A "word" is quite hard to define, especially across the many languages
supported by Unicode: words are not necessarily separated by whitespace.
In the latest trunk, the position of each glyph is now easily accessible to
subclasses of PDFStreamEngine, so the task of finding words or other
language-specific constructs is best left to individual end users.
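
As a rough illustration of what such end-user code might look like, the
sketch below groups glyphs on one line into words once their positions are
known. The Glyph and Word classes and the gapFactor parameter are
hypothetical stand-ins, not PDFBox types. In practice the glyph data would be
collected by a subclass of PDFStreamEngine. The whitespace-and-gap heuristic
is an assumption that only suits space-delimited scripts, which is exactly
why it is not built into the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordGrouper {
    /** Hypothetical glyph: one character and its horizontal extent on a line. */
    static final class Glyph {
        final String text; final double x; final double width;
        Glyph(String text, double x, double width) {
            this.text = text; this.x = x; this.width = width;
        }
    }

    /** Hypothetical result: a word and the horizontal span it covers. */
    static final class Word {
        final String text; final double x; final double width;
        Word(String text, double x, double width) {
            this.text = text; this.x = x; this.width = width;
        }
    }

    /**
     * Groups glyphs (same line, sorted left to right) into words. A word ends
     * at a whitespace glyph, or when the gap to the previous glyph exceeds
     * gapFactor times the previous glyph's width.
     */
    static List<Word> group(List<Glyph> glyphs, double gapFactor) {
        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double startX = 0, endX = 0;
        Glyph prev = null;
        for (Glyph g : glyphs) {
            boolean isSpace = g.text.trim().isEmpty();
            boolean bigGap = prev != null
                    && (g.x - (prev.x + prev.width)) > gapFactor * prev.width;
            // Flush the word collected so far at a space or a large gap.
            if ((isSpace || bigGap) && current.length() > 0) {
                words.add(new Word(current.toString(), startX, endX - startX));
                current.setLength(0);
            }
            if (!isSpace) {
                if (current.length() == 0) {
                    startX = g.x; // first glyph of a new word
                }
                current.append(g.text);
                endX = g.x + g.width;
            }
            prev = g;
        }
        if (current.length() > 0) {
            words.add(new Word(current.toString(), startX, endX - startX));
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("H", 0, 5), new Glyph("i", 5, 2), new Glyph(" ", 7, 3),
                new Glyph("y", 10, 5), new Glyph("o", 15, 5), new Glyph("u", 30, 5));
        for (Word w : group(line, 0.5)) {
            System.out.println(w.text + " x=" + w.x + " width=" + w.width);
        }
    }
}
```

A script without inter-word spacing would need a dictionary or a
language-specific segmenter in place of the gap test.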
> Position of each individual word
> --------------------------------
>
> Key: PDFBOX-486
> URL: https://issues.apache.org/jira/browse/PDFBOX-486
> Project: PDFBox
> Issue Type: Wish
> Components: Text extraction, Utilities
> Affects Versions: 0.8.0-incubator
> Reporter: matija kancijan
>
> Is it possible to extract the position of each word from a PDF?
> Something similar to the PDFHighlighter class, where the output is an
> XML file with the page and positions of each word.
> With this option you could mark a whole article and, in addition,
> produce your own XML file to select it in the PDF file.
> If this could also be combined with the PDFText2HTML class,
> you would have the structure of the original PDF file and the position
> of each word, so the selection of articles would be much easier.
> This could be useful with bookmarks too.
> (I am new to PDFBox, so if someone can point me in the right
> direction I would gladly do this... ;) )