On 29.09.2016 at 15:08, win harrington wrote:
I would like to extract all the lists of bullet points from a PDF file and put
them into an XML format.
The items are indented. I want the text and the indentation level.
The input is like this:
    - abc
    - def
- xyz
    - ghi
- 123
    - 456


Can I convert that to:
abc
def
   xyz
   ghi
      123
      456
The last step will be to add tags. I have code to do this:
<abc></abc>
<def></def>
    <xyz></xyz>
    <ghi></ghi>
        <123></123>
        <456></456>

This sounds like an ordinary Java question, i.e. parsing some text. PDFBox does have some rudimentary paragraph detection, but I don't know whether it works. Try the PDFText2HTML tool in the source download.
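
For the text-parsing part, here is a minimal sketch in plain Java. It assumes the text has already been extracted from the PDF as lines with their leading whitespace intact (which may not hold for every extractor) and that each bullet looks like "- text". The class name BulletIndent and its methods are made up for illustration and are not part of PDFBox. The nesting level is derived from the indentation relative to the enclosing bullets, so the exact indent width does not matter; a tab counts as one character here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
final class BulletIndent {

    /** One bullet: its text and its nesting level (0 = outermost). */
    static final class Item {
        final int level;
        final String text;

        Item(int level, String text) {
            this.level = level;
            this.text = text;
        }
    }

    /** Turns "- text" lines into items; lines that are not bullets are skipped. */
    static List<Item> parse(List<String> lines) {
        List<Item> items = new ArrayList<>();
        // Indent widths of the bullets that currently enclose the line being read.
        Deque<Integer> indents = new ArrayDeque<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("- ")) {
                continue; // assumption: every bullet starts with "- "
            }
            // Count leading whitespace characters.
            int indent = 0;
            while (Character.isWhitespace(line.charAt(indent))) {
                indent++;
            }
            // Drop enclosing levels that are indented as much as or more than this line.
            while (!indents.isEmpty() && indents.peek() >= indent) {
                indents.pop();
            }
            items.add(new Item(indents.size(), trimmed.substring(2).trim()));
            indents.push(indent);
        }
        return items;
    }

    /** Writes one item per line, indented by level * spacesPerLevel spaces. */
    static String toIndentedText(List<Item> items, int spacesPerLevel) {
        StringBuilder out = new StringBuilder();
        for (Item item : items) {
            for (int i = 0; i < item.level * spacesPerLevel; i++) {
                out.append(' ');
            }
            out.append(item.text).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "- abc", "- def", "    - xyz", "    - ghi", "        - 123", "        - 456");
        System.out.print(toIndentedText(parse(lines), 3));
    }
}
```

Once each item carries its level, wrapping the text in tags is a small change to toIndentedText. Note that names such as 123 are not valid XML element names, because an element name cannot start with a digit, so the text may be better placed in an attribute or as element content.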
