On 29.09.2016 at 15:08, win harrington wrote:
I would like to extract all the lists of bullet points from a PDF file and put them into an XML format. The items are indented. I want the text and the indentation level. The input is like this: - abc - def - xyz - ghi - 123 - 456 Can I convert that to: abc def xyz ghi 123 456 The last step will be to add tags. I have code to do this: <abc></abc> <def></def> <xyz></xyz> <ghi></ghi> <123></123> <456></456>
This sounds like an ordinary Java question, i.e. parsing some text. PDFBox does have some rudimentary paragraph detection, but I don't know if it works. Try the PDFText2HTML tool in the source download.
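
For the plain Java part, here is a minimal sketch of one way the parsing could go. It assumes the text has already been extracted from the PDF with the leading whitespace of each bullet line preserved, which PDF text extraction does not guarantee. The class name BulletListToXml and the <list>/<item text="..."> layout are made up for illustration. The labels go into an attribute rather than into the tag name because names such as 123 are not legal XML element names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BulletListToXml {

    /** Converts lines such as "  - abc" into nested <item> elements. */
    public static String convert(String text) {
        StringBuilder xml = new StringBuilder("<list>\n");
        // Indentation widths of the items that are currently open.
        Deque<Integer> open = new ArrayDeque<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("-")) {
                continue; // not a bullet line
            }
            // Number of whitespace characters in front of the bullet.
            int indent = line.indexOf('-');
            String label = trimmed.substring(1).trim();
            // Close every open item at the same or a deeper indentation.
            while (!open.isEmpty() && open.peek() >= indent) {
                open.pop();
                xml.append(pad(open.size() + 1)).append("</item>\n");
            }
            xml.append(pad(open.size() + 1))
               .append("<item text=\"").append(escape(label)).append("\">\n");
            open.push(indent);
        }
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            open.pop();
            xml.append(pad(open.size() + 1)).append("</item>\n");
        }
        return xml.append("</list>\n").toString();
    }

    /** Two spaces of output indentation per nesting depth. */
    private static String pad(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        return sb.toString();
    }

    /** Escapes the characters that are special inside an attribute value. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String input = "- abc\n  - def\n    - xyz\n- ghi\n  - 123\n";
        System.out.print(convert(input));
    }
}
```

The stack holds the indentation width of each open item, so the nesting depth follows from comparing widths and does not depend on a fixed number of spaces per level.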

