Hi Marc,

text and image extraction is one of the regular use cases. Keeping the formatting is also possible, but PDF and word processors are built on different concepts. For example, what is a paragraph in a word processor might be individually placed characters (glyphs) in a PDF file. You might want to look into PDFStreamEngine and its subclasses to see how to process the graphics and text information of a PDF.
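To make the glyph point concrete, here is a minimal sketch of regrouping individually placed glyphs into lines by their baseline. `Glyph` and `GlyphLines` are hypothetical names invented for this illustration, not PDFBox classes, and the sketch assumes unrotated text in default PDF user space (origin at the bottom-left, so a larger y is higher on the page). It ignores fonts, word spacing and columns, which a real extractor has to handle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: Glyph and GlyphLines are not PDFBox classes.
final class GlyphLines {

    /** One positioned character, as a PDF content stream might place it. */
    static final class Glyph {
        final String text;
        final float x; // horizontal position
        final float y; // baseline position (larger y = higher on the page)

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs into lines, top of page first. A glyph joins the current
     * line if its baseline is within 'tolerance' of that line's first glyph.
     */
    static List<String> toLines(List<Glyph> glyphs, float tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort((a, b) -> Float.compare(b.y, a.y)); // highest baseline first

        List<String> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : sorted) {
            if (!current.isEmpty() && Math.abs(current.get(0).y - g.y) > tolerance) {
                lines.add(join(current));
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(join(current));
        }
        return lines;
    }

    /** Orders the glyphs of one line left to right and concatenates them. */
    private static String join(List<Glyph> line) {
        line.sort((a, b) -> Float.compare(a.x, b.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Glyphs arrive in arbitrary order, as they may in a content stream.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("i", 20f, 700f),
                new Glyph("H", 10f, 700.2f),
                new Glyph("k", 20f, 680f),
                new Glyph("o", 10f, 680f));
        for (String line : toLines(glyphs, 1f)) {
            System.out.println(line);
        }
    }
}
```

Even this toy version needs a tolerance for slightly shifted baselines, which hints at why recovering paragraphs and formatting from a PDF is the hard part.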
Another sample is PDF2SVG, which uses PDFBox [https://bitbucket.org/petermr/pdf2svg/wiki/Home]

BR
Maruan

On 10.10.2014 at 14:36, Marc Davis <[email protected]> wrote:

> Maruan,
>
> We’ve been thinking of using PDFBox as a PDF to Doc/x converter - is this
> tool ready for prime-time since the MS formats are such a pain to work with?
> I would appreciate your thoughts.
>
> Essentially, our objective is to extract text and images while retaining some
> basic formatting. I think the challenge is in the latter.
>
> Thanks,
> Marc
>
> On Oct 10, 2014, at 4:43 AM, Maruan Sahyoun <[email protected]> wrote:
>
>> Hi Jan,
>>
>> choosing the right technology is very important so I do understand your
>> concerns. I had to make such a decision about using PDFBox in the past too.
>>
>> If you have specific issues I can answer I’m happy to try to do so. As a
>> general statement PDFBox is used in production environments today (as an
>> example we ourselves are using it for a banking customer to process account
>> statements, an airline company to preprocess archiving documents and various
>> other customers).
>>
>> PDFBox is continuously enhancing the parsing as we try to deal with real
>> world PDF files which are not always in line with the PDF specification.
>> Currently the best approach is to use PDDocument.loadNonSeq (which parses
>> documents according to the Xref information) and in case of an exception
>> PDDocument.load (which parses sequentially). The Apache Tika project, which
>> uses PDFBox for parsing PDFs, is running the parsing and text extraction
>> against 50k PDFs being made available via http://digitalcorpora.org
>>
>> What is the application you would like to be using PDFBox for? Text
>> extraction, image conversion, …? I might be able to give you more specific
>> information for your use case.
>>
>> BR
>>
>> Maruan
>>
>> On 10.10.2014 at 10:10, Vomlel Jan <[email protected]> wrote:
>>
>>> Thank you Maruan, this function loads the document.
>>>
>>> I have read https://pdfbox.apache.org/ideas.html "Replace/Enhance PDF
>>> parsing". I think correct parsing is very important, and I have some
>>> doubts about whether I can use PDFBox in production. Can you say something
>>> to reassure me :-).
>>>
>>> Jan
>>>
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:[email protected]]
>>> Sent: Friday, October 10, 2014 9:25 AM
>>> To: [email protected]
>>> Subject: Re: problem with pdf eof
>>>
>>> Hi
>>>
>>> you can try PDDocument.loadNonSeq(InputStream is, null)
>>>
>>> BR
>>>
>>> Maruan
>>>
>>> On 10.10.2014 at 09:09, Vomlel Jan <[email protected]> wrote:
>>>
>>>> Hello,
>>>> I use the PDFBox 1.8.7 PDDocument.load(InputStream is) method to parse
>>>> the PDF document in the attachment.
>>>> The method returns without an exception, but the document model is
>>>> incomplete.
>>>>
>>>> The problem is in the characters after EOF (offset 22939):
>>>> startxref
>>>> 22449
>>>> %%EOF
>>>> @
>>>> 16 0 obj
>>>> <<
>>>> /Type /Catalog
>>>>
>>>> PDFBox creates an internal IOException and ignores it with the comment:
>>>> /*
>>>>  * PDF files may have random data after the EOF marker. Ignore errors if
>>>>  * last object processed is EOF.
>>>>  */
>>>>
>>>> Is this PDF construction valid?
>>>> Which parser in PDFBox is correct? I tried ConformingPDFParser, but another
>>>> error occurred.
>>>>
>>>> Jan
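
The load strategy recommended in the thread (try PDDocument.loadNonSeq first and, in case of an exception, fall back to PDDocument.load) can be sketched as a small generic helper. `FallbackLoader` and its `Loader` interface are hypothetical names made up for this sketch and are not part of PDFBox. The demo uses plain strings so that it runs without PDFBox on the classpath; the comment in `main` shows how the two PDFBox 1.8.x calls named in the thread would plug in.

```java
import java.io.IOException;

// Hypothetical helper, not part of PDFBox: try a primary parser, then a fallback.
final class FallbackLoader {

    /** A parsing strategy that may fail with an IOException. */
    interface Loader<T> {
        T load() throws IOException;
    }

    /**
     * Returns the result of 'primary'; if it throws an IOException, returns
     * the result of 'fallback' instead. If both fail, the fallback's exception
     * is thrown with the primary failure attached as a suppressed exception.
     */
    static <T> T loadWithFallback(Loader<T> primary, Loader<T> fallback) throws IOException {
        try {
            return primary.load();
        } catch (IOException primaryFailure) {
            try {
                return fallback.load();
            } catch (IOException fallbackFailure) {
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // With PDFBox 1.8.x the two strategies would be, as named in the thread:
        //   Loader<PDDocument> xrefBased  = () -> PDDocument.loadNonSeq(file, null);
        //   Loader<PDDocument> sequential = () -> PDDocument.load(file);
        // Stand-ins are used here so the sketch runs without PDFBox.
        Loader<String> xrefBased = () -> {
            throw new IOException("xref table unusable");
        };
        Loader<String> sequential = () -> "parsed sequentially";

        System.out.println(loadWithFallback(xrefBased, sequential));
    }
}
```

When loading from an InputStream rather than a file, the fallback needs a fresh or buffered stream, since the first attempt may already have consumed it.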

