Hi Marc,

Text and image extraction is one of the regular use cases. Keeping the 
formatting is also possible, but the PDF format and text processors are built 
on different concepts. E.g. what is a paragraph within a text processor 
might be individually placed characters (glyphs) within a PDF file. You might 
want to look into PDFStreamEngine and its subclasses to see how to process 
graphics and text information of a PDF.
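
To make the glyph point concrete, here is a minimal sketch in plain Java. It is 
not PDFBox API: GlyphLines and Glyph are hypothetical names, and grouping glyphs 
by baseline and then ordering them by x is a simplifying assumption made for 
illustration only. It ignores word spacing, rotation and font metrics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): rebuild text lines from placed glyphs. */
final class GlyphLines {

    /** One positioned character, a stand-in for what a PDF content stream yields. */
    static final class Glyph {
        final double x;
        final double y;
        final String text;

        Glyph(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Groups glyphs whose baselines lie within yTolerance, top line first. */
    static List<String> toLines(List<Glyph> glyphs, double yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // PDF user space has y growing upwards, so the highest y is the top line.
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());

        List<String> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : sorted) {
            // Start a new line once the baseline moves away from the line's first glyph.
            if (!current.isEmpty() && Math.abs(current.get(0).y - g.y) > yTolerance) {
                lines.add(join(current));
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(join(current));
        }
        return lines;
    }

    /** Orders the glyphs of one line from left to right and concatenates them. */
    private static String join(List<Glyph> line) {
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

Anything like paragraphs, indentation or columns would have to be inferred on 
top of this from the gaps between glyphs and lines, which is where retaining 
formatting becomes hard.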

Another sample is PDF2SVG which uses PDFBox 
[https://bitbucket.org/petermr/pdf2svg/wiki/Home]

BR

Maruan

Am 10.10.2014 um 14:36 schrieb Marc Davis <[email protected]>:

> Maruan,
> 
> We’ve been thinking of using PDFBox as a PDF to Doc/x converter - is this 
> tool ready for prime time, since the MS formats are such a pain to work with?  
> I would appreciate your thoughts.
> 
> Essentially, our objective is to extract text and images while retaining some 
> basic formatting. I think the challenge is in the latter.
> 
> Thanks,
> Marc
> 
> 
> 
> On Oct 10, 2014, at 4:43 AM, Maruan Sahyoun <[email protected]> wrote:
> 
>> Hi Jan,
>> 
>> Choosing the right technology is very important, so I do understand your 
>> concerns. I had to make such a decision about using PDFBox in the past too. 
>> 
>> If you have specific issues I can answer I’m happy to try to do so. As a 
>> general statement PDFBox is used in production environments today (as an 
>> example we ourselves are using it for a banking customer to process account 
>> statements, an airline company to preprocess archiving documents and various 
>> other customers). 
>> 
>> PDFBox is continuously improving its parsing as we try to deal with real 
>> world PDF files, which are not always in line with the PDF specification. 
>> Currently the best approach is to use PDDocument.loadNonSeq (which parses 
>> documents according to the Xref information) and, in case of an exception, 
>> to fall back to PDDocument.load (which parses sequentially). The Apache Tika 
>> project, which uses PDFBox for parsing PDFs, is running the parsing and text 
>> extraction against 50k PDFs made available via http://digitalcorpora.org
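
The "try one parser, fall back to the other on an exception" approach described 
above can be sketched as follows. This is a generic illustration, not PDFBox 
code: FallbackLoader, Loader and loadWithFallback are hypothetical names, and 
the choice to attach the first failure as a suppressed exception is an 
assumption about what is useful for diagnostics.

```java
import java.io.IOException;

/** Hypothetical helper (not part of PDFBox): try one loader, fall back to another. */
final class FallbackLoader {

    /** A loading step that may fail with an IOException. */
    interface Loader<T> {
        T load() throws IOException;
    }

    /** Returns the primary result, or the fallback result if the primary throws. */
    static <T> T loadWithFallback(Loader<T> primary, Loader<T> fallback) throws IOException {
        try {
            return primary.load();
        } catch (IOException primaryFailure) {
            try {
                return fallback.load();
            } catch (IOException fallbackFailure) {
                // Keep the first error visible when both attempts fail.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

With PDFBox 1.8.x on the classpath, the two steps would presumably be 
() -> PDDocument.loadNonSeq(file, null) and () -> PDDocument.load(file) for a 
java.io.File. A File is assumed here rather than an InputStream because a 
stream that was partly consumed by the first attempt cannot simply be read 
again by the second.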
>> 
>> What is the application you would like to be using PDFBox for? Text 
>> Extraction, image conversion …. - I might be able to give you more specific 
>> information for your use case.
>> 
>> BR
>> 
>> Maruan
>> 
>> Am 10.10.2014 um 10:10 schrieb Vomlel Jan <[email protected]>:
>> 
>>> Thank you Maruan, this function loads the document.
>>> 
>>> I have read https://pdfbox.apache.org/ideas.html "Replace/Enhance PDF 
>>> parsing". I think correct parsing is very important, and I have some 
>>> doubts about whether I can use PDFBox in production. Can you say something 
>>> to reassure me :-).
>>> 
>>> Jan
>>> 
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:[email protected]] 
>>> Sent: Friday, October 10, 2014 9:25 AM
>>> To: [email protected]
>>> Subject: Re: problem with pdf eof
>>> 
>>> Hi 
>>> 
>>> you can try PDDocument.loadNonSeq(InputStream is, null) 
>>> 
>>> BR
>>> 
>>> Maruan
>>> 
>>> Am 10.10.2014 um 09:09 schrieb Vomlel Jan <[email protected]>:
>>> 
>>>> Hello,
>>>> I use the PDFBox 1.8.7 PDDocument.load(InputStream is) method to parse the 
>>>> PDF document in the attachment.
>>>> The method returns without an exception, but the document model is incomplete.
>>>> 
>>>> The problem is the characters after EOF (offset 22939):
>>>> startxref
>>>> 22449
>>>> %%EOF
>>>> @
>>>> 16 0 obj
>>>> << 
>>>> /Type /Catalog
>>>> 
>>>> PDFBox creates an internal IOException and ignores it, with this comment:
>>>>                  /*
>>>>                   * PDF files may have random data after the EOF marker. Ignore errors if
>>>>                   * last object processed is EOF.
>>>>                   */
>>>> 
>>>> Is this PDF construction valid?
>>>> Which parser in PDFBox is correct? I tried ConformingPDFParser, but another 
>>>> error occurred.
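
A file like the one described above can be spotted before parsing by checking 
for bytes after the last %%EOF marker. The following is a minimal sketch in 
plain Java, not PDFBox code: TrailingData and its methods are hypothetical 
names, and looking only at the last marker is an assumption (incrementally 
updated PDFs legitimately contain several %%EOF markers).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of PDFBox): detect data after the last %%EOF marker. */
final class TrailingData {

    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset of the last %%EOF marker, or -1 if there is none. */
    static int lastEofOffset(byte[] pdf) {
        for (int i = pdf.length - EOF_MARKER.length; i >= 0; i--) {
            boolean match = true;
            for (int j = 0; j < EOF_MARKER.length; j++) {
                if (pdf[i + j] != EOF_MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** Counts non-whitespace bytes after the last %%EOF marker, or -1 if there is no marker. */
    static int trailingByteCount(byte[] pdf) {
        int eof = lastEofOffset(pdf);
        if (eof < 0) {
            return -1;
        }
        int count = 0;
        for (int i = eof + EOF_MARKER.length; i < pdf.length; i++) {
            byte b = pdf[i];
            // PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
            boolean whitespace = b == 0 || b == '\t' || b == '\n'
                    || b == '\f' || b == '\r' || b == ' ';
            if (!whitespace) {
                count++;
            }
        }
        return count;
    }
}
```

A result greater than zero only says that something follows the marker. It does 
not say whether that data is garbage or, as in the excerpt above, object 
definitions the document actually needs.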
>>>> 
>>>> Jan
>>> 
>> 
> 
