[iText-questions] RE: [PDFBox-user] Good reading/resarch on PDF text extraction

Richard Braman Tue, 21 Feb 2006 12:36:07 -0800

>Are you saying you want to head this type of project up and are looking
>for help or are you requesting this functionality be added to existing
>projects?


I am requesting this functionality be added to existing projects.  I am
saying I am available to code, discuss, document, test, support, or
otherwise do whatever else I can do to get some good technology in the
public domain in this area.

>Certainly if Christian Leinberger has made some progress I would be
>willing to work with him to add some features to the PDFBox core.

Hopefully they will get back to us all.  I would like to see the
results.


I would also like to ask Ben, et al., whether PDFBox supports reading
"tagged" PDF, and if so, in which classes?  Example code would also be
helpful for me to learn more about tagging.



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Ben
Litchfield
Sent: Tuesday, February 21, 2006 2:27 PM
To: Richard Braman
Cc: [email protected]; [EMAIL PROTECTED];
[EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [PDFBox-user] Good reading/resarch on PDF text extraction



Richard,

Are you saying you want to head this type of project up and are looking
for help or are you requesting this functionality be added to existing
projects?

I have worked on a couple of different 'custom' text extraction projects
using PDFBox and need to organize those changes before I can commit them
to the PDFBox project.  Right now they are very specific/custom, so I
need to extract the generic parts and make them part of the core
PDFBox. I just need to find the time to do it.

Certainly if Christian Leinberger has made some progress I would be
willing to work with him to add some features to the PDFBox core.

I agree that this is important functionality and that it requires not
just simple text extraction but advanced AI concepts.

Ben



On Tue, 21 Feb 2006, Richard Braman wrote:

> In 2003, Tamir Hassan wrote an open source program
> (http://www.tamirhassan.com/) to extract text out of PDF tables and
> columns and put it into HTML as part of a university research
> project. His algorithms were actually quite sophisticated and well
> documented in http://www.tamirhassan.dsl.pipex.com/final.pdf.
>
> The results were actually quite impressive, as he managed to deal with
> columns, etc. using what he referred to as an "intelligent text
> extraction" algorithm, which uses positions to preserve text flow.  He
> used Jpedal as his underlying PDF library.
>
> Unfortunately his program was written with an old version of Jpedal
> and does not run with the new Jpedal.  This is due to the fact that
> the PDFGenericGrouping class he used was changed to
> PDFGroupingAlgorithms and moved to non-GPL Jpedal.  The new class also
> changed some of the old class's members from public to private, and
> deleted some members, which would make rewriting his app necessary.
>
> Fast forward to 2005: Christian Leinberger, a colleague of Tamir's,
> wrote a paper entitled "Ideas for extracting data from an unstructured
> document":
> http://www.chilisoftware.net/Private/Christian/ideas_for_extracting_data_from_unstructured_documents.pdf.
> Christian indicated that he is using PDFBox as his library for
> experimenting with algorithms that can be used to extract text
> reliably out of unstructured PDFs.
>
> I have contacted these guys and hopefully they will be willing to 
> share their developments with the PDF community.
>
> As more and more content gets "pushed" into PDF, it loses its meaning
> to anyone other than a human reader or a printer.  Machines do not
> have the ability to read and parse it reliably in a generic context,
> and it requires sophisticated AI algorithms based on ontologies, or
> other big words, to get it out.  If you're lucky, you can hack through
> it and get what you need.  Something to think about the next time you
> push content into a PDF, or even HTML.  PDF is a great way to present
> content for printing, but it sucks, pardon my French, as a primary
> mechanism for presenting data that may need to be used by a machine
> somewhere downstream.
>
> Getting it out has turned into big business for companies such as
> Texcel (www.texcel.com), Cambridge Docs, and others who have developed
> technology to get into the PDF and get important data out of it and
> into another format, usually XML.  This is a growing space, and I hope
> that there are some more developers interested in solving the problem
> created by PDF-crazy folks who have managed to shove valuable data
> into PDF while failing to maintain that same data in another, more
> usable format (e.g. XML).  It is best that this is done in an open
> format, because the value of such technology is very high, it is
> complicated to produce, and it is very useful to the general public.
>
> Richard Braman
> mailto:[EMAIL PROTECTED]
> 561.748.4002 (voice)
>
> http://www.taxcodesoftware.org
> Free Open Source Tax Software
>
>
>


-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log
files for problems?  Stop!  Download the new AJAX search engine that
makes searching your log files as easy as surfing the  web.  DOWNLOAD
SPLUNK!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
_______________________________________________
PDFBox-user mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/pdfbox-user



_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
