RE: FW: Good reading/research on PDF text extraction

Richard Braman Sun, 26 Feb 2006 15:49:51 -0800

Rakesh,
What development has been done so far to enable Nutch to parse PDFs?
Have you read through Tamir's white paper?
Rich
 
 
 
PS. Here are some comments from Ben Litchfield, developer of the open source
PDFBox (Java), followed by some comments from Tamir, who wrote the PDF
extraction algorithm:
Richard,

Are you saying you want to head this type of project up and are looking
for help or are you requesting this functionality be added to existing
projects?

I have worked on a couple different 'custom' text extraction projects
using PDFBox and need to organize those changes before I can commit them
to the PDFBox project. Right now they are very specific/custom so I need
to extract the generic parts out and make them part of the core PDFBox.
Just need to find the time to do it.

Certainly if Christian Leinberger has made some progress I would be
willing to work with him to add some features to the PDFBox core.

I agree that this is important functionality and requires not just
simple text extraction but advanced AI concepts.

Ben

My response:

I am requesting this functionality be added to existing projects. I am
saying I am available to code, discuss, document, test, support, or
otherwise do whatever else I can do to get some good technology in the
public domain in this area.

> Certainly if Christian Leinberger has made some progress I would be
> willing to work with him to add some features to the PDFBox core.

Hopefully they will get back to us all. I would like to see the results.

I would also like to ask Ben et al. whether PDFBox supports reading
"tagged" PDF, and if so, in which classes?

-----Original Message-----
From: Tamir Hassan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 23, 2006 5:44 AM
To: [EMAIL PROTECTED]
Subject: Re: Do you still answer this email

Dear Richard,

Thanks for your email.

My current situation is that I am working on a project that has a
commercial partner, who provides part of the funding. This is on the
understanding that my code and developments will eventually be
integrated with their existing commercial, non-open-source software.

So, because of this, it is not up to me to decide whether I can share
some of my developments with the rest of the PDFBox community under a
compatible licence. I did speak to one of my supervisors today, and he
did not rule out the possibility, but this would also have to be OK'd
by several more senior members of my department.

I do believe that sharing some of my progress with the community could
be mutually beneficial. Therefore, I will make a proposal to the people
in charge of the project, and I will let you know the outcome. This
might, however, take some time.

I will keep you updated.

Best regards,

Tamir

Richard Braman wrote:

> I read your final report, as well as Christian's report on converting
> PDF to XML. I am actually quite interested in these developments, and
> would be glad to contribute time to any projects you guys are
> undertaking. I am working on a parallel effort to convert government
> documents into structured XML. I am very interested in the technology,
> and you guys seem to have created some sophisticated content
> extraction algorithms to deal with columns, tables, etc.

> 

> Have a look at the attached PDF. It contains columns, and text full of
> valuable information, formatted in a very unstructured way. I tried
> to run it through your code, but the file is compressed using Flate,
> and the old JPedal couldn't understand the compression used. I tried
> running your code on the new JPedal, but the interfaces and classes
> have changed around greatly. The developer in fact moved the
> GenericGrouping class into his non-GPL enterprise lib, and changed the
> name of the class, as well as the return types. He also changed some
> of the class members from public to private, and deleted others. All
> in all, your code would have to be entirely rewritten to use with the
> current JPedal, which is a shame.

> 

> Anyway, it seems like you are focusing on PDFBox, which has a better
> license and developers committed to open source, unlike JPedal, which
> now keeps only some stuff under the GPL; everything that is seemingly
> useful is now in the enterprise library. Are you able to share your
> developments?
