One as-yet-undocumented iDEA lab project at Edinburgh is to generate topic 
indexes for browsing relatively large collections of academic papers 
(currently several thousand, with plans for 10x to 100x that). 

(See http://homepages.inf.ed.ac.uk/mfourman/research/topics/uoe.xml for an 
early test example. Best viewed with a WebKit browser [Safari, Chrome], but 
also with latest Firefox [with some UI features missing].)

We're mining online pdf texts, and find that around one third of the pdfs that 
academics at Edinburgh publish online don't easily yield text.

I have slightly different needs from someone wanting a text version for 
annotation (I just need a bag of words). I'm resorting to OCR, using a 
combination of convert (ImageMagick), tesseract 
(code.google.com/p/tesseract-ocr/), aspell, and a stemmer to produce the bag of 
words I need.
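
A minimal sketch of that kind of pipeline, in Python, might look like the
following. It is an illustration under assumptions rather than the actual
scripts: the function names, the 300 dpi density, the English aspell
dictionary, and the toy suffix-stripping stemmer (standing in for whichever
real stemmer is used) are all hypothetical. It assumes the convert, tesseract
and aspell command-line tools are installed and on the PATH.

```python
import re
import subprocess
import tempfile
from collections import Counter
from pathlib import Path


def ocr_pdf(pdf_path, density=300):
    """Rasterise a PDF with ImageMagick, then OCR each page with tesseract."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        # One TIFF per page; the density value is an assumed setting.
        subprocess.run(
            ["convert", "-density", str(density), str(pdf_path),
             str(Path(tmp) / "page-%03d.tif")],
            check=True,
        )
        for tif in sorted(Path(tmp).glob("page-*.tif")):
            base = tif.with_suffix("")  # tesseract appends ".txt" itself
            subprocess.run(["tesseract", str(tif), str(base)], check=True)
            pages.append(Path(str(base) + ".txt").read_text(errors="ignore"))
    return "\n".join(pages)


def misspelled_words(words):
    """Return the words that `aspell list` reports as not in its dictionary."""
    result = subprocess.run(
        ["aspell", "--lang=en", "list"],
        input="\n".join(words), capture_output=True, text=True, check=True,
    )
    return set(result.stdout.split())


def crude_stem(word):
    """Toy stemmer (hypothetical stand-in): strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, reject=frozenset()):
    """Count stemmed lowercase words, skipping short and rejected tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(
        crude_stem(w) for w in words if len(w) > 2 and w not in reject
    )


def pdf_to_bag(pdf_path):
    """Full pipeline: OCR, drop tokens aspell rejects, stem, count."""
    text = ocr_pdf(pdf_path)
    candidates = set(re.findall(r"[a-z]+", text.lower()))
    return bag_of_words(text, reject=misspelled_words(candidates))
```

Calling pdf_to_bag("paper.pdf") (a hypothetical file name) would return a
Counter of stems. Using aspell only as a filter means OCR noise is discarded
rather than corrected, which is acceptable when all that is needed is a bag
of words.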
 
The ocropus project, which also builds on tesseract, may be closer to what you 
want. (code.google.com/p/ocropus/)

VelOCRaptor (http://blog.velocraptor.com/) provides an OSX tool (not open, but 
based on ocropus) for using ocr to add searchable text to pdfs.

It would be good to establish an open version of something similar, together 
with tools for manual correction and for learning from those corrections to 
improve the automation. I plan to propose an MSc project along these lines.

With best wishes for the New Year,

Michael

On 1 Jan 2010, at 12:00, [email protected] wrote:

> On Fri, Dec 4, 2009 at 9:44 AM, Philippe Aigrain
> <[email protected]> wrote:
>> Does not fit your immediate needs of annotating PDF, but in our new version
>> of the co-ment annotation system, we took a strong orientation of using
>> simple structured text formats such as markdown. For PDFs containing text,
>> it is relatively easy to go PDF to markdown. Of course for PDF containing
>> images of texts, this is another story.
>> 
>> See www.co-ment.net for existing co-ment
>> www.co-ment.org for future version
> 

Professor Michael Fourman FBCS CITP
Director, iDEA lab
Informatics Forum
10 Crichton Street
Edinburgh
EH8 9AB 
http://idea.ed.ac.uk/
For diary appointments contact :
mdunlop2(at)ed-dot-ac-dot-uk
+44 131 650 2690

The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.


_______________________________________________
okfn-discuss mailing list
[email protected]
http://lists.okfn.org/mailman/listinfo/okfn-discuss
