One as-yet-undocumented iDEA lab project at Edinburgh is to generate topic indexes for browsing relatively large collections of academic papers (currently several thousand, with plans for 10x to 100x that).
(See http://homepages.inf.ed.ac.uk/mfourman/research/topics/uoe.xml for an early test example. It is best viewed with a WebKit browser [Safari, Chrome]; it also works in the latest Firefox, with some UI features missing.)

We're mining online PDF texts, and find that around one third of the PDFs that academics at Edinburgh publish online don't easily yield text. My needs differ slightly from those of someone wanting a text version for annotation (I just need a bag of words). I'm resorting to OCR, using a combination of convert (ImageMagick), tesseract (code.google.com/p/tesseract-ocr/), aspell, and a stemmer to produce the bag of words I need.

The ocropus project (code.google.com/p/ocropus/), which also builds on tesseract, may be closer to what you want. VelOCRaptor (http://blog.velocraptor.com/) provides an OS X tool (not open, but based on ocropus) that uses OCR to add searchable text to PDFs.

It would be good to establish an open version of something similar, together with tools for manual correction and for learning from manual corrections to improve automation. I plan to propose an MSc project along these lines.

With best wishes for the New Year,

Michael

On 1 Jan 2010, at 12:00, [email protected] wrote:

> On Fri, Dec 4, 2009 at 9:44 AM, Philippe Aigrain
> <[email protected]> wrote:
>> Does not fit your immediate needs of annotating PDF, but in our new version
>> of the co-ment annotation system, we took a strong orientation of using
>> simple structured text formats such as markdown. For PDFs containing text,
>> it is relatively easy to go PDF to markdown. Of course for PDF containing
>> images of texts, this is another story.
>>
>> See www.co-ment.net for existing co-ment
>> www.co-ment.org for future version
>

Professor Michael Fourman FBCS CITP
Director, iDEA lab
Informatics Forum
10 Crichton Street
Edinburgh EH8 9AB
http://idea.ed.ac.uk/

For diary appointments contact: mdunlop2(at)ed-dot-ac-dot-uk
+44 131 650 2690

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

_______________________________________________
okfn-discuss mailing list
[email protected]
http://lists.okfn.org/mailman/listinfo/okfn-discuss
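
For anyone curious about the last stage of the pipeline described in the message above (spell-check filtering, stemming, bag of words), here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the actual iDEA lab code: the small KNOWN_WORDS set is a hypothetical stand-in for an aspell dictionary lookup, naive_stem is a crude suffix stripper standing in for a real stemmer, and the input is assumed to be text that tesseract has already produced.

```python
import re
from collections import Counter

# Hypothetical stand-in for an aspell dictionary lookup (assumption:
# a real pipeline would ask aspell whether each token is a known word).
KNOWN_WORDS = {
    "topic", "topics", "index", "indexes",
    "paper", "papers", "browsing", "collections",
}


def naive_stem(word):
    """Crude suffix stripping; a placeholder for a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three letters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(ocr_text, known_words=KNOWN_WORDS):
    """Count stemmed dictionary words in OCR output, dropping OCR noise."""
    # Lowercase and keep runs of letters; digits and punctuation split tokens.
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    # Discard anything the dictionary does not recognise (likely OCR errors).
    kept = (t for t in tokens if t in known_words)
    return Counter(naive_stem(t) for t in kept)
```

Filtering against the dictionary before stemming means misrecognised strings such as "lndex" are dropped rather than counted, which is tolerable when only a bag of words is needed and not a faithful transcription.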
