Hi Daniel,

we do it a bit differently, and we also have a lot of OCR documents. The PDF format allows you to create a two-layer PDF: the top layer holds the scanned page as an image (this is what users see), and the layer below can hold the OCR text recognized from that image. This approach has several advantages. At the very least, the pdf.txt files are created by DSpace itself and you do not need to make any changes by hand.
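
To make the two-layer idea concrete, here is a minimal Python sketch of what the page content stream of such a PDF might look like. It is only an illustration under stated assumptions, not what FineReader or InftyReader actually emit: the function names, the resource names `Im0` and `F1`, and the sample coordinates are all hypothetical. It uses one common way of hiding the text, PDF text rendering mode 3 (neither fill nor stroke), so the OCR text stays invisible while remaining extractable. It handles plain ASCII text only and produces just the content stream, not a complete PDF file.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def two_layer_content_stream(page_width, page_height, words,
                             image_name="Im0", font_name="F1"):
    """Return content-stream operators: scanned image plus invisible OCR text.

    words: iterable of (text, x, y, font_size) in PDF points,
    with the origin at the bottom-left corner of the page.
    image_name and font_name are hypothetical resource names.
    """
    ops = [
        # Visible layer: scale the unit-square image to the page and paint it.
        "q",
        f"{page_width} 0 0 {page_height} 0 0 cm",
        f"/{image_name} Do",
        "Q",
        # Hidden layer: rendering mode 3 means neither fill nor stroke.
        "BT",
        "3 Tr",
    ]
    for text, x, y, size in words:
        ops.append(f"/{font_name} {size} Tf")
        # Tm sets an absolute position, so each word is placed independently.
        ops.append(f"1 0 0 1 {x} {y} Tm")
        ops.append(f"({escape_pdf_string(text)}) Tj")
    ops.append("ET")
    return "\n".join(ops)


# Example: an A4-sized page (595 x 842 points) with two recognized words.
stream = two_layer_content_stream(
    595, 842, [("Hello", 72, 770, 12), ("(OCR)", 110, 770, 12)]
)
print(stream)
```

Because the text sits in the PDF itself, a text extractor reads it like any other PDF text, which is why no separate hand-edited text file is needed.
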
I think we use two tools to create such PDFs: FineReader and InftyReader. However, this is not my part of the project, so I am not sure whether both are necessary or what the workflow is. If you are interested in more details, let me know and I will point you to the right people :-).

Have a nice day

Vlastik

----------------------------------------------------------------------------
Vlastimil Krejčíř
Library and Information Centre, Institute of Computer Science
Masaryk University, Brno, Czech Republic
Email: krejcir (at) ics (dot) muni (dot) cz
Phone: +420 549 49 3872
ICQ: 163963217
Jabber: [email protected]
----------------------------------------------------------------------------

On Thu, 7 Mar 2013, Daniel Sifton wrote:

> Hi folks,
>
> We’ve uploaded a limited amount of OCR pdf documents. Were we to edit the
> OCR bitstream (.pdf.text) does anyone have any advice on how to go about
> getting out the bitstream and then getting it back in? Or perhaps I’m coming
> at this from the wrong angle?
>
> Thanks,
>
> Dan

_______________________________________________
Dspace-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-general
