[CODE4LIB] OCR PDFs
I wonder if any of you have experience creating text PDFs from TIFFs. I've been using tiffcp to stitch TIFFs together into a single TIFF and then using tiff2pdf to generate a PDF from that single TIFF. I've had to pass this image-based PDF to someone with Acrobat, who uses its batch processing facility to OCR the text and save a text-based PDF. I wonder if anyone has suggestions for software I can integrate into the script (Python on Linux) I'm using. Thanks, James -- James Tuttle Digital Repository Librarian NCSU Libraries, Box 7111 North Carolina State University Raleigh, NC 27695-7111 [EMAIL PROTECTED] (919)513-0651 Phone (919)515-3031 Fax
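The workflow described above (tiffcp, then tiff2pdf) might look roughly like this when driven from Python. This is only a sketch: the function names and file names are made up for illustration, and it assumes the libtiff tools tiffcp and tiff2pdf are on the PATH.

```python
import subprocess
from pathlib import Path


def tiffcp_command(pages, combined_tiff):
    """Build the tiffcp call that concatenates page TIFFs into one file."""
    return ["tiffcp", *map(str, pages), str(combined_tiff)]


def tiff2pdf_command(combined_tiff, pdf_path):
    """Build the tiff2pdf call that wraps the combined TIFF in a PDF."""
    return ["tiff2pdf", "-o", str(pdf_path), str(combined_tiff)]


def tiffs_to_image_pdf(pages, pdf_path):
    """Run both steps in order; raises CalledProcessError if a tool fails."""
    combined = Path(pdf_path).with_suffix(".tif")  # hypothetical temp name
    subprocess.run(tiffcp_command(pages, combined), check=True)
    subprocess.run(tiff2pdf_command(combined, pdf_path), check=True)
    return Path(pdf_path)
```

The result is still an image-only PDF; the OCR step is the part that is missing from the script.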
Re: [CODE4LIB] OCR PDFs
You might want to look at ABBYY FineReader 9.0 Professional, which can be driven from the command line. FineReader is used at the Library of Congress. Here is an info link to get you started (search command): http://www.scanstore.com/Scanning/Document_Imaging/Software/OCR_Software/Nuance/omnipage_review.asp Regards, Terry Terry Harrison Project Manager CACI 5505 Robin Hood Road, Suite F Norfolk, Va. 23508 Ph: 757.321.9120 x232 Fax: 757.321.8797 [EMAIL PROTECTED]
Re: [CODE4LIB] OCR PDFs
If you haven't already, take a look at tesseract (http://code.google.com/p/tesseract-ocr/). There's some discussion of using tesseract and shell scripting to work with tiffs to pdfs to ocr'd text, which isn't exactly what you're wanting to do, I know, but may prove helpful (http://www.groklaw.net/articlebasic.php?story=20061210115516438). Cheers! Bridger Dyson-Smith (In reply to Terry Harrison, Fri, Oct 17, 2008 at 8:28 AM.)
Re: [CODE4LIB] OCR PDFs
This is somewhat off-topic, since you asked for something you can use on Linux. In any case... I've been using OmniPage 16, and I'm sorry to say I can't recommend it. You can't run it from the command line, so you can't really integrate it into a script. It does have a batch manager, so you can set it to do whole folders at a time. Just make sure your folder's not too large; it crashes fairly reliably after about 10-40 pages. If you do use OmniPage to make your PDFs, I've found that it works best to convert a single TIFF into a single-page PDF, then use pdftk[1] (along with a [language of your choice] script) to put those PDFs together however you want them. Have a nice day, Jonathan [1] http://www.accesspdf.com/pdftk/ -- Jonathan M. Brinley Metadata Digital Initiatives Developer Ball State University [EMAIL PROTECTED] http://xplus3.net/ (In reply to James Tuttle, Fri, Oct 17, 2008 at 7:56 AM.)
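The pdftk step mentioned above could be scripted along these lines. A minimal sketch, assuming pdftk is installed and that the single-page PDFs are passed in the order they should appear; the function names are hypothetical.

```python
import subprocess


def pdftk_cat_command(page_pdfs, output_pdf):
    """Build a pdftk call that concatenates PDFs in the given order."""
    return ["pdftk", *map(str, page_pdfs), "cat", "output", str(output_pdf)]


def join_pdfs(page_pdfs, output_pdf):
    """Run pdftk; raises CalledProcessError if pdftk exits with an error."""
    subprocess.run(pdftk_cat_command(page_pdfs, output_pdf), check=True)
```

Keeping the ordering decision in the caller (for example, sorting by file name) leaves room to assemble the pages "however you want them".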
[CODE4LIB] eXtensible Catalog - New Website
***Cross-posted; apologies for duplication*** The eXtensible Catalog Project is pleased to announce that we have launched our new website at http://www.extensiblecatalog.org/. This new website will be the main vehicle for distributing our open-source software once it is released in 2009. In the meantime, the website contains a wealth of information regarding the project, including publications, an overview of the software we are developing and the technologies that software will use, and a blog that has already been in use. The eXtensible Catalog (XC) Project is working to design and develop a set of open-source applications that will provide libraries with an alternative way to reveal their collections to library users. XC will provide easy access to all resources (both digital and physical collections) across a variety of databases, metadata schemas and standards, and will enable library content to be revealed through other services that libraries may already be using, such as content management systems and learning management systems. XC will also make library collections more web-accessible by revealing them through web search engines. Since XC software will be open source, it will be available for download at no cost. Libraries will be able to adopt, customize and extend the software to meet local needs. In addition, a not-for-profit organization will be formed to provide the infrastructure to incorporate community contributions to the code base, encourage collaboration, and provide maintenance and upgrades. The project is hosted at the University of Rochester and funded through a generous grant from the Andrew W. Mellon Foundation Scholarly Communications Program as well as through significant contributions from and in collaboration with XC partner institutions. The project is in a design and development phase until July 2009, at which point the software will be released under an open-source license.
Steven Dibelius Deployment Engineer, eXtensible Catalog Project University of Rochester [EMAIL PROTECTED]
Re: [CODE4LIB] registry of databases
Hello all, My name is Joanna White and I am the Product Manager for the WorldCat Registry. The WorldCat Registry is a directory of libraries and the services they provide. Through a secure web tool, libraries can manage and share information about their institutional identity and make institutional metadata available to both OCLC and non-OCLC services. Currently, the WorldCat Registry does not include the type of database information Stephen mentioned in his original message. However, we are always interested in the community's ideas and needs. We follow lists like [CODE4LIB] and you can also send ideas to our mailbox at registries at oclc dot org. You can follow the WorldCat Registry's developments via the OCLC Newsletter or on the DevNet Blog at http://worldcat.org/devnet/blog/ You can also learn more about our API offerings under WorldCat Registry Search, WorldCat Registry Detail and OpenURL Gateway here: http://www.worldcat.org/wcpa/content/affiliate/default.jsp. Thank you, Joanna White OCLC WorldCat Registry, http://worldcat.org/registry/institutions mailto: Whitej at oclc dot org -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Stephen Francoeur Sent: Thursday, October 16, 2008 2:49 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] registry of databases Despite my best efforts to save things to delicious that catch my eye, I can't seem to find an item that I know I read in the past two weeks. Someone mentioned an effort to create a registry of databases in which you could see what libraries had subscribed to which database. Is there such a project or is this a figment of my fevered imagination? I know I'm not thinking of how some libraries include databases in their catalogs, which then gets passed on to WorldCat if the library is an OCLC member. What I recall reading, though, may have made some reference to the WorldCat Registry (http://www.worldcat.org/registry/Institutions). Any help here?
Stephen Francoeur Information Services Librarian Newman Library Baruch College 151 E. 25th Street New York, NY 10010 http://www.retaggr.com/Card/stephenfrancoeur
Re: [CODE4LIB] Vote for NE code4lib meetup location
I joined the group just today, too late to vote, but what I see is 23 votes for Boston and 43 for anywhere else. Shouldn't there at least be a runoff? -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jay Luker Sent: Wednesday, October 15, 2008 4:48 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Vote for NE code4lib meetup location Sorry to leave you all in suspense all day. The results are in: 23 Boston, MA; 18 Northampton, MA; 14 Concord, NH; 11 Portland, ME. Michael Klein has said he will now check when a suitable space will be available at BPL. Then we'll update the WhenIsGood page and hope for some availability intersection goodness. --jay
[CODE4LIB] Job posting: Analyst Programmer Intermediate - Georgia State University Library
Vacancy Number: 0600774 Position Title: Analyst Programmer Intermediate Type of Position: Regular Staff Department: Library Duties: Reporting to the Web Development Librarian, the Analyst Programmer develops, maintains, and troubleshoots web-based applications in support of the University Library's goals. Responsibilities include scripting and programming applications developed in-house, customization and enhancement of open-source and vendor applications, working with vendor or open-source Application Programming Interfaces (APIs), and management of in-house databases. The position works with project stakeholders as needed to further develop or enhance application design for scheduled and prioritized projects. The Analyst Programmer works collaboratively with library Systems personnel to implement and configure web servers in support of web development activities, authentication technologies and server security. Minimum Qualifications: Bachelor's degree and two years of related experience; or a combination of education and experience. Preferred Qualifications: Bachelor's degree in Computer Science or a related field and three years of related experience. Working knowledge of programming/scripting web applications in languages such as PHP, Perl, and JavaScript. Experience working in a Linux/Unix environment and working with the Apache web server. Posting Date: 09-29-2008 Closing Date: Open Until Filled Special Instructions to Applicants: An application, resume and cover letter are required for consideration. An offer of employment will be conditional on background verification. Apply online: https://jobs.gsu.edu/applicants/jsp/shared/frameset/Frameset.jsp?time=1224255051703 Job posting link: http://www.library.gsu.edu/jobs/ Doug Goans Web Development Librarian Georgia State University Library 100 Decatur St. Atlanta, GA 30303 Tel: (404) 413-2772 Fax: (404) 651-4315
[CODE4LIB] FW: NAF notification service from OCLC
FYI: note below sent out to Karen Calhoun in the [EMAIL PROTECTED] = 'OCLC would be required to work with the Library of Congress as the producer of the NAF data before OCLC could create the NAF notification service' Greetings Karen, Per Roy's statement at the top, I have received several questions and forward them to you. NACO contributors from participating libraries produce (create or modify) most of the name authority records listed in the NAF updates. They do that during their work hours at their respective institutions, work hours paid for by those institutions. The Library of Congress evidently has a role in promulgating these records in the NAF updates. Given that: 1. Why do libraries interested in NAF updates have to pay for these updates? 2. Why isn't the work of NACO contributors recompensed by allowing them to access, at the least, a notification of the NAF updates to which they have contributed? 3. What is the role of OCLC in these processes? 4. Does OCLC pay for NAF? If not, could CODE4LIB obtain NAF and NAF updates on a similar basis? Kind thanks for your attention and forthcoming replies, Ya'aqov Ziso [EMAIL PROTECTED] Dear Ya'aqov Ziso, Your email request/proposal of 4 October 2008 to Roy Tennant (My proposal to you is that OCLC will start offering a NEW service to its members/subscribers. That service will be a simple listing of the 010 fields for Name authority records that have been CHANGED that week in the OCLC NAF, and 010 for the new Name authority records that have been ADDED to NAF.) has been referred by OCLC Research to the OCLC Metadata Services product group for consideration. We are pleased to receive your suggestion for a new service. We will add this suggestion to our list of potential new services and enhancements for consideration in our next round of planning for development in fiscal year 2010. Thank you for sharing your ideas with us.
Karen Karen Calhoun Vice President, WorldCat and Metadata Services 6565 Kilgour Place Dublin OH 43017 800-848-5878 x6441 614-764-6441 FAX: 614-718-7457 [EMAIL PROTECTED] -- End of Forwarded Message
Re: [CODE4LIB] OCR PDFs
And beyond Tesseract is Ocropus (http://code.google.com/p/ocropus/), which uses Tesseract (and eventually other ocr engines) to generate positional OCR in an HTML format. I wonder if you could process that HTML slightly to put the TIFF in the background, then use an HTML to PDF tool to generate your final PDF. Or something like that. Googling ocropus pdf finds a few projects and discussions that might be helpful. Peter (In reply to Bridger Dyson-Smith, Friday, October 17, 2008 6:56 AM.)
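As a rough illustration of what "processing that HTML" could start from, here is a sketch that pulls text and bounding boxes out of positional OCR markup. It assumes hOCR-style output in which each element of interest carries a class such as ocrx_word and a title attribute of the form "bbox x0 y0 x1 y1"; the markup a given Ocropus version actually emits may differ (for example, line-level ocr_line elements), so the class name is a parameter. The HocrBoxes name is made up for this example.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxes(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for elements with a given class."""

    def __init__(self, css_class="ocrx_word"):
        super().__init__()
        self.css_class = css_class
        self.boxes = []
        self._pending = None  # bbox waiting for its first text chunk

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.css_class in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            self._pending = (
                tuple(int(n) for n in match.groups()) if match else None
            )

    def handle_data(self, data):
        # Pair the first non-blank text after a matching tag with its bbox.
        if self._pending and data.strip():
            self.boxes.append((data.strip(), self._pending))
            self._pending = None
```

With the boxes in hand, each piece of text could be positioned over the page image by whatever HTML-to-PDF or PDF-writing tool is chosen. Note that with line-level elements containing nested spans, this sketch only captures the first text chunk of each line.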
[CODE4LIB] Fwd: Please disseminate - Release of Version 1.0 Production OAI Object Reuse and Exchange Specifications
Forwarded on behalf of Carl Lagoze and the OAI-ORE authoring team... Begin forwarded message: From: Carl Lagoze [EMAIL PROTECTED] Date: October 17, 2008 4:02:14 PM EDT To: Tim DiLauro [EMAIL PROTECTED] Subject: Please disseminate - Release of Version 1.0 Production OAI Object Reuse and Exchange Specifications (The full copy of this Press Release is at http://www.openarchives.org/documents/ore-production-press-release.pdf ) Over the past two years the Open Archives Initiative (OAI), in a project called Object Reuse and Exchange (OAI-ORE), has gathered international experts from the publishing, web, library, repository, and eScience communities to develop standards for the identification and description of aggregations of Web resources. These standards provide the foundation for applications and services that can visualize, preserve, transfer, summarize, and improve access to the aggregations that people use in their daily Web interaction, including multiple-page Web documents, multiple-format documents in institutional repositories, scholarly data sets, and online photo and music collections. The OAI-ORE standards leverage the core Web architecture and concepts emerging from related efforts including the semantic web, linked data, and Atom syndication. As a result, they integrate with the emerging machine-readable web, Web 2.0, and the future evolution of networked information. The production versions of the OAI-ORE specifications and implementation documents are now available to the public, with a table of contents page at http://www.openarchives.org/ore/toc. This public release is the culmination of several months of testing and review of initial alpha and beta releases. The participation and feedback from the wider OAI-ORE community, especially the OAI-ORE technical committee, was instrumental to the process leading up to this production release. The documents in the release describe a data model to introduce aggregations as resources with URIs on the web.
They also detail the machine-readable descriptions of aggregations expressed in the popular Atom syndication format, in RDF/XML, and RDFa. The documents included in the release are:
· ORE User Guide Documents
  o Primer
  o Resource Map Implementation in Atom
  o Resource Map Implementation in RDF/XML
  o Resource Map Implementation in RDFa
  o HTTP Implementation
  o Resource Map Discovery
· ORE Specification Documents
  o Abstract Data Model
  o Vocabulary
· Tools and Additional Resources
Carl Lagoze - Cornell University - [EMAIL PROTECTED] Herbert Van de Sompel - Los Alamos National Laboratory - [EMAIL PROTECTED]
Re: [CODE4LIB] FW: NAF notification service from OCLC
Ya'aqov, Why don't you consider contacting the NACO program at the Library of Congress? They would be better equipped to answer your questions. Mark Matienzo Applications Developer, Digital Experience Group The New York Public Library
Re: [CODE4LIB] eXtensible Catalog - New Website
Same for me on FF3. Also, the same error on IE 7 and Safari 3 for Windows. All browsers are identified as IE 6. Windows XP SP 2. --- David Cloutman [EMAIL PROTECTED] Electronic Services Librarian Marin County Free Library -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Mark A. Matienzo Sent: Friday, October 17, 2008 1:11 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] eXtensible Catalog - New Website I'm using Firefox 3 on OS X and the project's website is claiming I'm using IE 6 on Windows XP and thus not letting me access the site. Fix this, please? Mark Matienzo Applications Developer, Digital Experience Group The New York Public Library (In reply to Steven Dibelius's announcement, Fri, Oct 17, 2008 at 10:31 AM.) Email Disclaimer: http://www.co.marin.ca.us/nav/misc/EmailDisclaimer.cfm
Re: [CODE4LIB] eXtensible Catalog - New Website
I'm having the same problem with Safari 3.1.1 on OS X, which the site thinks is also IE 6 on Windows XP. I haven't encountered this problem in years! -- Brenda Chawner Senior Lecturer LIM Programmes Director School of Information Management Victoria University of Wellington P O Box 600, Wellington NEW ZEALAND (04) 463 5780 | fax (04) 463 5446 | Room EA201 | [EMAIL PROTECTED] (In reply to Mark A. Matienzo, Sat 18-Oct-08 9:11 AM.)
Re: [CODE4LIB] eXtensible Catalog - New Website
I used Internet Explorer 7 to go to this website, and I get the message: "You are using *Internet Explorer* version *6.0* on *Windows XP*" -Chris Alhambra (In reply to Mark A. Matienzo, Fri, Oct 17, 2008 at 4:11 PM.)
Re: [CODE4LIB] eXtensible Catalog - New Website
I'm running FF3 on Ubuntu. No dice. Tried it in Opera 9.x in Ubuntu. Still doesn't work. (In reply to Chris Alhambra, Fri, Oct 17, 2008 at 4:17 PM.)
Re: [CODE4LIB] eXtensible Catalog - New Website
The site was working fine earlier, as I was able to view it with Opera (now, of course, I've the same problems). For the time being, this should get you there: http://www.extensiblecatalog.org/node/59 (In reply to Chris Alhambra, Friday, October 17, 2008 4:18 PM.)
Re: [CODE4LIB] OCR PDFs
Yes, I've tried tesseract and found it to be pretty accurate, but I don't believe there is a way to integrate the text back into the PDF. It's easy to pull text out of image-based PDFs, but not to put the text back in. Driving me crazy... Thanks for the tips, James (In reply to Bridger Dyson-Smith.) -- James Tuttle Digital Repository Librarian NCSU Libraries, Box 7111 North Carolina State University Raleigh, NC 27695-7111 [EMAIL PROTECTED] (919)513-0651 Phone (919)515-3031 Fax
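One possible way around the "text back into the PDF" problem, offered only as a hedged sketch: it assumes a Tesseract release newer than the one discussed in this thread (3.03 or later), where the pdf output config writes a PDF containing the page image with an invisible text layer. The function names and the default language are illustrative choices, not anything from the thread.

```python
import subprocess


def tesseract_pdf_command(image, output_base, lang="eng"):
    """Build a tesseract call that writes <output_base>.pdf.

    Assumes Tesseract >= 3.03, where the 'pdf' config is available.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image, output_base, lang="eng"):
    """Run tesseract; raises CalledProcessError if it exits with an error."""
    subprocess.run(tesseract_pdf_command(image, output_base, lang), check=True)
    return f"{output_base}.pdf"
```

On older Tesseract versions the pdf config does not exist, so this would fail there.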
Re: [CODE4LIB] OCR PDFs
Thanks for the tip. Especially the part where you make it clear that OmniPage doesn't really work. Back to Acrobat, I guess. Thanks all! (In reply to Jonathan Brinley.) -- James Tuttle Digital Repository Librarian NCSU Libraries, Box 7111 North Carolina State University Raleigh, NC 27695-7111 [EMAIL PROTECTED] (919)513-0651 Phone (919)515-3031 Fax