[CODE4LIB] Code4Lib Issue 22 Published

2013-10-14 Thread sara amato
The Code4Lib Journal editors are pleased to bring you this latest issue.   You 
can find it at
http://journal.code4lib.org/issues/issue22; titles and abstracts below.




Editorial Introduction: Join Us at the Table
by Sara Amato
URL: http://journal.code4lib.org/articles/9052

The Call for Editors closes this Friday!  See:  
http://serials.infomotions.com/code4lib/archive/2013/201309/3567.html


VIAFbot and the Integration of Library Data on Wikipedia
by Maximilian Klein and Alex Kyrios
URL: http://journal.code4lib.org/articles/8964

This article presents a case study of a project, led by Wikipedians in 
Residence at OCLC and the British Library, to integrate authority data from the 
Virtual International Authority File (VIAF) with biographical Wikipedia 
articles. This linking of data represents an opportunity for libraries to 
present their traditionally siloed data, such as catalog and authority records, 
in more openly accessible web platforms. The project successfully added 
authority data to hundreds of thousands of articles on the English Wikipedia, 
and is poised to do so on hundreds of Wikipedias in other languages. 
Furthermore, the advent of Wikidata has created opportunities for further 
analysis and comparison of data from libraries and Wikipedia alike. This 
project, for example, has already led to insights into gender imbalance both on 
Wikipedia and in library authority work. We explore the possibility of similar 
efforts to link other library data, such as classification schemes, in 
Wikipedia.
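
A sense of the plumbing involved can be had with a few lines of Ruby against 
VIAF's public AutoSuggest endpoint. This is an illustrative sketch, not 
VIAFbot's actual code (which the abstract does not include), and the name 
queried is arbitrary:

  require 'net/http'
  require 'json'
  require 'uri'

  # Query VIAF's public AutoSuggest endpoint for candidate identifiers
  # matching a personal name.
  def viaf_suggest(name)
    uri = URI('http://viaf.org/viaf/AutoSuggest?query=' +
              URI.encode_www_form_component(name))
    response = Net::HTTP.get_response(uri)
    return [] unless response.is_a?(Net::HTTPSuccess)
    (JSON.parse(response.body)['result'] || []).map do |r|
      { term: r['term'], viaf_id: r['viafid'] }
    end
  end

  viaf_suggest('Austen, Jane').each do |hit|
    puts "#{hit[:term]} => http://viaf.org/viaf/#{hit[:viaf_id]}"
  end

On the English Wikipedia, VIAFbot recorded identifiers like these in the 
{{Authority control}} template.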



From Finding Aids to Wiki Pages: Remixing Archival Metadata with RAMP
by Timothy A. Thompson, James Little, David González, Andrew Darby, and Matt 
Carruthers
URL: http://journal.code4lib.org/articles/8962

The Remixing Archival Metadata Project (RAMP) is a lightweight web-based 
editing tool that is intended to let users do two things: (1) generate enhanced 
authority records for creators of archival collections and (2) publish the 
content of those records as Wikipedia pages. The RAMP editor can extract 
biographical and historical data from EAD finding aids to create new authority 
records for persons, corporate bodies, and families associated with archival 
and special collections (using the EAC-CPF format). It can then let users 
enhance those records with additional data from sources like VIAF and WorldCat 
Identities. Finally, it can transform those records into wiki markup so that 
users can edit them directly, merge them with any existing Wikipedia pages, and 
publish them to Wikipedia through its API.
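
The first step RAMP automates -- harvesting biographical and historical text 
from an EAD finding aid -- looks roughly like the following Nokogiri sketch. 
The file name is hypothetical, and this is not RAMP's own code:

  require 'nokogiri'

  # Harvest creator and biographical/historical notes from an EAD
  # finding aid -- the raw material for an EAC-CPF record.
  doc = Nokogiri::XML(File.read('finding_aid.xml'))  # hypothetical file
  doc.remove_namespaces!  # EAD files vary in namespace usage

  origination = doc.at_xpath('//archdesc/did/origination')
  creator = origination ? origination.text.strip : '(no creator found)'
  bioghist = doc.xpath('//archdesc/bioghist/p').map { |p| p.text.strip }

  puts "Creator: #{creator}"
  puts bioghist.join("\n\n")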



Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, Implications for 
Discovery Systems
By M. Bron, M. Proffitt and B. Washburn
URL: http://journal.code4lib.org/articles/8956

The ArchiveGrid discovery system is made up in part of an aggregation of EAD 
(Encoded Archival Description) encoded finding aids from hundreds of 
contributing institutions. In creating the ArchiveGrid discovery interface, the 
OCLC Research project team has long wrestled with what we can reasonably do 
with the large (120,000+) corpus of EAD documents. This paper presents an 
analysis of these EAD documents (the largest such analysis to date). 
The analysis is paired with an evaluation of how well the documents support 
various aspects of online discovery. The paper also establishes a framework for 
thresholds of completeness and consistency to evaluate the results. We find 
that, while the EAD standard and encoding practices have not offered support 
for all aspects of online discovery, especially in a large and heterogeneous 
aggregation of EAD documents, current trends suggest that the evolution of the 
EAD standard and the shift from retrospective conversion to new shared tools 
for improved encoding hold real promise for the future.
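
A toy version of this kind of analysis fits in a short Ruby script. The corpus 
path, element list, and 80% threshold below are illustrative assumptions, not 
the paper's actual method or figures:

  require 'nokogiri'

  # Count how many documents in a corpus use each element of interest,
  # then flag elements falling below an (illustrative) 80% threshold.
  ELEMENTS = %w[unittitle unitdate physdesc abstract scopecontent
                bioghist controlaccess]

  files = Dir.glob('ead_corpus/*.xml')  # hypothetical corpus directory
  abort 'No EAD files found' if files.empty?

  counts = Hash.new(0)
  files.each do |path|
    doc = Nokogiri::XML(File.read(path))
    doc.remove_namespaces!
    ELEMENTS.each { |el| counts[el] += 1 if doc.at_xpath("//#{el}") }
  end

  ELEMENTS.each do |el|
    pct = 100.0 * counts[el] / files.size
    flag = pct < 80 ? '  <- below threshold' : ''
    puts format('%-14s %5.1f%%%s', el, pct, flag)
  end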



Fedora Commons With Apache Hadoop: A Research Study
By Mohamed Mohideen Abdul Rasheed
URL: http://journal.code4lib.org/articles/8988

The Digital Collections digital repository at the University of Maryland 
Libraries is growing and in need of a new backend storage system to replace the 
current filesystem storage. Though not a traditional storage management system, 
we chose to evaluate Apache Hadoop because of its large and growing community 
and software ecosystem. Additionally, Hadoop’s capabilities for distributed 
computation could prove useful in providing new kinds of digital object 
services and maintenance for ever increasing amounts of data. We tested storage 
of Fedora Commons data in the Hadoop Distributed File System (HDFS) using an 
early development version of the Akubra-HDFS interface created by Frank Asseg. This 
article examines the f
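
For a sense of what writing repository content into HDFS involves at the wire 
level, here is a hedged Ruby sketch of the standard two-step WebHDFS CREATE 
exchange. Hostname, port, and paths are placeholders, and the study itself 
went through the Akubra-HDFS adapter rather than raw WebHDFS:

  require 'net/http'
  require 'uri'

  # Step 1: ask the namenode to create the file; it answers with a 307
  # redirect naming the datanode that will accept the bytes.
  create = URI('http://namenode.example.edu:50070/webhdfs/v1' \
               '/fedora/demo_1?op=CREATE&overwrite=true&user.name=fedora')
  redirect = Net::HTTP.start(create.host, create.port) do |http|
    http.request(Net::HTTP::Put.new(create.request_uri))
  end

  # Step 2: PUT the object's bytes to the datanode named in Location.
  datanode = URI(redirect['Location'])
  Net::HTTP.start(datanode.host, datanode.port) do |http|
    put = Net::HTTP::Put.new(datanode.request_uri)
    put.body = File.binread('demo_1.foxml.xml')  # hypothetical object file
    puts http.request(put).code  # "201" on success
  end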

[CODE4LIB] Job Posting / Temporary Law Library Technician / San Francisco, CA

2013-10-14 Thread Suzanne Richards
Apologies for the cross-postings . . .



LAC Group is seeking a Temporary Law Library Technician to work full-time on a 
project for our client, an international law firm. The position is based in the 
firm's San Francisco office for approximately 1-2 weeks to finish up an ongoing 
project. We are looking for someone able to start immediately and commit 
full-time for the 1-2 week duration.

Responsibilities:

§  Compile a title inventory of all print publications owned by the firm;

§  Inventory all electronic resource licenses and contracts and set up a 
system that creates notifications;

§  Weed the collection of all superseded or discontinued titles now on the 
shelves, review them, and decide whether to re-integrate the books into the 
active collection or keep them segregated;

§  Update, edit, or create new catalog records to reflect the active and 
historical collections;

§  Create and affix new or amended spine labels with library classification 
numbers to the remaining collections;

§  Physically reorganize the collection to reflect the classification schema.



Qualifications:

§  Previous experience working in a library is required; experience in a law 
firm library is highly preferred;

§  Previous experience in collection management, shelving, shifting and weeding 
collections;

§  Previous experience working with an ILS and library catalog; experience 
using InMagic preferred;

§  Must have excellent communication skills, both written and verbal;

§  The ability to multi-task with strong attention to detail;

§  Knowledge of and experience with MS Access and Excel are desired;

§  The ability to lift and move materials weighing approximately 25-40 lbs.



For immediate consideration, please apply at: http://goo.gl/AV7fsU



LAC Group is an Equal Opportunity/Affirmative Action employer and values 
diversity in the workforce.

LAC Group is a premier provider of recruiting and consultancy services for 
information professionals at U.S. and global organizations including Fortune 
100 companies, law firms, pharmaceutical companies, large academic 
institutions, National Libraries and prominent government agencies.


Re: [CODE4LIB] pdf2txt

2013-10-14 Thread Robert Haschart

Eric,

Very interesting.  I have been working with some existing pdf 
utilities with a goal of automatically extracting the abstract from 
technical reports, articles, and dissertations that are to be bulk 
uploaded to our institutional repository.  I tried two of our documents 
through your system and the first one worked great.

The second tech report I tried, however, generated this error message:

Software error:

No words from which to create a cloud - see add(...). at 
/usr/local/share/perl5/HTML/TagCloud/Centred.pm line 229.


For help, please send mail to the webmaster (root@localhost), giving 
this error message and the time and date of the error.



Although, based on some subsequent messages where you mention Tesseract, 
maybe I misunderstood and your tool only handles PDFs that have already 
been OCRed, which would explain why the second document (which contains 
only page images) fails.


-Bob Haschart


On 10/11/2013 11:16 AM, Eric Lease Morgan wrote:

For a limited period of time I am making publicly available a Web-based program 
called PDF2TXT -- http://bit.ly/1bJRyh8

PDF2TXT extracts the text from an OCRed PDF document and then does some rudimentary "distant 
reading" against the text in the form of word clouds, readability scores, concordance 
features, and "maps" (histograms) illustrating where terms appear in a text.

Here is the idea behind the application:

   1. In the Libraries I see people scanning, scanning, and
  scanning. I suppose these people then go home and read the
  document. They might even print it. These documents are long.
  Moreover, I'll bet they have multiple documents.

   2. Text mining requires digitized text, but PDF documents are
  frequently full of formatting. At the same time, they often
  have the text underneath. Our scanning software does OCR.

   3. By extracting the text from PDF documents, I can facilitate
  a different -- additional -- type of analysis against sets of
  one or more documents. PDF2TXT is the first step in this
  process.

What is really cool is that PDF2TXT works for many of the articles downloadable 
from the Libraries' article indexes. Search an article index. Download a full 
text, PDF version of the article. Feed it to PDF2TXT. Get more out of your 
article.

PDF2TXT currently has "creeping featuritis" -- meaning that it is growing in 
weird directions. Your feedback is more than welcome. (I know. The output is ugly.) Also, 
please be gentle with it because it does not process things the size of the Bible.

--

Eric Lease Morgan
Digital Initiatives Librarian

University of Notre Dame
Room 131, Hesburgh Libraries
Notre Dame, IN 46556
o: 574-631-8604
e: emor...@nd.edu

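A rough command-line equivalent of PDF2TXT's extraction stage can be sketched 
in Ruby, assuming poppler's pdftotext is installed, with a crude frequency 
tally standing in for the word-cloud step:

  # Pull the text layer out of an OCRed PDF, then tally word
  # frequencies as crude word-cloud input.
  pdf  = ARGV.fetch(0, 'article.pdf')
  text = `pdftotext #{pdf} -`  # "-" writes the extracted text to stdout
  abort 'No text layer found -- is the PDF OCRed?' if text.strip.empty?

  freq = Hash.new(0)
  text.downcase.scan(/[a-z]{4,}/) { |word| freq[word] += 1 }

  freq.sort_by { |_, n| -n }.first(25).each do |word, n|
    puts format('%6d  %s', n, word)
  end

An image-only PDF comes back empty here, consistent with the failure Bob 
describes above.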




Re: [CODE4LIB] Google Analytics on multiple systems

2013-10-14 Thread Amy Vecchione
Hello Joel and all,

We customized one dashboard that pulled all of our vended products into a
single dashboard and then teased them out with filters (sort of like this
presentation: http://conferences.infotoday.com/documents/158/B104_Hess.pdf).
I didn't like that, so now we use the "events" feature to push a tick every
time a link is clicked on our web page. In this way I can tease out access to
each vended site a little more easily. More on events here:
https://developers.google.com/analytics/devguides/collection/gajs/eventTrackerGuide

Amy




On Mon, Oct 14, 2013 at 12:36 PM, Joel Marchesoni wrote:

> Hello,
>
> We currently have Google Analytics on our main library pages and digital
> collections pages on the same domain. Now that CONTENTdm has a GA "easy
> button" we are going to add Analytics to it as well, and while we're at it
> probably LibGuides and non-authenticated ILLiad pages (I mainly want to see
> what percentage of ILLiad's hits come from mobile) as well. I was hoping to
> hear from the list whether you have all "service points" in one GA account
> or a separate account for each one, and why.
>
> Thanks,
>
> Joel Marchesoni
> Tech Support Analyst
> Hunter Library, Western Carolina University
> http://library.wcu.edu/
> 828-227-2860
> ~Please consider the environment before printing this email~
>



-- 
Amy Vecchione, Digital Access Librarian/Assistant Professor
http://works.bepress.com/amy_vecchione/
Albertsons Library, Boise State University, L212
http://library.boisestate.edu
(208) 426-1625


[CODE4LIB] Google Analytics on multiple systems

2013-10-14 Thread Joel Marchesoni
Hello,

We currently have Google Analytics on our main library pages and digital 
collections pages on the same domain. Now that CONTENTdm has a GA "easy button" 
we are going to add Analytics to it as well, and while we're at it probably 
LibGuides and non-authenticated ILLiad pages (I mainly want to see what 
percentage of ILLiad's hits come from mobile) as well. I was hoping to hear from the 
list whether you have all "service points" in one GA account or a separate 
account for each one, and why.

Thanks,

Joel Marchesoni
Tech Support Analyst
Hunter Library, Western Carolina University
http://library.wcu.edu/
828-227-2860
~Please consider the environment before printing this email~


[CODE4LIB] ANNOUNCEMENT: Traject MARC->Solr indexer release

2013-10-14 Thread Jonathan Rochkind
Jonathan Rochkind (Johns Hopkins) and Bill Dueber (University of 
Michigan) are happy to announce a robust, feature-complete beta release 
of "traject," a tool for indexing MARC data to Solr.


traject, in the vein of solrmarc, allows you to define your indexing 
rules using simple macro and translation files. However, traject runs 
under JRuby and is "ruby all the way down," so you can easily provide 
additional logic by simply requiring ruby files.


There's a sample configuration file to give you a feel for traject[1].
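
For a quick taste here as well, below is a minimal, hypothetical configuration 
in the same spirit; the Solr URL and field names are made up, so see the 
linked demo_config.rb for the real thing. Run it with something like 
"traject -c my_config.rb records.mrc":

  # my_config.rb -- a minimal, hypothetical traject configuration.
  settings do
    provide 'solr.url', 'http://localhost:8983/solr/catalog'
    provide 'processing_thread_pool', 3
  end

  to_field 'id',       extract_marc('001', first: true)
  to_field 'title_t',  extract_marc('245ab', trim_punctuation: true)
  to_field 'author_t', extract_marc('100abcd:110ab:111ab')

  # Arbitrary ruby logic is fair game too:
  to_field 'title_length' do |record, accumulator|
    title = record['245']
    accumulator << title['a'].length if title && title['a']
  end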

You can view the code[2] on github, and easily install it as a (jruby) 
gem using "gem install traject".


traject is a beta release, and we hope for feedback from more testers prior 
to a 1.0.0 release, but it is already being used in production to 
generate the HathiTrust (metadata-lookup) Catalog 
(http://www.hathitrust.org/). traject was developed using a test-driven 
approach and has undergone both continuous integration and an extensive 
benchmarking/profiling period to keep it fast. It is also well covered 
by high-quality documentation.


Feedback is very welcome on all aspects of traject including 
documentation, ease of getting started, features, any problems you have, 
etc.


What we think makes traject great:

* It's all just well-crafted and documented ruby code; easy to program, 
easy to read, easy to modify (the whole code base is only 6400 lines of 
code, more than a third of which is tests)
* Fast. Traject by default indexes using multiple threads, so you can 
use all your cores!
* Decoupled from specific readers/writers, so you can use ruby-marc or 
marc4j to read, and write to solr, a debug file, or anywhere else you'd 
like with little extra code.

* Designed so it's easy to test your own code and distribute it as a gem

We're hoping to build up an ecosystem around traject and encourage 
people to ask questions and contribute code (either directly to the 
project or via releasing plug-in gems).


[1] 
https://github.com/traject-project/traject/blob/master/test/test_support/demo_config.rb

[2] http://github.com/traject-project/traject


[CODE4LIB] OLAC Movie & Video Credit Annotation Experiment

2013-10-14 Thread Kelley McGrath
This project may be of interest to some on this list as an experiment to 
explore extracting structured data from free text in MARC. You also have a 
chance to help make it easier to find film and video in libraries if you're 
willing to take a few minutes to participate.

OLAC (http://www.olacinc.org/) is working on a project to try to make the process 
of finding film and video in library catalogs better. Please help us by 
annotating some film and video credits at http://olac-annotator.org/. It only 
takes a few minutes to make a contribution. We are challenging OLAC members to 
annotate three credits per day this week to see how many we can get done. 
Please join us in this endeavor. We are especially looking for people who know 
languages other than English to help us translate credits in languages from 
Chinese to Spanish to Urdu. Full announcement below. Please share this 
information with anyone you think might be interested.

Kelley

***

The OLAC Movie & Video Credit Annotation Experiment (http://olac-annotator.org) 
is part of a larger project to make it easier to find film and video in 
libraries and archives. In the current phase, we're trying to break existing 
MARC movie records down and pull out all the cast and crew information so that 
it may be re-ordered and manipulated. We also want to make explicit connections 
between cast and crew names and their roles or functions in the movie 
production. Adding these formal connections to movie records will allow us to 
provide a better user experience. For example, library patrons would be able to 
search just for directors or just for cast members or only for movies where 
Clint Eastwood is actually in the cast rather than all the movies that he is 
connected with. Libraries would have the flexibility to create more 
standardized and readable displays of production credits, such as you see at 
IMDb (see http://www.imdb.com/title/tt1205489/ -- not that we necessarily want 
IMDb's display, but that we would have much more flexibility in designing 
displays), rather than views like a typical library catalog (such as 
http://janus.uoregon.edu/record=b3958782).
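
The raw notes in question are easy to inspect with the ruby-marc gem. A hedged 
sketch, with a hypothetical file of MARC movie records:

  require 'marc'  # gem install marc

  # Print the free-text credits notes volunteers are annotating:
  # MARC 508 (creation/production credits) and 511 (participants or
  # performers).
  reader = MARC::Reader.new('movie_records.mrc')  # hypothetical file

  reader.each do |record|
    credits = record.fields(%w[508 511]).map { |f| f['a'] }.compact
    next if credits.empty?
    title = record['245']
    puts title ? title['a'] : '(no title)'
    credits.each { |note| puts "  #{note}" }
  end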

We therefore want to convert our existing records into more structured sets of 
data. Eventually, we intend to automate most of this conversion. For now, we 
need help from human volunteers, who can train our software to recognize the 
many ways names and roles have been listed in library records for movies. Give 
us a hand at http://olac-annotator.org. For an explanation with more library 
jargon thrown in, see http://olac-annotator.org/#/more.

The OLAC Movie & Video Credit Annotation Experiment was conceived by Kelley 
McGrath, developed by Chris Fitzpatrick and funded by a Richard and Mary 
Corrigan Solari Library Fellowship Incentive Award from the University of 
Oregon Libraries.


Kelley McGrath
Metadata Management Librarian
University of Oregon Libraries
541-346-8232
kell...@uoregon.edu


[CODE4LIB] Job: Research/Scholarship Initiatives Manager at Columbia University

2013-10-14 Thread jobs
Apologies for cross-posting -- but with encouragement to
pass along this announcement broadly!

  
Research/Scholarship Initiatives Manager at Columbia University

  
The Center for Digital Research and Scholarship (CDRS) (cdrs.columbia.edu), a
division of the Columbia University Libraries/Information Services (CUL/IS),
seeks a full-time Research/Scholarship Initiatives Manager. Reporting to CDRS'
director, this newly created position will be responsible for devising,
implementing, and leading outreach activities promoting CDRS' digital
projects, the institutional repository, and scholarly communication
initiatives on campus, helping to position CDRS and CUL/IS as key partners in
the understanding, creation, and adoption of new modes of scholarly
communication. In particular, the Research/Scholarship Initiatives Manager
will be expected to advance access to locally created online publications,
conference proceedings, electronic theses and dissertations, Columbia-produced
journals, data sets, born-digital research output, and other digital content
and to strengthen the linchpin role of the institutional repository, Academic
Commons, as a research platform. He/she will work closely with library
colleagues and with University faculty to identify and to advise on issues
related to intellectual property and open access of scholarly output at the
University. Through actively monitoring national and international trends,
standards, and policies in scholarly communication and by representing CDRS,
CUL/IS, and the University in local, regional, national, and international
forums and organizations relating to scholarly communication and digital
scholarship, the Research/Scholarship Initiatives Manager will also serve as a
resource on local and national policy to help the University community stay
informed about and engaged with the changing landscape for scholarly
communication.

  
Position Duties

● Oversees the institutional repository service, which includes supervising
repository staff and interns. In collaboration with other library personnel,
develops services and policies related to the repository infrastructure.

● Oversees the evaluation and development of new research and scholarship
services and initiatives through ongoing identification and evaluation of
emerging technologies, researchers' and scholars' functional and sociological
requirements, user-centered design, usage analytics, and user assessment.

● Promotes CDRS' digital projects and publishing initiatives and institutional
repository program; supports sustainable scholarly communication at Columbia;
devises, implements, and leads outreach activities to enhance and support the
services of CDRS and participates in collective marketing and outreach
efforts. Along with other members of the CDRS staff, develops and oversees
programming and engagement activities, especially in creating digital
publishing and research life cycle resources and leading workshops for the
campus community (e.g., on open access), typically in collaboration with
library and other campus partners.

● Monitors and reports on current developments in scholarly communication,
open access and alternative publishing models, institutional repositories, and
related legislative initiatives. Informs library colleagues, research faculty,
graduate students, and University administrators of changes in scholarly
communication, on new technologies available to them, and on ways in which
they can contribute to new and evolving methods for distribution of research
results. Works collaboratively with faculty to promote and support digital
publishing and open access. Encourages experimentation and risk-taking in
digital scholarship projects.

● Represents CDRS and CUL/IS at workshops, institutes, seminars, and
conferences at local, state, regional, national, and international levels and
serves on standing inter-institutional- and institutional-level committees and
task forces related to digital research and scholarship. Shares with library
colleagues and departmental faculty and staff relevant information gained from
professional activities and uses that knowledge to inform and improve CDRS'
operations and services.

● Identifies funding opportunities and contributes to the writing of grants.

  
Required Qualifications

● Bachelor's degree required. Advanced degree desirable. Minimum 4-6 years'
related experience, including two or more years of closely related experience,
preferably in an academic library.

● Understanding of current issues, trends, and new and emerging technologies
in scholarly communication and their effect on the academic environment,
especially those of cyberinfrastructure and cyberscholarship. Awareness of
current issues in scholarly communication (e.g., open access, author rights,
copyright, fair use, deposit mandates and resolutions, data sharing). Clear
interest in staying abreast of new digital project techniques, repository
trends, and best practices (e.g., DOIs, OR

[CODE4LIB] Job: Research Informationist at University of Cincinnati Libraries

2013-10-14 Thread jobs
**POSITION ANNOUNCEMENT - RESEARCH INFORMATIONIST (Position Number 213UC5231)** 
 
  
Tenure-track, 12-month Faculty Appointment

  
University of Cincinnati Health Sciences Library

  
The Donald C. Harrison Health Sciences Library (HSL) seeks a knowledgeable,
motivated, and service-oriented Research Informationist to deliver services
and resources to the Academic Health Center and UC Health research and
translational sciences community. The incumbent will work closely with other
HSL library staff to design, develop, and implement a suite of cohesive and
comprehensive services for the UC Academic Health Center and UC Health
research community. This is a full-time tenure track faculty appointment.

  
The full job description and application information is available at 
http://www.libraries.uc.edu/information/personnel/index.html

  
Apply at www.jobsatuc.com (search position number
213UC5231) or call 513-558-6019 for assistance.

  
UC is an EE/AA employer.



Brought to you by code4lib jobs: http://jobs.code4lib.org/job/10337/


Re: [CODE4LIB] edUi Discount

2013-10-14 Thread Sean Hannan
Hey, that's me! Come see me talk!

It's a good conference. I've gone the last two years. Cheapest
design-thinking conference you'll ever see.

-Sean

On 10/14/13 8:24 AM, "EdUI Conference"  wrote:

>This is how you do digital collections in 2013


[CODE4LIB] edUi Discount

2013-10-14 Thread EdUI Conference
Just a reminder that there’s still time to register for the edUi Conference 
(Nov. 4-6, Richmond, VA).

Code4lib readers can save $100 on registration (usually $550) with the discount 
code library.

Here are some sessions code4lib readers might like:

Reading, Writing, and Research in the Digital Age
WordPress Themes 101
Getting ‘Em on Board: Guiding Staff Through Times of Change
Taking it Offline
Mobile for Dinosaurs
This is how you do digital collections in 2013
A Web Designer’s Guide to Being Lazy
Geo-discovery of Library Collections with Google Glass
Responsive Design: An Undead Introduction

 Hope to see you there!



-Trey


Re: [CODE4LIB] pdf2txt

2013-10-14 Thread Nicolas Franck
Could this also be done by Apache Tika? Or am I missing a crucial point?

http://tika.apache.org/1.4/gettingstarted.html

Apparently it has a command-line utility that extracts metadata and content from
various document formats, and prints it to the standard output. The output
can then be supplied to text-analysing tools like Solr.
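
As a hedged sketch of that pipeline in Ruby, pairing the Tika 1.4 CLI with the 
rsolr gem (the jar path, Solr URL, and field names are assumptions):

  require 'rsolr'  # gem install rsolr

  # Extract plain text with the Tika CLI, then hand it to Solr.
  pdf  = 'article.pdf'
  text = `java -jar tika-app-1.4.jar --text #{pdf}`
  abort 'Tika returned no text' if text.strip.empty?

  solr = RSolr.connect(url: 'http://localhost:8983/solr/reports')
  solr.add(id: pdf, fulltext_t: text)
  solr.commit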

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jodi Schneider 
[jschnei...@pobox.com]
Sent: Monday, October 14, 2013 11:22 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] pdf2txt

Hi Penelope,

Of the document you write, the key part for this discussion seems to be
this:

Some suggested ways to make the scanned information accessible in a
> seamless manner are:


>- The catalogue records to have two links, one to the actual document
>and the other to a search page that enables searching all the documents.
>The second link could be something like “Click here to go to the full-text
>search for departmental reports”.
>- An easy-to-use (user-friendly) full-text search interface.
>
So are you asking how to make a full-text search interface using the OCR
results from Eric's tool?

This doesn't at all answer your question, but gives a pointer to OCR
quality control:
"Case Study: Using Perl and CGI Scripts to Automate a Quality Control
Workflow for Scanned Congressional Documents"
http://journal.code4lib.org/articles/6731

I think if you ask a more particular question (that doesn't rely on reading
your draft), you might get a better answer.

-Jodi


On Mon, Oct 14, 2013 at 6:48 AM, Penelope Campbell <
penelope.campb...@facs.nsw.gov.au> wrote:

> Dear Eric,
> Thanks for this.
> As a small special library (solo librarian) in an Australian State
> Government Department I use DB/TextWorks, which has a feature of
> importing documents so that the full text can be read. It, though, only
> imports the full text, not what you have done, which is really great. I
>  wrote a small piece (see attached) explaining what I am in the process
> of doing. I am using the library catalogue records as metadata.  But I
> am hoping for something more.  I do really want to open up the
> collection and make the information discoverable more than just the
> Library catalogue. I had contacted Jaume Nualart, who wrote a paper on
> some ways to present terms called Texty.
> http://informationr.net/ir/18-2/paper581.html But it is not a piece of
> software. I am quite interested in what you have done. I am just trying
> to work out a way to show relevancy and this may be something I could
> integrate into the Library catalogue.
>
> I hope you can take the time to reply to me.
> Thank you
>
> Penelope Campbell | Library Manager
> Department of Family and Community Services | Housing NSW
> T 02 8753 8732 | F 02 8753 8734
> A Ground Floor, 223-239 Liverpool Road Ashfield NSW, 2131
> A Locked bag 4001 Ashfield BC NSW, 1800
> E penelope.campb...@facs.nsw.gov.au
> W www.housing.nsw.gov.au
>
> -Original Message-
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Eric Lease Morgan
> Sent: Saturday, 12 October 2013 2:16 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] pdf2txt
>
>
> For a limited period of time I am making publicly available a Web-based
> program called PDF2TXT -- http://bit.ly/1bJRyh8
>
> PDF2TXT extracts the text from an OCRed PDF document and then does some
> rudimentary "distant reading" against the text in the form of word
> clouds, readability scores, concordance features, and "maps"
> (histograms) illustrating where terms appear in a text.
>
>
> Here is the idea behind the application:
>
>   1. In the Libraries I see people scanning, scanning, and
>  scanning. I suppose these people then go home and read the
>  document. They might even print it. These documents are long.
>  Moreover, I'll bet they have multiple documents.
>
>   2. Text mining requires digitized text, but PDF documents are
>  frequently full of formatting. At the same time, they often
>  have the text underneath. Our scanning software does OCR.
>
>   3. By extracting the text from PDF documents, I can facilitate
>  a different -- additional -- type of analysis against sets of
>  one or more documents. PDF2TXT is the first step in this
>  process.
>
> What is really cool is that PDF2TXT works for many of the articles
> downloadable from the Libraries' article indexes. Search an article
> index. Download a full text, PDF version of the article. Feed it to
> PDF2TXT. Get more out of your article.
>
> PDF2TXT currently has "creeping featuritis" -- meaning that it is
> growing in weird directions. Your feedback is more than welcome. (I
> know. The output is ugly.) Also, please be gentle with it because it
> does not process things the size of the Bible.
>
> --
>
> Eric Lease Morgan
> Digital Initiatives Librarian
>
> University of Notre Dame
> Roo

Re: [CODE4LIB] pdf2txt

2013-10-14 Thread Jodi Schneider
Hi Penelope,

Of the document you write, the key part for this discussion seems to be
this:

Some suggested ways to make the scanned information accessible in a
> seamless manner are:


>- The catalogue records to have two links, one to the actual document
>and the other to a search page that enables searching all the documents.
>The second link could be something like “Click here to go to the full-text
>search for departmental reports”.
>- An easy-to-use (user-friendly) full-text search interface.
>
So are you asking how to make a full-text search interface using the OCR
results from Eric's tool?

This doesn't at all answer your question, but gives a pointer to OCR
quality control:
"Case Study: Using Perl and CGI Scripts to Automate a Quality Control
Workflow for Scanned Congressional Documents"
http://journal.code4lib.org/articles/6731

I think if you ask a more particular question (that doesn't rely on reading
your draft), you might get a better answer.

-Jodi


On Mon, Oct 14, 2013 at 6:48 AM, Penelope Campbell <
penelope.campb...@facs.nsw.gov.au> wrote:

> Dear Eric,
> Thanks for this.
> As a small special library (solo librarian) in an Australian State
> Government Department I use DB/TextWorks, which has a feature of
> importing documents so that the full text can be read. It, though, only
> imports the full text, not what you have done, which is really great. I
>  wrote a small piece (see attached) explaining what I am in the process
> of doing. I am using the library catalogue records as metadata.  But I
> am hoping for something more.  I do really want to open up the
> collection and make the information discoverable more than just the
> Library catalogue. I had contacted Jaume Nualart, who wrote a paper on
> some ways to present terms called Texty.
> http://informationr.net/ir/18-2/paper581.html But it is not a piece of
> software. I am quite interested in what you have done. I am just trying
> to work out a way to show relevancy and this may be something I could
> integrate into the Library catalogue.
>
> I hope you can take the time to reply to me.
> Thank you
>
> Penelope Campbell | Library Manager
> Department of Family and Community Services | Housing NSW
> T 02 8753 8732 | F 02 8753 8734
> A Ground Floor, 223-239 Liverpool Road Ashfield NSW, 2131
> A Locked bag 4001 Ashfield BC NSW, 1800
> E penelope.campb...@facs.nsw.gov.au
> W www.housing.nsw.gov.au
>
> -Original Message-
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Eric Lease Morgan
> Sent: Saturday, 12 October 2013 2:16 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] pdf2txt
>
>
> For a limited period of time I am making publicly available a Web-based
> program called PDF2TXT -- http://bit.ly/1bJRyh8
>
> PDF2TXT extracts the text from an OCRed PDF document and then does some
> rudimentary "distant reading" against the text in the form of word
> clouds, readability scores, concordance features, and "maps"
> (histograms) illustrating where terms appear in a text.
>
>
> Here is the idea behind the application:
>
>   1. In the Libraries I see people scanning, scanning, and
>  scanning. I suppose these people then go home and read the
>  document. They might even print it. These documents are long.
>  Moreover, I'll bet they have multiple documents.
>
>   2. Text mining requires digitized text, but PDF documents are
>  frequently full of formatting. At the same time, they often
>  have the text underneath. Our scanning software does OCR.
>
>   3. By extracting the text from PDF documents, I can facilitate
>  a different -- additional -- type of analysis against sets of
>  one or more documents. PDF2TXT is the first step in this
>  process.
>
> What is really cool is that PDF2TXT works for many of the articles
> downloadable from the Libraries' article indexes. Search an article
> index. Download a full text, PDF version of the article. Feed it to
> PDF2TXT. Get more out of your article.
>
> PDF2TXT currently has "creeping featuritis" -- meaning that it is
> growing in weird directions. Your feedback is more than welcome. (I
> know. The output is ugly.) Also, please be gentle with it because it
> does not process things the size of the Bible.
>
> --
>
> Eric Lease Morgan
> Digital Initiatives Librarian
>
> University of Notre Dame
> Room 131, Hesburgh Libraries
> Notre Dame, IN 46556
> o: 574-631-8604
> e: emor...@nd.edu
>
>
> ==
>
> Security Statement
>
> This email may be confidential and contain privileged information. If you
> are not the intended recipient you must not use, disclose, copy or
> distribute this email, including any attachments. Confidentiality and legal
> privilege attached to this communication are not waived or lost by reason
> of mistaken delivery to you. If