[CODE4LIB] position announcement [tulane university]

2017-03-09 Thread Eric Lease Morgan
[The following position announcement is being forwarded upon request. —ELM]


> We are currently hiring for the Applications Developer III position at the 
> Howard-Tilton Memorial Library at Tulane University located in New Orleans, 
> Louisiana.
> 
> Please see the job details here: http://bit.ly/2nb119e 
> 
> To see a listing of all open positions  available at Howard-Tilton please 
> visit our website: http://library.tulane.edu/about/job-opportunities
> 
> --
> Candace Maurice
> Web Developer
> Howard-Tilton Memorial Library
> Tulane University
> 504.314.7784
> cmaur...@tulane.edu


[CODE4LIB] on hold

2016-07-19 Thread Eric Lease Morgan
As of this message, I’m putting the Code4Lib mailing list “on hold” while the 
list’s configurations and archives get moved from one place to another. ‘More 
soon, and this process will take at least a day. Please be patient. —Eric Lease 
Morgan


Re: [CODE4LIB] code4lib mailing list

2016-07-13 Thread Eric Lease Morgan
> Alas, the Code4Lib mailing list software will most likely need to be migrated 
> before the end of summer…

On Monday Wayne Graham (CLIR/DLF) and I are hoping to migrate the Code4Lib 
mailing list to a different domain. We don’t think any archives, subscriptions, 
nor preferences will get lost in the process. (“Famous last words.”) Wish us 
luck. —Eric Morgan


[CODE4LIB] mashcat

2016-07-12 Thread Eric Lease Morgan
The following Mashcat event seems more than apropos to our group:

  We are excited to announce that the second face-to-face Mashcat
  event in North America will be held on January 24th, 2017, in
  downtown Atlanta, Georgia, USA. We invite you to save the date.
  We will be sending out a call for session proposals and opening
  up registration in the late summer and early fall.

  Not sure what Mashcat is? “Mashcat” was originally an event in
  the UK in 2012 aimed at bringing together people working on the
  IT systems side of libraries with those working in cataloguing
  and metadata. Four years later, Mashcat is a loose group of
  metadata specialists, cataloguers, developers and anyone else
  with an interest in how metadata in and around libraries can be
  created, manipulated, used and re-used by computers and software.
  The aim is to work together and bridge the communications gap
  that has sometimes gotten in the way of building the best tools
  we possibly can to manage library data. Among our accomplishments
  in 2016 was holding the first North American face-to-face event
  in Boston in January and running webinars. If you’re unable to
  attend a face-to-face meeting, we will be holding at least one
  more webinar in 2016.

  http://bit.ly/29FuUuY

Actually, the mass-editing of cataloging (MARC) data is something that is 
particularly interesting to me these days. Hand-crafted metadata records are 
nice, but increasingly unscalable.

—
Eric Lease Morgan


Re: [CODE4LIB] date fields

2016-07-12 Thread Eric Lease Morgan
On Jul 11, 2016, at 4:32 PM, Kyle Banerjee  wrote:

>> https://github.com/traject/traject/blob/e98fe35f504a2a519412cd28fdd97dc514b603c6/lib/traject/macros/marc21_semantics.rb#L299-L379
> 
> Is the idea that this new field would be stored as MARC in the system (the
> ILS?).
> 
> If so, the 9xx solution already suggested is probably the way to go if the
> 008 route suggested earlier won't work for you. Otherwise, you run a risk
> that some form of record maintenance will blow out all your changes.
> 
> The actual use case you have in mind makes a big difference in what paths
> make sense, so more detail might be helpful.


Thank you, one & all, for the input & feedback. After thinking about it for a 
while, I believe I will save my normalized dates in a local (9xx) field of some 
sort.

My use case? As a part of the "Catholic Portal", I aggregate many different 
types of metadata and essentially create a union catalog of rare and 
infrequently held materials of a Catholic nature. [1] In an effort to measure 
“rarity” I've counted and tabulated the frequency of a given title in WorldCat. 
I now want to measure the age of the materials in the collection. To do that I 
need to normalize dates and evaluate them. Ideally I would save the normalized 
dates back in MARC and give the MARC back to Portal member libraries, but 
since there is really no standard field for such a value, anything I choose is 
all but arbitrary. I’ll use some 9xx field, just to make things easy. I can 
always (and easily) change it later.

[1] "Catholic Portal” - http://www.catholicresearch.net

—
Eric Lease Morgan


[CODE4LIB] date fields

2016-07-11 Thread Eric Lease Morgan
I’m looking for date fields.

Or more specifically, I have been given a pile o’ MARC records, and I will be 
extracting for analysis the values of dates from MARC 260$c. From the resulting 
set of values — which will include all sorts of string values ([1900], c1900, 
190?, 19—, 1900, etc.) — I plan to normalize things to integers like 1900. I 
then want to save/store these normalized values back to my local set of MARC 
records. I will then re-read the data to create things like timelines, to 
answer questions like “How old is old?”, or to “simply” look for trends in the 
data.
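
By way of illustration, here is a rough sketch of the sort of normalization I have in mind (Python; the handful of rules below is ad hoc and will surely need to grow):

  import re

  def normalize_date(value):
      """Reduce a free-text 260$c value to a four-digit year, if possible."""
      value = value.strip()
      # "19--" or "19??" style: fall back to the first year of the century
      match = re.search(r'\b(\d{2})[-?]{2}', value)
      if match:
          return int(match.group(1) + '00')
      # "190?" or "193-" style: fall back to the first year of the decade
      match = re.search(r'\b(\d{3})[-?]', value)
      if match:
          return int(match.group(1) + '0')
      # plain four-digit year, possibly bracketed or prefixed with "c"
      match = re.search(r'(\d{4})', value)
      if match:
          return int(match.group(1))
      return None

  for raw in ['[1900]', 'c1900', '190?', '19--', '1900.']:
      print(raw, '->', normalize_date(raw))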

What field would y’all suggest I use to store my normalized date content?

—
Eric Morgan


Re: [CODE4LIB] code4lib mailing list [clir]

2016-06-14 Thread Eric Lease Morgan
On Jun 7, 2016, at 10:11 AM, Eric Lease Morgan  wrote:

>>> Alas, the Code4Lib mailing list software will most likely need to be 
>>> migrated before the end of summer, and I’m proposing a number of possible 
>>> options for the list’s continued existence...
>> 
>> Our list — Code4Lib — will be migrating to the Digital Library Federation 
>> (DLF) sometime in the near future. 
> 
> This is a gentle reminder that the Code4Lib mailing list will be migrating to 
> a different address sometime in the very near future. Specifically, it will 
> be migrating to the Digital Library Federation. I suspect this work will be 
> finished in less than thirty days, and when I know the exact address of the 
> new list, I will share it here.
> 
> Thanks go to the DLF in general, and specifically Wayne Graham and Bethany 
> Nowviskie for enabling this to happen. “Thanks!”


Yet again, this is a reminder that the mailing list will be moving, and I think 
the list's address will be associated with CLIR (Council on Library and 
Information Resources), which is the host of the DLF (Digital Library 
Federation). [1, 2]

Wayne Graham & I (actually, mostly Wayne) have been practicing with the 
migration process. We have managed to move the archives and the subscriber list 
(complete with subscription preferences) to a new machine. We — Wayne & I — now 
need to coordinate to do the move for real. To do so we will put the mailing 
list on “hold”, copy things from one computer to another, and then “release” 
the new implementation. The only things that will get lost in the migration 
process are messages sent to the older implementation. Consequently, people 
will need to start sending messages to a new address. I’m not sure, but this 
migration might start happening very early next week — June 20. 

Now back to our regularly scheduled programming (all puns intended).

[1] CLIR - http://clir.org
[2] DLF - https://www.diglib.org

—
Eric Lease Morgan


Re: [CODE4LIB] Formalizing Code4Lib?

2016-06-14 Thread Eric Lease Morgan
On Jun 14, 2016, at 8:01 PM, Coral Sheldon-Hess  wrote:

> Now, there kind of is. By my count, we have 4 volunteers. Chad, Tom, Galen,
> and me. Anyone else?

  Coral, please sign me up. I’d like to learn more. —Eric Lease Morgan


Re: [CODE4LIB] Formalizing Code4Lib? [diy]

2016-06-10 Thread Eric Lease Morgan
On Jun 9, 2016, at 7:55 PM, Coral Sheldon-Hess  wrote:

> One note about what we're discussing: when we talk about just doing the
> regional events (and I mean beyond 2017, which will be a special case if a
> host city can't step in), we need to realize that we have a lot of members
> who aren't in a Code4Lib region.
> 
> You might think I'm talking about Alaska, because that's where I lived when
> I first came to a Code4Lib conference. And that's certainly one place,
> along with Hawaii, that would be left out.
> 
> But even living in Pittsburgh, I'm not in a Code4Lib region, that I can
> tell. Pittsburgh isn't in the midwest, and we also aren't part of the
> tri-state region that Philly's in. I'm employed (part-time/remote) in the
> DC/MD region, so if I can afford the drive and hotel, that's probably the
> one I'd pick right now. I guess?
> 
> So, even landlocked in the continental US, it's possible not to have a
> region.
> 
> More importantly, though: my understanding is that our international
> members are fairly spread out -- maybe Code4Lib Japan being an exception?
> -- so, even ignoring weird cases like Pittsburgh, we stand to lose some
> really fantastic contributors to our community if we drop to regional-only.
> 
> Just something else to consider.
> - Coral


Interesting. Consider searching one or more of the existing Code4Lib mailing 
list archives for things Pittsburgh:

  * https://www.mail-archive.com/code4lib@listserv.nd.edu/
  * http://serials.infomotions.com/code4lib/
  * https://listserv.nd.edu/cgi-bin/wa?A0=CODE4LIB

I’d be willing to bet you can identify six or seven Code4Lib’ers in the results. 
You could then suggest a “meet-up”, a get-together over lunch, or have them 
visit you in your space or a nearby public library. Even if there are only 
three of you, then things will get started, and it will grow from there. I 
promise. —Eric Morgan


Re: [CODE4LIB] Formalizing Code4Lib? [diy]

2016-06-08 Thread Eric Lease Morgan
 restaurants for larger groups.

  8) Do the event - On the day of the event, make sure you have name tags, 
lists of attendees, and logistical instructions such as connecting to the 
wi-fi. Have volunteers who want to help greet attendees, organize eating 
events, or lead tours. That is easy. Libraries are full of “service-oriented 
people”. Use the agenda as an outline, not a rule book. Smile. Breathe. Have 
fun. Play host to a party. Understand the problem you are trying to solve — 
communication & sharing. Let it flow. Don’t constantly ask yourself, “What if…” 
because if you do, then I’m going to ask you, “What are you going to do if a 
cow comes into the library?” and I’m going to expect an answer. 

  9) Record the event - Have people take notes on the sessions, and then hope 
they write up their notes for later publishing. Video streaming is expensive 
and over the top. Gather up people’s presentation materials and republish them.

 10) End the event - Graciously say good-bye, clean up, and rest. Put the 
coordination on your vita and as a part of your annual review.

 11) Evaluate - Follow-up with the people who attended. Ask them what they 
thought worked well and didn’t work well. Record this feedback on the Web page. 
This is all a part of the communication process.

 12) Repeat - Go to Step #1 because this is a never-ending process. 

Now let’s talk about attendee costs. A national meeting almost always requires 
airfare, so we are talking at least a couple hundred dollars. Then there is the 
stay in the “cool” hotel which is at least another hundred dollars per night. 
Taxi fare. Meals. Registration. Etc. Seriously, how much are you going to 
spend? Think about spending that same amount of money more directly for the 
local/regional meeting. If you really wanted to, coordinate with your 
colleagues and sponsor a caterer. Carpool with your colleagues to the event. 
Coordinate with your colleagues and sponsor a tour. Coordinate with your 
colleagues and sponsor video streaming. In the end, I’m positive everybody will 
spend less money.

What do you get? In the end you get a whole lot of professional networking with 
a relatively small group of people. And since they are regional, you will 
continue relationships with them. Want to network with people outside your 
region? No problem. Look on the Code4Lib wiki, see what's playing next, and 
attend the meeting.

Instead of centralization — like older mainframe types of computing — I suggest 
we embrace the ideas of de-centralization a la the Internet and TCP/IP. This 
way, there is no central thing to break, and everything will just find another 
path to get to where it is going. Instead of one large system — let’s call it 
the integrated library system — let’s employ the Unix Way and have lots of 
tools that do one thing and one thing well. When smaller, less expensive 
scholarly journal publishers get tired and find the burden too cumbersome, what 
do they do? They associate themselves with a sort of fiduciary who takes on 
financial responsibilities as well as provides a bit of safety. And then what 
happens to those publications? Hmmm… Can anybody say, “Serials pricing crisis?”

Let’s forgo identifying a fiduciary for a while. What will they facilitate? The 
funding of a large meeting space in a “fancy” hotel? Is that really necessary 
when the same communication & sharing can be done on a smaller, less 
expensive, and more intimate scale? DIY. 

† Here’s a really tricky idea. Do what the TEI people do. Identify a time and 
place where many similar people are having a meeting, and then sponsor a 
Code4Lib-specific event on either end of the first meeting. NASIG? DLF? ACRL? 
Call it a symbiotic relationship.

—
Eric Lease Morgan


Re: [CODE4LIB] Formalizing Code4Lib?

2016-06-07 Thread Eric Lease Morgan
On Jun 7, 2016, at 10:53 PM, Mike Giarlo  wrote:

>>> I'm also interested in investigating how to formalize Code4Lib as an
>>> entity, for all of the reasons listed earlier in the thread…
>> 
>> -1 because I don’t think the benefits will outweigh the emotional and 
>> bureaucratic expense. We already have enough rules.
> 
> Can you say more about what you expect "the emotional and bureaucratic 
> expense" to be?

Bureaucratic and emotional expenses include yet more committees and politics. 
Things will happen increasingly slowly. Our community will be less nimble and 
somewhat governed by outside forces. We will end up with presidents, 
vice-presidents, secretaries, etc. Increasingly there will be “inside” and 
“outside”. The inside will make decisions and the outside won’t understand and 
feel left out. That is what happens when formalization takes place.

The regional conferences are good things. I call them franchises. The annual 
meeting does not have to be a big deal, and the smaller it is, the less 
financial risk there will be. Somebody will always come forward. It will just 
happen.

—
Eric Lease Morgan


Re: [CODE4LIB] Formalizing Code4Lib?

2016-06-07 Thread Eric Lease Morgan
> I'm also interested in investigating how to formalize Code4Lib as an
> entity, for all of the reasons listed earlier in the thread…


-1 because I don’t think the benefits will outweigh the emotional and 
bureaucratic expense. We already have enough rules. 

—
ELM


[CODE4LIB] viaf and the levenshtein algorithm

2016-06-07 Thread Eric Lease Morgan
In the past few weeks I have had some interesting experiences with WorldCat, 
VIAF, and the Levenshtein algorithm. [1, 2]

In short, I was given a set of authority records with the goal of associating 
each name with a VIAF identifier. To accomplish this goal I first created a 
rudimentary database — an easily parsed list of MARC 1xx fields. I then looped 
through the database, and searched VIAF via the AutoSuggest interface looking 
for one-to-one matches. If found, I updated my database with the VIAF 
identifier. The AutoSuggest interface was fast but only able to associate 20% 
of my names with identifiers. (Moreover, I don’t know how it works; AutoSuggest 
is a “black box” technology.)

I then looped through the database again, but this time I queried VIAF using 
the SRU interface. Searches often returned many hits, not just one-to-one 
matches, but through the use of the Levenshtein algorithm I was able to 
intelligently select items from the search results and update my database 
accordingly. [3] Through the use of the SRU/Levenshtein combination, I was able 
to associate another 50-55 percent of my names with identifiers.
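
By way of illustration, here is a minimal, self-contained sketch of the selection step; the pure-Python edit distance below stands in for whatever Levenshtein library one actually uses, and the candidate headings are made up:

  def levenshtein(a, b):
      """Compute the Levenshtein (edit) distance between two strings."""
      if len(a) < len(b):
          a, b = b, a
      previous = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          current = [i]
          for j, cb in enumerate(b, 1):
              current.append(min(current[j - 1] + 1,             # insertion
                                 previous[j] + 1,                # deletion
                                 previous[j - 1] + (ca != cb)))  # substitution
          previous = current
      return previous[-1]

  # choose the candidate heading closest to the local heading
  local = 'Morgan, Eric Lease'
  candidates = ['Morgan, Eric Lease', 'Morgan, Eric', 'Morse, Erik L.']
  best = min(candidates, key=lambda candidate: levenshtein(local, candidate))
  print(best, levenshtein(local, best))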

Now that I have close to 75% of my names associated with VIAF identifiers, I 
can update my authority list’s MARC 024 fields; in turn, I can then provide 
enhanced services against my catalog as well as pave the way for linked data 
implementations.

Sometimes our library automation tasks can use a bit more computer science. 
Librarianship isn’t all about service and the humanities. Librarianship is an 
arscient discipline. [4]

[1] VIAF Finder - http://infomotions.com/blog/2016/05/viaf-finder/
[2] Almost perfection - http://infomotions.com/blog/2016/06/levenshtein/
[3] Levenshtein - https://en.wikipedia.org/wiki/Levenshtein_distance
[4] arscience - http://infomotions.com/blog/2008/07/arscience/

—
Eric Lease Morgan


Re: [CODE4LIB] code4lib mailing list [dlf]

2016-06-07 Thread Eric Lease Morgan
On May 12, 2016, at 8:30 AM, Eric Lease Morgan  wrote:

>> Alas, the Code4Lib mailing list software will most likely need to be 
>> migrated before the end of summer, and I’m proposing a number of possible 
>> options for the list’s continued existence...
> 
> Our list — Code4Lib — will be migrating to the Digital Library Federation 
> (DLF) sometime in the near future. 

This is a gentle reminder that the Code4Lib mailing list will be migrating to a 
different address sometime in the very near future. Specifically, it will be 
migrating to the Digital Library Federation. I suspect this work will be 
finished in less than thirty days, and when I know the exact address of the new 
list, I will share it here.

Thanks go to the DLF in general, and specifically Wayne Graham and Bethany 
Nowviskie for enabling this to happen. “Thanks!”

—
Eric Lease Morgan


Re: [CODE4LIB] code4lib mailing list [dlf]

2016-05-11 Thread Eric Lease Morgan
On Mar 24, 2016, at 10:29 AM, Eric Lease Morgan  wrote:

> Alas, the Code4Lib mailing list software will most likely need to be migrated 
> before the end of summer, and I’m proposing a number of possible options for the 
> list’s continued existence...


Our list — Code4Lib — will be migrating to the Digital Library Federation (DLF) 
sometime in the near future. [1] 

As I believe I alluded to previously, the University of Notre Dame (where 
Code4lib is currently being hosted) is discontinuing support for the venerable 
LISTSERV software. The University is offering two options: 1) doing nothing and 
letting lists die, or 2) migrating them to Google Groups. Neither of the 
options appealed to me. 

Through the process of making these issues public, Bethany Nowviskie and Wayne 
Graham — both of the DLF/CLIR — have graciously offered to host our mailing 
list. “Thank you, Wayne and Bethany!!” Sometime in the near future, I’m not 
exactly sure when, our mailing list's configurations will be copied from one 
host to another, and the address of our list will change to something like 
code4...@lists.clir.org. For better or for worse, the mailing list software 
will continue to be the venerable LISTSERV software. 

‘More later, as news makes itself available. FYI.

[1] DLF - https://www.diglib.org

—
Eric Lease Morgan
Artist- And Librarian-At-Large

“Lost In Rome”


[CODE4LIB] authority work with isni

2016-04-15 Thread Eric Lease Morgan
I am thinking about doing some authority work with content from ISNI, and I 
have a few questions about the resource.

As you may or may not know, ISNI is a sort of authority database. [1] One can 
search for an identity in ISNI, identify a person of interest, get a key, 
transform the key into a URI, and use the URI to get back both human-readable 
and machine readable data about the person. For example, the following URIs 
return the same content in different forms:

  * human-readable - http://isni.org/isni/35046923
  * XML - http://isni.org/isni/35046923.xml

I discovered the former URI through a tiny bit of reading. [2] And I discovered 
the latter URI through a simple guess. What other URIs exist?
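
For what it’s worth, here is a minimal sketch of fetching and walking the machine-readable version (assuming the Python requests library, and assuming the .xml pattern above keeps working; the parsing is deliberately naive because I don’t yet know the record’s structure):

  import requests
  import xml.etree.ElementTree as ET

  # a key borrowed from the example above
  isni = '35046923'
  response = requests.get('http://isni.org/isni/' + isni + '.xml')
  response.raise_for_status()

  # walk the tree and list whatever elements and values come back
  root = ET.fromstring(response.content)
  for element in root.iter():
      if element.text and element.text.strip():
          print(element.tag, '=', element.text.strip())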

When it comes to the authority work, my goal is to enhance authority records; 
to more thoroughly associate global identifiers with named entities in a local 
authority database. Once this goal is accomplished, the library catalog 
experience can be enhanced, and the door is opened for supporting linked data 
initiatives. In order to accomplish the goal, I believe I can:

  1. get a list of authority records
  2. search for name in a global authority database (like VIAF or ISNI)
  3. if found, then update local authority record accordingly
  4. go to Step #2 for all records
  5. done

My questions are:

  * What remote authority databases are available programmatically? I already 
know of one from the Library of Congress, VIAF, and probably WorldCat 
Identities. Does ISNI support some sort of API, and if so, where is some 
documentation?

  * I believe the Library Of Congress, VIAF, and probably WorldCat Identities 
all support linked data. Does ISNI, and if so, then how is it implemented and 
can you point me to documentation?

  * When it comes to updating the local (MARC) authority records, how do you 
suggest the updates happen? More specifically, what types of values do you 
suggest I insert into what specific (MARC) fields/subfields? Some people 
advocate $0 of 1xx, 6xx, and 7xx fields. Other people suggest 024 subfields 2 
and a. Inquiring minds would like to know.

Fun with authorities!? And, “What’s in a name anyway?"

[1] ISNI - http://isni.org
[2] some documentation - http://isni.org/how-isni-works

—
Eric Lease Morgan
Lost In Rome


Re: [CODE4LIB] Software used in Panama Papers Analysis [named entities]

2016-04-08 Thread Eric Lease Morgan
On Apr 8, 2016, at 5:13 PM, Jenn C  wrote:

> I worked on a text mining project last semester where I had a bunch of
> magazines with text that was totally unstructured (from IA). I would have
> really liked to know how to work entity matching into such a project. Are
> there text mining projects out there that demonstrate doing this?

If I understand your question correctly, then the Stanford Name Entity 
Recognition (NER) library/application may be one solution. [1]

Given text as input, a named entity recognition library/application returns a 
list of nouns (names, places, and things). The things can be all sorts of stuff 
such as organizations, dates, times, fiscal amounts, etc. Stanford’s NER is 
really a Java library, but has a command-line interface. Feed it a text, and 
you get back an XML stream. The stream contains elements, and each element is 
expected to be some sort of entity. Be forewarned. For the best and most 
optimal performance, it is necessary to “train” the library/application. 
Frankly, I’ve never done that, and consequently, I guess I’ve never been 
optimal.* You also might want to take a read of the text from the Python 
Natural Language Toolkit (NLTK) module. [2] The noted chapter gives a pretty 
good overview of the subject. 
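
For what it’s worth, here is a minimal sketch of entity extraction using NLTK’s off-the-shelf chunker; it is not the Stanford library itself, but it illustrates the same idea, and the download calls are one-time setup:

  import nltk

  # one-time downloads of the tokenizer, tagger, and chunker models
  for resource in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']:
      nltk.download(resource, quiet=True)

  text = 'Eric Lease Morgan works at the University of Notre Dame in Indiana.'
  tagged = nltk.pos_tag(nltk.word_tokenize(text))

  # ne_chunk returns a tree; the labeled subtrees are the named entities
  for subtree in nltk.ne_chunk(tagged):
      if hasattr(subtree, 'label'):
          entity = ' '.join(word for word, tag in subtree.leaves())
          print(subtree.label(), entity)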

[1] NER - http://nlp.stanford.edu/software/CRF-NER.shtml
[2] NLTK chapter - http://www.nltk.org/book/ch07.html

* ‘Story of my life.

—
Eric Lease Morgan


Re: [CODE4LIB] Software used in Panama Papers Analysis

2016-04-07 Thread Eric Lease Morgan
On Apr 7, 2016, at 4:24 PM, Gregory Markus  wrote:

>> from one of the New York Times stories on the Panama Papers: "The
>> ICIJ made a number of powerful research tools available to the
>> consortium that the group had developed for previous leak
>> investigations. Those included a secure, Facebook-type forum
>> where reporters could post the fruits of their research, as well
>> as database search program called “Blacklight” that allowed the
>> teams to hunt for specific names, countries or sources.”
>> 
>> http://www.nytimes.com/2016/04/06/business/media/how-a-cryptic-message-interested-in-data-led-to-the-panama-papers.html
> 
> https://ijnet.org/en/blog/how-icij-pulled-large-scale-cross-border-investigative-collaboration


Based on my VERY quick read of the articles linked above, a group of people 
created a collaborative system for collecting, indexing, searching, and 
analyzing data/information. In the end, they facilitated the creation of 
knowledge. That sure sounds like a library to me. Kudos! I believe our 
profession has many things to learn from this example, and two of those things 
include: 1) you need full text content, and 2) controlled vocabularies are not 
a necessary component of the system. —ELM


Re: [CODE4LIB] Google can give you answers, but librarians give you the right answers

2016-04-06 Thread Eric Lease Morgan
On Apr 6, 2016, at 12:44 PM, Jason Bengtson  wrote:

> This is librarians fighting a PR battle we can't win. I doubt most people
> care about these assertions, and I certainly don't think they stand a
> chance of swaying anyone. This is like the old "librarians need to promote
> themselves better" chestnut. Losing strategies, in my opinion. Rather than
> trying to refight a battle with search technology that search technology
> has already won, libraries and librarians need to reinvent the technology
> and themselves. Semantic technologies, in particular, provide Information
> Science with extraordinary avenues for reinvention. We need to make search
> more effective and approachable, rather than wagging our finger at people
> who we think aren't searching "correctly". In the short term, data provides
> powerful opportunities. And it isn't all about writing code or wrangling
> data . . . informatics, metadata, systematic reviews, all of these are
> fertile ground for additional development. Digitization projects and other
> efforts to make special collections materials broadly accessible are
> exciting stuff, as are the developing technologies that support those
> efforts. We should be seizing the argument and shaping it, rather than
> trying to invent new bromides to support a losing fight.


+1

I wholeheartedly concur. IMHO, the problem to solve now-a-days does not 
surround search because everybody can find plenty of stuff, and the stuff is 
usually more than satisfactory. Instead, I think the problem to solve surrounds 
assisting the reader in using & understanding the stuff they find. [1] “Now 
that I’ve done the ‘perfect’ search and downloaded the subsequent 200 articles 
from JSTOR, how — given my limited resources — do I read and comprehend what 
they say? Moreover, how do I compare & contrast what the articles purport with 
the things I already know?” Text mining (a type of semantic technology) is an 
applicable tool here, but then again, “Whenever you have a hammer, everything 
begins to look like a nail."

[1] an essay elaborating on the idea of use & understand - 
http://infomotions.com/blog/2011/09/dpla/

—
Eric Lease Morgan
Artist- And Librarian-At-Large


Re: [CODE4LIB] Google can give you answers, but librarians give you the right answers

2016-04-06 Thread Eric Lease Morgan
On Apr 5, 2016, at 11:12 PM, Karen Coyle  wrote:

> Eric, there were studies done a few decades ago using factual questions. 
> Here's a critical round-up of some of the studies: 
> http://www.jstor.org/stable/25828215  Basically, 40-60% correct, but possibly 
> the questions were not representative -- so possibly the results are really 
> worse :(

Karen, interesting article, and thank you for bringing it to our attention. 
—Eric


Re: [CODE4LIB] Google can give you answers, but librarians give you the right answers

2016-04-05 Thread Eric Lease Morgan
I sincerely wonder to what extent librarians give the reader
(patrons) the right -- correct -- answer to a (reference) question.
Such is a hypothesis that can be tested and measured. Please show me
non-anecdotal evidence one way or the other. --ELM


Re: [CODE4LIB] code4lib mailing list [domain]

2016-03-27 Thread Eric Lease Morgan
On Mar 25, 2016, at 1:24 PM, Bethany Nowviskie  wrote:

> Dear all — I’ve been getting this as a digest, so apologies that I’m only 
> seeing the thread on the future of the mailing list now!
> 
> CLIR/DLF is running the same version of ye olde LISTSERV as Notre Dame, to 
> support DLF-ANNOUNCE, some of our working group lists, and (now) all of the 
> discussion lists of the National Digital Stewardship Alliance. 
> 
> We have recent experience migrating NDSA lists over from Library of Congress 
> — with archives and subscribers intact — and would be really happy to do the 
> same for Code4Lib. We could commit to supporting the list for the long haul, 
> as a contribution to this awesome community. 
> 
> It may be that people want to take the opportunity to get off LISTSERV 
> entirely, but if not — just say the word! — Bethany 
> 
> (PS: added gratitude to Eric from all of us at DLF as well.) 
> 
> — 
> Bethany Nowviskie 
> Director of the Digital Library Federation (DLF) at CLIR
> Research Associate Professor of Digital Humanities at UVa
> diglib.org | clir.org | ndsa.org | nowviskie.org


Bethany, yes, thank you. This is a very interesting offer. 

Everybody, please share with the rest of us your opinion about our mailing 
list’s domain. This need to move from the University of Notre Dame is a 
possible opportunity to have our list come from the code4lib.org domain. For 
example, the address of the list might become code4...@lists.code4lib.org. If 
it moved to Google, then the address might be code4...@googlegroups.com. Is the 
list’s address important for us to brand?

—
ELM


Re: [CODE4LIB] code4lib mailing list

2016-03-24 Thread Eric Lease Morgan
Regarding the mailing list, here is what I propose to do:

  1. Upgrade my virtual server to include more RAM and disk space.
  2. Install and configure Mailman.
  3. Ask people to subscribe to a bogus list so I/we can practice.
  4. Evaluate.
  5. If evaluation is successful, then migrate subscribers to new list.
  6. If evaluation is unsuccessful, then look for alternatives.
  7. Re-evaluate on a regular basis. 

On my mark. Get set. Go?

—
ELM


[CODE4LIB] code4lib mailing list

2016-03-24 Thread Eric Lease Morgan
Alas, the Code4Lib mailing list software will most likely need to be migrated 
before the end of summer, and I’m proposing a number of possible options for the 
list’s continued existence. 

I have been managing the Code4Lib mailing list since its inception about twelve 
years ago. This work has been both a privilege and an honor. The list itself 
runs on top of the venerable LISTSERV application and is hosted by the 
University of Notre Dame. The list includes about 3,500 subscribers, and 
traffic very very rarely gets over fifty messages a day. But alas, University 
support for LISTSERV is going away, and I believe the University wants to 
migrate the whole kit and caboodle to Google Groups.

Personally, I don’t like the idea of Code4Lib moving to Google Groups. Google 
knows enough about me (us), and I don’t feel the need for them to know more. 
Sure, moving to Google Groups includes a large convenience factor, but it also 
means we have less control over our own computing environment, let alone our 
data.

So, what do we (I) do? I see three options:

  0. Let the mailing list die — Not really an option, in my opinion
  1. Use Google Groups - Feasible, (probably) reliable, but with less control
  2. Host it ourselves - More difficult, more responsibility, all but absolute 
control

Again, personally, I like Option #2, and I would probably be willing to host 
the list on one of my computers, (and after a bit of DNS trickery) complete 
with a code4lib.org domain.

What do y’all think? If we go with Option #2, then where might we host the 
list, who might do the work, and what software might we use?

—
Eric Lease Morgan
Artist- And Librarian-At-Large


Re: [CODE4LIB] personalization of academic library websites

2016-03-23 Thread Eric Lease Morgan
On Mar 23, 2016, at 6:26 PM, Mark Weiler  wrote:

> I'm doing some exploratory research on personalization of academic library 
> websites. E.g. student logs in, the site presents books due dates, room 
> reservations, course list with associated course readings, subject 
> librarians.  For faculty members, the site might present other information, 
> such as how to put material on course reserves, deposit material into 
> institutional repository, etc.   Has anyone looked into this, or tried it?

I did quite a bit of work on this idea quite a number of years ago, measured in 
Internet time. See:

  MyLibrary@NCState (1999)
  http://infomotions.com/musings/sigir-99/

  The text describes MyLibrary@NCState, an extensible
  implementation of a user-centered, customizable interface to a
  library's collection of information resources. The system
  integrates principles of librarianship with globally networked
  computing resources creating a dynamic, customer-driven front-end
  to any library's set of materials. It supports a framework for
  libraries to provide enhanced access to local and remote sets of
  data, information, and knowledge. At the same time, it does not
  overwhelm its users with too much information because the users
  control exactly how much information is displayed to them at any
  given time. The system is active and not passive; direct human
  interaction, computer mediated guidance and communication
  technologies, as well as current awareness services all play
  indispensable roles in its implementation. 


  MyLibrary: A Copernican revolution in libraries (2005)
  http://infomotions.com/musings/copernican-mylibrary/

  "We are suffering from information overload," the speaker said.
  "There is too much stuff to choose from. We want access to the
  world's knowledge, but we only want to see one particular part of
  it at any one particular time."... The speaker was part of a
  focus group at the North Carolina State University (NCSU),
  Raleigh, back in 1997... To address the issues raised in our
  focus groups, the NCSU Libraries chose to create MyLibrary, an
  Internet-based library service. It would mimic the commercial
  portals in functionality but include library content: lists of
  new books, access to the catalog and other bibliographic indexes,
  electronic journals, Internet sites, circulation services,
  interlibrary loan services, the local newspaper, and more. Most
  importantly, we designed the system to provide access to our most
  valuable resource: the expertise of our staff. After all, if you
  are using My Yahoo! and you have a question, then who are you
  going to call? Nobody. But if you are using a library and you
  have a question, then you should be able to reach a librarian.


  MyLibrary: A digital library framework & toolkit (2008)
  http://infomotions.com/musings/mylibrary-framework/

  This article describes a digital library framework and toolkit
  called MyLibrary. At its heart, MyLibrary is designed to create
  relationships between information resources and people. To this
  end, MyLibrary is made up of essentially four parts: 1)
  information resources, 2) patrons, 3) librarians, and 4) a set of
  locally-defined, institution-specific facet/term combinations
  interconnecting the first three. On another level, MyLibrary is a
  set of object-oriented Perl modules intended to read and write to
  a specifically shaped relational database. Used in conjunction
  with other computer applications and tools, MyLibrary provides a
  way to create and support digital library collections and
  services. Librarians and developers can use MyLibrary to create
  any number of digital library applications: full-text indexes to
  journal literature, a traditional library catalog complete with
  circulation, a database-driven website, an institutional
  repository, an image database, etc. The article describes each of
  these points in greater detail.

Technologically, the problem of personalization is not difficult. Instead, the 
problem I encountered in trying to make a thing like MyLibrary a reality was 
library professional ethics. Too many librarians thought the implementation of 
the idea challenged intellectual privacy. Alas.

—
Eric Lease Morgan
Artist- And Librarian-At-Large

(574) 485-6870


[CODE4LIB] worldcat discovery versus metadata apis

2016-03-22 Thread Eric Lease Morgan
I’m curious. What is the difference between the WorldCat Discovery and WorldCat 
Metadata APIs? 

Given an OCLC number, I want to programmatically search WorldCat and get in 
return a full bibliographic record complete with authoritative subject headings 
and names. Which API should I be using?

—
Eric Morgan


Re: [CODE4LIB] research project about feeling stupid in professional communication

2016-03-22 Thread Eric Lease Morgan
In my humble opinion, what we have here is a failure to communicate. [1]

Libraries, especially larger libraries, are increasingly made up of many 
different departments, including but not limited to departments such as: 
cataloging, public services, collections, preservation, archives, and 
now-a-days departments of computer staff. From my point of view, these various 
departments fail to see the similarities between themselves, and instead focus 
on their differences. This focus on the differences is amplified by the use of 
dissimilar vocabularies and subdiscipline-specific jargon. This use of 
dissimilar vocabularies causes a communications gap and, left unresolved, 
ultimately creates animosity between groups. I believe this is especially true 
between the more traditional library departments and the computer staff. This 
communications gap is an impediment when it comes to achieving the goals of 
librarianship, and any library — whether it be big or small — needs to address 
these issues lest it waste both its time and money.

For example, the definitions of things like MARC, databases & indexes, 
collections, and services are not shared across (especially larger) library 
departments.

What is the solution to these problems? In my opinion, there are many 
possibilities, but the solution ultimately rests with individuals willing to 
take the time to learn from their co-workers. It rests in the ability to 
respect — not merely tolerate — another point of view. It requires time, 
listening, discussion, reflection, and repetition. It requires getting to know 
other people on a personal level. It requires learning what others like and 
dislike. It requires comparing & contrasting points of view. It demands 
“walking a mile in the other person’s shoes”, and can be accomplished by things 
such as the physical intermingling of departments, cross-training, and simply 
by going to coffee on a regular basis.

Again, all of us working in libraries have more similarities than differences. 
Learn to appreciate the similarities, and the differences will become 
insignificant. The consequence will be a more holistic set of library 
collections and services.

[1] I have elaborated on these ideas in a blog posting - http://bit.ly/1LDpXkc

—
Eric Lease Morgan


[CODE4LIB] Code4Croatia

2016-03-21 Thread Eric Lease Morgan
Code4Croatia was alluded to in a blog posting. [1] code4croatia++  Inquiring 
minds would like to know more. Please tell us about Code4Croatia, and don’t 
hesitate to update http://wiki.code4lib.org with details.

[1] http://blog.okfn.org/2016/03/21/codeacross-opendataday-zagreb-2016/

—Eric Morgan


Re: [CODE4LIB] onboarding developers coming from industry

2016-03-02 Thread Eric Lease Morgan
On Mar 2, 2016, at 9:48 AM, LeVan,Ralph  wrote:

> …I've written so much bloat that didn't get used because a librarian was sure 
> the system would fail without it….

I’m ROTFL because just a few minutes ago, while composing an informal essay on 
the history of bibliographic description, I wrote the following sentence:

  The result is library jargon solidified in an obscure
  data structure. Moreover, in an attempt to make the
  surrogates of library collections more meaningful, the
  information of bibliographic description bloats to fill
   ^^
  much more than the traditional three to five catalog
  cards of the past.

levan++

—
ELM


Re: [CODE4LIB] onboarding developers coming from industry

2016-03-02 Thread Eric Lease Morgan
On Mar 2, 2016, at 9:30 AM, Tom Hutchinson  wrote:

> ...To be honest I feel like I still don’t even really know what libraries / 
> librarians are yet.


  Tom, when you find out, please tell the rest of us.  ;-)  —Eric Lease Morgan


Re: [CODE4LIB] introduction, and a fun date visualization

2016-02-14 Thread Eric Lease Morgan
On Feb 10, 2016, at 1:06 AM, Greg Lindahl  wrote:

> Hi! I'm a new employee of the Internet Archive, formerly a search
> engine guy, mostly working on search for the Wayback Machine. In my
> spare time I've been working on a visualization of dates and entities
> in scanned book contents. There's a blog post about it here:
> 
> https://blog.archive.org/2016/02/09/how-will-we-explore-books-in-the-21st-century/
> 
> And the demo itself is here:
> 
> https://books.archivelab.org/dateviz/
> 
> I'm going to be attending the Philly conference, and I'm looking
> forward to hearing from folks about other discovery tools driven
> by content or algorithmic metadata.
> 
> —
> greg


Yes, very cool. Thank you for bringing this to our attention.

From my point of view, Greg, you have created an alternative and supplemental 
index to one or more books. While printed books have a whole lot of utility, 
digital books manifest a different set of functionality. Imagine having a 
digital book and then providing services against the text that go beyond find. 
(“Blasphemy!”) One of the services would be graphing as you (literally) 
illustrate above. Other services might be parts-of-speech analysis, definition 
extraction, tabulations of additional named-entities, etc. While reading 
fiction is many times intended for “just fun”, I believe these sorts of 
services may make fiction more interesting as well as more accessible for 
study. Again, thank you.

— 
Eric Lease Morgan


Re: [CODE4LIB] Don't Change Your Site Because of Reference Librarians RE: [CODE4LIB] Responsive website question

2016-02-08 Thread Eric Lease Morgan
On Feb 8, 2016, at 11:25 AM, Katherine Deibel  wrote:

> From a disability accessibility perspective, magnification is not purely 
> about text readability but making sure that all features of a 
> website---images, interactive widgets, text, etc.---are of use to the user. 
> Merely changing the font size is like putting out a fire in the kitchen while 
> the rest of the house is ablaze.

  deibel++  &  ROTFL  —ELM


Re: [CODE4LIB] Anyone familiar with XSLT? Im stuck

2016-01-21 Thread Eric Lease Morgan
> I have around 1400 xml files that I am trying to copy into one xml file so 
> that I can then pull out three elements from each and put into a single csv 
> file.

What are the three elements you want to pull out of each XML file, and
what do you want the CSV file to look like?

Your XML files are pretty flat, and if I understand the question
correctly, then it is all but trivial to extract your three elements
as a line of CSV. Consequently I suggest forgoing the concatenation
of all the XML files into a single file. Such only adds complexity.
Instead I suggest:

 1. Put all XML files in a directory
 2. For each XML file, process with XSL
 3. Output a line of CSV
 4. Done

#!/bin/bash

# xml2csv.sh - batch process a set of XML files

# configure (season the value of XSLTPROC to taste)
XSLTPROC=/usr/bin/xsltproc
XSLT=xml2csv.xsl

# process each file
for FILE in ./data/*.xml
do

 # do the work
 $XSLTPROC $XSLT $FILE

done

# done
exit

 $ mkdir ./data
 $ cp *.xml ./data
 $ ./xml2csv.sh > data.csv
 $ open data.csv

Just about all that is missing is:

 * what elements do you want to extract, and
 * what do you want the CSV to look like

--
ELM


[CODE4LIB] oclc member code

2016-01-21 Thread Eric Lease Morgan
Given an OCLC member code, such as BXM for Boston College, is it possible to 
use some sort of OCLC API to search WorldCat (or some other database) and 
return information about Boston College? —Eric Lease Morgan


Re: [CODE4LIB] TEI->EPUB serialization testing

2016-01-14 Thread Eric Lease Morgan
On Jan 14, 2016, at 10:32 AM, Ethan Gruber  wrote:

>>> Part of this grant stipulates that open access books be made available
>>> in EPUB 3.0.1, so I got to work on a pipeline for dynamically serializing
>>> TEI into EPUB... 
>>> http://eaditor.blogspot.com/2015/12/the-ans-digital-library-look-under-hood.html
>>>  
>>> ...http://eaditor.blogspot.com/2016/01/first-ebook-published-to-ans-digital.html
>> 
>> I wrote a similar thing a number of years ago, and it was implemented as
>> Alex Lite. [1, 2]...
>> 
>> [1] Alex Lite blog posting - http://bit.ly/eazpJY
>> [2] Alex Lite - http://infomotions.com/sandbox/alex-lite/
> 
> Thanks, Eric. Is the original code online anywhere? I will eventually write
> some XSL:FO to generate PDFs for people who want those, for some reason.

I just put my source code and much of the supporting configuration files (XSL) 
temporarily on the Web at http://infomotions.com/tmp/alex-lite-code/  Enjoy? 
—ELM


Re: [CODE4LIB] TEI->EPUB serialization testing

2016-01-14 Thread Eric Lease Morgan
On Jan 13, 2016, at 4:17 PM, Ethan Gruber  wrote:

> Part of this grant stipulates that open access books be made available in 
> EPUB 3.0.1, so I got to work on a pipeline for dynamically serializing TEI 
> into EPUB. It works pretty well, but there are some minor issues. The issues 
> might be related more to differences between individual ereader apps in 
> supporting the 3.0.1 spec than anything I might have done wrong in the 
> serialization process (the file validates according to a script I've been 
> running)…
> 
> If you are interested in more information about the framework, there's 
> http://eaditor.blogspot.com/2015/12/the-ans-digital-library-look-under-hood.html
>  and 
> http://eaditor.blogspot.com/2016/01/first-ebook-published-to-ans-digital.html.
>  It's highly LOD aware and is capable of posting to a SPARQL endpoint so that 
> information can be accessed from other archival frameworks and integrated 
> into projects like Pelagios.


I wrote a similar thing a number of years ago, and it was implemented as Alex 
Lite. [1] I started out with TEI files, and then transformed them into a number 
of derivatives: simple HTML, “cooler” HTML, PDF, and ePub. I think my ePub 
version was somewhere around 2.0. The “framework” was written in Perl, of 
course.  ;-)  The whole of Alex Lite was designed to be given away on CD or 
as an instant website. (“Just add water.”) The hard part of the whole thing 
was the creation of the TEI files in the first place. After that, everything 
was relatively easy.

[1] Alex Lite blog posting - http://bit.ly/eazpJY
[2] Alex Lite - http://infomotions.com/sandbox/alex-lite/

—
Eric Lease Morgan
Artist- And Librarian-At-Large

(A man in a trench coat approaches, and says, “Psst. Hey buddy, wanna buy a 
registration to the Code4Lib conference!?”)


Re: [CODE4LIB] The METRO Fellowship

2016-01-05 Thread Eric Lease Morgan
On Jan 5, 2016, at 1:17 PM, Nate Hill  wrote:

>  metro.org/fellowship
> 
> Our goal is to empower a small cohort of fellows to help solve
> cross-institutional problems and to spur innovation within our membership
> of libraries and archives in NYC and Westchester County as well as the
> field at large.

 Cool idea!!! —ELM


Re: [CODE4LIB] selinux [resolved]

2015-12-27 Thread Eric Lease Morgan
On Dec 27, 2015, at 8:29 AM, Michael Berkowski  wrote:

>> How do I modify the permissions of a file under the supervision of SELinux
>> so the file can be executed as a CGI script?
>> 
>> I have two CGI scripts designed to do targeted crawls against remote
>> hosts. One script uses rsync on port 873 and the other uses wget on port
>> 443. I can run these scripts as me without any problems. None. They work
>> exactly as expected. But when the scripts are executed from my HTTP server
>> and under the user apache both rsync and wget fail. I have traced the
>> errors to some sort of permission problems generated from SELinux.
> 
> /usr/sbin/semanage and some other necessary things come from the package
> policycoreutils-python
> 
> By default, Apache is disallowed from making outbound network connections
> and there's an SELinux boolean to enable it (examples here
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security-Enhanced_Linux/sect-Security-Enhanced_Linux-Booleans-Configuring_Booleans.html)
> 
> This is probably the most common thing anyone needs to change in SELinux.
> 
> $ setsebool -P httpd_can_network_connect on
> 
> (-P is to make it persist beyond reboots) As far as the wget, that setting
> alone may be enough to cure it, provided the  CGI script itself lives in a
> location Apache expects, which would already have the right context.
> Although both produce tcp errors, I'm not so certain it will also correct
> the rsync one.
> 
> To dig further, there are several actions you can take.
> 
> If something has the wrong context and you need to find out what the right
> context should be, you can list the relevant contexts along with the
> filesystem locations they're bound to with:
> 
> # list Apache-related contexts...
> $ semanage fcontext -l | grep httpd
> 
> You probably already know how to change one:
> 
> $ chcon -t new_context_name /path/to/file
> 
> It doesn't look like you got any denials related to CGI execution, so I
> would guess your scripts are where Apache expects them.
> 
> To list all Apache booleans and their states, use
> 
> $ getsebool -a | grep httpd
> 
> If you are unable to get your result using booleans or fixing the context,
> then you have to start digging into audit2allow. It will take denial lines
> from the audit log like those in your email from stdin and attempt to
> diagnose solutions with booleans, or help create a custom SELinux module to
> allow whatever you are attempting.
> 
> Start by grepping the relevant denied lines from /var/log/audit/audit.log,
> or get them from wherever you got the ones in your message. I usually put
> them into a file. Don't take every denial from the log, only the ones
> generated by the action you're trying to solve.
> 
> $ audit2allow < grepped_denials.txt
> 
> There may also be audit2why, but I don't know if CentOS6 has it and I've
> never used it.
> 
> Not sure if CentOS 6 has the updated tools which actually suggest booleans
> you can modify to fix denials, but if it does, you would get output like:
> 
> #= httpd_t ==
> 
> # This avc can be allowed using the boolean 'httpd_run_stickshift'
> allow httpd_t self:capability fowner;
> 
> # This avc can be allowed using the boolean 'httpd_execmem'
> allow httpd_t self:process execmem;
> 
> 
> If there are no booleans to modify, audit2allow will output policy
> configuration which would enable your action. Your last resort is to create
> a custom SELinux module with the -M flag that implements that policy.
> 
> # generate the module
> $ audit2allow -M YOURMODULENAME < grepped_denials.txt
> 
> Then you have to install the module
> 
> $ semodule -i YOURMODULENAME.pp
> 
> There may simpler ways of going about the module creation, but I do it so
> infrequently and this is the method I'm accustomed to. Red Hat has some
> docs here:
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security-Enhanced_Linux/sect-Security-Enhanced_Linux-Fixing_Problems-Allowing_Access_audit2allow.html
> 
> So, I hope this gets you somewhere useful. In the best case scenario, you
> should only need to enable httpd_can_network_connect.
> 
> — 
> Michael Berkowski
> University of Minnesota Libraries


Michael, resolved, and thank you for the prompt and thorough reply.

Yes, SELinux was doing its job, and it was configured to disallow network 
connections from httpd. After issuing the following command (which allows httpd 
to make network connections) both my rsync- and wget-based CGI scripts worked 
without modification:

  setsebool httpd_can_network_connect on

Maybe I’ll add the -P option later. Yippie! Thank you. 

— 
Eric Lease Morgan


Re: [CODE4LIB] selinux

2015-12-26 Thread Eric Lease Morgan
On Dec 26, 2015, at 8:14 PM, Childs, Riley  wrote:

>> How do I modify the permissions of a file under the supervision of SELinux
>> so the file can be executed as a CGI script?
>> 
>> I have two CGI scripts designed to do targeted crawls against remote
>> hosts. One script uses rsync on port 873 and the other uses wget on port
>> 443. I can run these scripts as me without any problems. None. They work
>> exactly as expected. But when the scripts are executed from my HTTP server
>> and under the user apache both rsync and wget fail. I have traced the
>> errors to some sort of permission problems generated from SELinux.
>> Specifically, SELinux generates the following errors for the rsync script:
>> 
>>  type=AVC msg=audit(1450984068.685:19667): avc:  denied  {
>>  name_connect } for  pid=11826 comm="rsync" dest=873
>>  scontext=unconfined_u:system_r:httpd_sys_script_t:s0
>>  tcontext=system_u:object_r:rsync_port_t:s0 tclass=tcp_socket
>> 
>>  type=SYSCALL msg=audit(1450984068.685:19667): arch=c03e
>>  syscall=42 success=no exit=-13 a0=3 a1=1b3c030 a2=10
>>  a3=7fffb057acc0 items=0 ppid=11824 pid=11826 auid=500 uid=48
>>  gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48
>>  tty=(none) ses=165 comm="rsync" exe="/usr/bin/rsync"
>>  subj=unconfined_u:system_r:httpd_sys_script_t:s0 key=(null)
>> 
>> SELinux generates these errors for the wget script:
>> 
>>  type=AVC msg=audit(1450984510.396:19715): avc:  denied  {
>>  name_connect } for  pid=13263 comm="wget" dest=443
>>  scontext=unconfined_u:system_r:httpd_sys_script_t:s0
>>  tcontext=system_u:object_r:http_port_t:s0 tclass=tcp_socket
>> 
>>  type=SYSCALL msg=audit(1450984510.396:19715): arch=c03e
>>  syscall=42 success=no exit=-13 a0=4 a1=7ffe1d05b890 a2=10
>>  a3=7ffe1d05b4f0 items=0 ppid=13219 pid=13263 auid=500 uid=48
>>  gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48
>>  tty=(none) ses=165 comm="wget" exe="/usr/bin/wget"
>>  subj=unconfined_u:system_r:httpd_sys_script_t:s0 key=(null)
>> 
>> How do I diagnose these errors? Do I need to use something like chcon to
>> change my CGI scripts’ permissions? Maybe I need to use chcon to change
>> rsync’s or wget’s permissions? Maybe I need to use something like semanage
>> (which doesn’t exist on my system) to change the user apache’s permissions
> 
> SELinux :) Which distro are you running?

  I am running CentOS release 6.7. —ELM


[CODE4LIB] selinux

2015-12-26 Thread Eric Lease Morgan
How do I modify the permissions of a file under the supervision of SELinux so 
the file can be executed as a CGI script?

I have two CGI scripts designed to do targeted crawls against remote hosts. One 
script uses rsync on port 873 and the other uses wget on port 443. I can run 
these scripts as me without any problems. None. They work exactly as expected. 
But when the scripts are executed from my HTTP server and under the user apache 
both rsync and wget fail. I have traced the errors to some sort of permission 
problems generated from SELinux. Specifically, SELinux generates the following 
errors for the rsync script:

  type=AVC msg=audit(1450984068.685:19667): avc:  denied  {
  name_connect } for  pid=11826 comm="rsync" dest=873
  scontext=unconfined_u:system_r:httpd_sys_script_t:s0
  tcontext=system_u:object_r:rsync_port_t:s0 tclass=tcp_socket

  type=SYSCALL msg=audit(1450984068.685:19667): arch=c03e
  syscall=42 success=no exit=-13 a0=3 a1=1b3c030 a2=10
  a3=7fffb057acc0 items=0 ppid=11824 pid=11826 auid=500 uid=48
  gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48
  tty=(none) ses=165 comm="rsync" exe="/usr/bin/rsync"
  subj=unconfined_u:system_r:httpd_sys_script_t:s0 key=(null)

SELinux generates these errors for the wget script:

  type=AVC msg=audit(1450984510.396:19715): avc:  denied  {
  name_connect } for  pid=13263 comm="wget" dest=443
  scontext=unconfined_u:system_r:httpd_sys_script_t:s0
  tcontext=system_u:object_r:http_port_t:s0 tclass=tcp_socket

  type=SYSCALL msg=audit(1450984510.396:19715): arch=c03e
  syscall=42 success=no exit=-13 a0=4 a1=7ffe1d05b890 a2=10
  a3=7ffe1d05b4f0 items=0 ppid=13219 pid=13263 auid=500 uid=48
  gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48
  tty=(none) ses=165 comm="wget" exe="/usr/bin/wget"
  subj=unconfined_u:system_r:httpd_sys_script_t:s0 key=(null)

How do I diagnose these errors? Do I need to use something like chcon to change 
my CGI scripts’ permissions? Maybe I need to use chcon to change rsync’s or 
wget’s permissions? Maybe I need to use something like semanage (which doesn’t 
exist on my system) to change the user apache’s permissions?

This is a level of the operating system with which I am unfamiliar. 

— 
Eric Lease Morgan


Re: [CODE4LIB] yaml/xml/json, POST data, bloodcurdling terror

2015-12-17 Thread Eric Lease Morgan
On Dec 17, 2015, at 8:22 AM, Andromeda Yelton  
wrote:

> I strongly recommend this hilarious, terrifying PyCon talk about
> vulnerabilities in yaml, xml, and json processing:
> 
>   https://www.youtube.com/watch?v=kjZHjvrAS74
> 
> If you process user-submitted data in these formats and don't yet know why
> you should be flatly terrified, please watch this ASAP; it's illuminating.
> If you *do* know why you should be terrified, watch it anyway and giggle
> along in knowing recognition, because the talk is really very funny.


Obviously, the sorts of things outlined in the presentation above are real, and 
they are really scary. We developers need to take note: getting input from the 
‘Net can be a really bad thing. —Eric Lease Morgan
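
To make the danger concrete, here is a minimal sketch, assuming the third-party
PyYAML module is installed. The permissive loaders will happily construct
arbitrary Python objects from a document like the one below; safe_load() will
not:

  # a hedged sketch; assumes PyYAML (pip install pyyaml)
  import yaml

  # a booby-trapped "document" of the sort described in the talk
  untrusted = "!!python/object/apply:os.system ['echo pwned']"

  try:
      # safe_load() only builds plain scalars, lists, and dicts,
      # so the python/object tag above is rejected outright
      print(yaml.safe_load(untrusted))
  except yaml.YAMLError as error:
      print('refused to parse:', error)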


Re: [CODE4LIB] dublin core files [and unicorns]

2015-11-27 Thread Eric Lease Morgan
On Nov 24, 2015, at 8:20 PM, Eric Lease Morgan  wrote:

>>> Do Dublin Core files exist, and if so, then can somebody show me one? Put 
>>> another way, can you point me to a DTD or schema denoting Dublin Core XML? 
>>> The closest I can come is the standard/default oai_dc description of an 
>>> OAI-PMH item.
>> 
>> On Nov 24, 2015, at 8:11 PM, Benjamin Florin  
>> wrote:
>> 
>> Sometimes the Dublin Core documentation uses "Dublin Core record" to
>> describe XML records that use Dublin core vocabulary, for example:
>> http://dublincore.org/documents/2003/04/02/dc-xml-guidelines/
>> 
>> Those records do use the Simple and Qualified Dublin Core XML Schema <
>> http://dublincore.org/schemas/xmls/>, which basically layout a list of
>> simple elements with DC labels that may contain strings and possibly a
>> language attribute.
> 

> From one of the links above I see a viable schema:
> 
> http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd
> 
> And yes, I haven’t seen any Dublin Core records “in the wild” either, but 
> based on the information above, they apparently can exist. Thank you.


I take back what I said earlier. Dublin Core records don’t exist, and I would 
like to reinforce what was said by Benjamin, "Sometimes the Dublin Core 
documentation uses 'Dublin Core record' to describe XML records that use Dublin 
core vocabulary.” In this vein, I think Dublin Core records are similar 
to unicorns, and I wish Library Land would stop alluding to them.

Benjamin points to as many as three different XML schema describing the 
implementation of Dublin Core:

 1. http://dublincore.org/schemas/xmls/simpledc20021212.xsd
 2. http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd
 3. http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd

None of these schemas defines a root element, and therefore it is not possible to 
create an XML file that both: 1) validates against any of the schemas, and 2) 
does not declare another schema to contain the Dublin Core data. If a given XML 
file does validate, then it validates not against the Dublin Core schema but 
against the additional, containing schema. An XML file must have one and only 
one root element, and the schemas listed above do not define root elements. 
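
This is why the Dublin Core one does see in the wild is always wrapped in
somebody else’s root element. A minimal sketch, assuming the lxml library is
installed, of the familiar oai_dc arrangement, where the root element comes
from the OAI-PMH schema and only the children come from the Dublin Core
namespace:

  # a hedged sketch; assumes lxml (pip install lxml)
  from lxml import etree

  OAI_DC = 'http://www.openarchives.org/OAI/2.0/oai_dc/'
  DC     = 'http://purl.org/dc/elements/1.1/'

  # the root element (oai_dc:dc) belongs to the OAI-PMH schema...
  root = etree.Element('{%s}dc' % OAI_DC, nsmap={'oai_dc': OAI_DC, 'dc': DC})

  # ...and only the children belong to the Dublin Core namespace
  for name, value in [('title', 'Walden'), ('creator', 'Thoreau, Henry David')]:
      etree.SubElement(root, '{%s}%s' % (DC, name)).text = value

  print(etree.tostring(root, pretty_print=True).decode())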

One of my students identified a number of ways Dublin Core data could be 
embedded in HTML [1], but again, such files are not Dublin Core files. Instead, 
they are HTML files.

The idea of a “Dublin Core record” probably stems from the idea of a “MARC 
record” which is bad in and of itself. For example, how many times have you 
seen a delimited version of MARC called a ‘MARC record’? The idea of a "Dublin 
Core record” seems detrimental to the understanding of what Dublin Core is and 
how it is implemented. 

Dublin Core is a set of element names coupled with very loose definitions of 
what those names are to contain and how they are to be applied. 

To what degree am I incorrect? What am I missing?

[1] DC-HTML - http://dublincore.org/documents/dc-html/

—
Eric Lease Morgan
Artist- And Librarian-At-Large


Re: [CODE4LIB] dublin core files

2015-11-24 Thread Eric Lease Morgan
On Nov 24, 2015, at 8:11 PM, Benjamin Florin  wrote:

> Sometimes the Dublin Core documentation uses "Dublin Core record" to
> describe XML records that use Dublin core vocabulary, for example:
> http://dublincore.org/documents/2003/04/02/dc-xml-guidelines/
> 
> Those records do use the Simple and Qualified Dublin Core XML Schema <
> http://dublincore.org/schemas/xmls/>, which basically layout a list of
> simple elements with DC labels that may contain strings and possibly a
> language attribute.
> 
> I don't imagine "Dublin Core records" exist in the wild, but I have seen
> records that include the DC XML Schemas and to make use of the DC namespace.


Cool. From one of the links above I see a viable schema:

  http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd

And yes, I haven’t seen any Dublin Core records “in the wild” either, but based 
on the information above, they apparently can exist. Thank you.

—
ELM


[CODE4LIB] dublin core files

2015-11-24 Thread Eric Lease Morgan
What in the world are “Dublin Core files”? 

I’m teaching an online XML class to library school graduate students. The 
previous instructors of the class have asked the students to create “Dublin Core 
files” used to describe the content of things like TEI files. For the life of 
me, I cannot figure out what a Dublin Core file is. To my mind, Dublin Core is 
all about a set of 15 (or so) names/labels used to more or less describe stuff. 
But people speak as if there is such a thing as a Dublin Core (XML) file.

Do Dublin Core files exist, and if so, then can somebody show me one? Put 
another way, can you point me to a DTD or schema denoting Dublin Core XML? The 
closest I can come is the standard/default oai_dc description of an OAI-PMH 
item.

P.S. Showing me how to incorporate Dublin Core into HTML doesn’t count. Such 
are not Dublin Core files as much as they are HTML files.

— 
Eric Lease Morgan
Artist- and Librarian-At-Large


[CODE4LIB] bibframe

2015-10-15 Thread Eric Lease Morgan
[Forwarded upon request. —E “Lost In Venice” M ]


> From: "Fultz, Tamara" 
> Subject: Question about posting
> Date: October 15, 2015 at 12:43:08 AM GMT+2
> To: "code4lib-requ...@listserv.nd.edu" 
>  
> Implementing BIBFRAME
> The UC Davis BIBFLOW Project
> Presented by the New York Technical Services Librarians
> 
> With a focus on cataloging, Xiaoli Li will present the UC Davis BIBFLOW 
> project with a preview of its linked data cataloging tools and workflows. 
> This project was designed to examine how BIBFRAME can be adopted and how it 
> will affect daily library operations.
> 
> Date:
> Monday, November 16, 2015
> 5:00 – 7:45 PM
> Refreshments: 5 – 6 PM
> Program 6 – 7:45 PM
> 
> Location:
> The New York Public Library, Stephen A. Schwarzman Building
> Margaret Liebman Berger Forum, Room 227
> 476 Fifth Avenue (at 42nd Street)
> New York, NY 10018
> 
> $15 for current members
> $30 for event + new or renewed membership
> $20 for event + new or renewed student membership
> $40 for non-members
> 
> View more information and register at 
> http://nytsl.org/nytsl/implementing-bibframe-the-uc-davis-bibflow-project/ 
> 
>  
> —
> Tamara Lee Fultz
> Associate Museum Librarian
> Thomas J. Watson Library
> Metropolitan Museum of Art
> 212-650-2443
> tamara.fu...@metmuseum.org 


[CODE4LIB] code4lib chicago meeting

2015-10-05 Thread Eric Lease Morgan
A Code4Lib Chicago meeting has been scheduled for Monday, November 23 from 8:30 
to 5 o’clock at the University of Illinois-Chicago. [1] Sign up early. Sign up 
often.

[1] meeting - http://wiki.code4lib.org/Code4Lib_Chicago

—
Eric Lease Morgan, Librarian-At-Large


Re: [CODE4LIB] code4lib chicago

2015-09-02 Thread Eric Lease Morgan
On Sep 2, 2015, at 12:04 PM, Cary Gordon  wrote:

> http://cod4lib.com

 ROTFL!!! —Eric Morgan


Re: [CODE4LIB] "coders for libraries"

2015-09-01 Thread Eric Lease Morgan
On Sep 1, 2015, at 9:42 AM, Eric Hellman  wrote:

> As someone who feels that Code4Lib should welcome people who don't 
> particularly identify as "coders", I would welcome a return to the previous 
> title attribute.

  1++ because I believe it is more about libraries than it is about code.  —ELM


Re: [CODE4LIB] Code4libBC (Vancouver, BC) - save the date November 26/27. 2015

2015-09-01 Thread Eric Lease Morgan
On Aug 31, 2015, at 9:23 PM, Cary Gordon  wrote:

> Perhaps this belongs on the Cod4lib list.
 ^^^

Yesterday, I didn’t quite understand the allusion to the East Coast, but now I 
see that I lost an e in Code4Lib. Cod4Lib. That's pretty funny. Thanks!  :-D  
—Earache


Re: [CODE4LIB] code4lib chicago

2015-08-31 Thread Eric Lease Morgan
On Aug 28, 2015, at 11:56 AM, Allan Berry  wrote:

> The UIC Library would be happy to host the Code4Lib event, in November or 
> early December.

The folks at University of Illinois-Chicago would like to sponsor a one-day 
Cod4Lib event, and in order to determine the best date, they are asking folks 
to complete the following Doodle Poll:

  http://doodle.com/45aukez6z6pyav62

Code4Lib events are great ways to meet people doing the same work you are doing 
to discuss common problems and solutions. Chicago is large and central. Fill 
out the Poll. Come to Chicago. Invigorate your professional life.

—
Eric Morgan


Re: [CODE4LIB] Code4Lib 2016: Philadelphia - Save the Date [url]

2015-08-10 Thread Eric Lease Morgan
On Aug 10, 2015, at 11:38 AM, David Lacy  wrote:

> The 2016 conference will be held from March 7 through March 10 in the Old 
> City District of Philadelphia.  This location puts conference attendees 
> within easy walking distance of many of Philadelphia’s historical treasures, 
> including Independence Hall, the Liberty Bell, the Constitution Center, and 
> the house where Thomas Jefferson drafted the Declaration of Independence. 
> Attendees will also be a very short distance from the Delaware River 
> waterfront and will be a short walk from numerous excellent restaurants.


  Cool! Is there an official Code4Lib 2016 Annual Meeting URL, and if so, then 
what is it? —Eric Morgan


[CODE4LIB] code4lib chicago (chicode4lib)

2015-07-29 Thread Eric Lease Morgan
As some of you in & around Chicago may or may not know, there is a Code4Lib 
Chicago group called chicode4lib. See the Google Group:

  https://groups.google.com/forum/#!forum/chicode4lib

I’m simply trying to drum up business for the community.

— 
Eric Lease Morgan


Re: [CODE4LIB] survey of image analysis packages

2015-07-25 Thread Eric Lease Morgan
On Jul 22, 2015, at 6:49 PM, Peter Mangiafico  wrote:

> I am conducting a survey of software used for image analysis and metadata 
> enhancement.  Examples include facial recognition, object identification, 
> similarity matching, and so on.  The goal is to understand if it is possible 
> to use algorithmic techniques to improve discoverability in a large dataset 
> consisting mostly of images.  The main project I am working on is automobile 
> history (http://revslib.stanford.edu 
> >) but the 
> techniques can of course be applied much more widely.  I'm interested in a 
> broad sweep, could be open source, commercial, service model, API, etc.  If 
> you have projects you are aware of, or tools you have used or heard about, 
> and wouldn't mind sending me an email, I'd appreciate it!


Alas, I do not have anything to contribute to your survey, but I sure would 
like to see the results. I believe image analysis of this sort is something to 
be taken advantage of in libraries. ‘Looking forward. —Eric Morgan


[CODE4LIB] jstor workset browser

2015-07-11 Thread Eric Lease Morgan
I have begun working on a suite of software designed to enable a person to 
“read” the full text of hundreds (if not a thousand) articles from JSTOR 
simultaneously, and I call this software the JSTOR Workset Browser. [1]

Using JSTOR’s Data For Research service, it is possible for anybody to first 
search & browse the totality of JSTOR. [2] The reader is then able to create 
and download a “dataset” describing found items of interest. This dataset 
includes a citations.xml file. The Browser takes this citations.xml file as 
input and then: 1) harvests the content, 2) indexes it, 3) does some analysis 
against the content, 4) creates a few graphs illustrating characteristics of 
the dataset, and finally 5) generates a browsable “catalog” in the form of an 
HTML table. The table includes columns for things like authors, titles, dates 
as well as page lengths, number of words, and coefficients denoting the use of 
color words, “big” names, and “great” ideas. In the near future the Browser 
will support search as well as the generation of a report describing each 
reader-generated (curated) collection. You can see a number of collections 
created to date, including writings about Thoreau, Emerson, Dickinson, 
Longfellow, and Poe. [3]

Combined with similar tools designed to work against the HathiTrust and/or 
EEBO-TCP, the ultimate goal is to enable students and scholars to easily do 
research against massive amounts of content quickly and easily. [4, 5]

I’m looking for additional sample content. If you create a dataset from DFR, 
then send me the citations.xml file, and I will use it as input for the 
Browser. “Wanna play?”


[1] Browser on GitHub - http://bit.ly/jstor-workset-browser
[2] Data For Research - http://dfr.jstor.org
[3] sample collections - http://dh.crc.nd.edu/sandbox/jstor-workset-browser/
[4] HathiTrust Workset Browser - 
https://github.com/ericleasemorgan/HTRC-Workset-Browser
[5] EEBO-TCP Workset Browser - 
https://github.com/ericleasemorgan/EEBO-TCP-Workset-Browser


—
Eric Lease Morgan, Librarian


[CODE4LIB] position description

2015-07-09 Thread Eric Lease Morgan
Below is a position description, posted by request:

  East Carolina University’s Joyner Library is seeking to fill the
  position of Head of the Application & Digital Services (ADS) department.
  The ADS team has a track record of implementing open source solutions
  and developing custom applications. This team works closely with
  personnel across ECU Libraries to develop, manage, and support
  large-scale applications, such as the Digital Collections repository,
  the ScholarShip Institutional Repository, and the Blacklight library
  catalog discovery layer. The successful candidate will provide
  leadership and vision for the ADS department and ensure departmental
  goals are met.

  The Head’s primary roles will be as a project manager for new and
  existing development projects and manager of the staff in the department
  (currently 6 people). Knowledge of and ability to institute and
  communicate a development process to analyze, design, develop,
  implement, and evaluate each project will be critical. Key skills in the
  position will be effective communication and decision making, as well as
  the ability to work with stakeholders to maintain and improve tools and
  interfaces.

  Additionally, this person will collaborate with colleagues to establish
  and manage metrics for measuring, analyzing, and optimizing user
  satisfaction. An important role for the Head of ADS will be to monitor
  trends in web design/development within the library environment and plan
  strategically to implement innovative changes for the Libraries. Also
  important is the responsibility of setting priorities, ensuring
  professional growth of the members of the department, and managing
  department activities to assure the best use of time and resources to
  meet project-defined objectives and deliverables.


For more detail, see: http://www.ecu.edu/cs-lib/about/job942036.cfm

—
Eric Lease Morgan


[CODE4LIB] eebo-tcp workset browser

2015-06-20 Thread Eric Lease Morgan
I have put on GitHub a thing I call the EEBO-TCP Workset Browser. [1] From the 
README file:

  The EEBO-TCP Workset Browser is a suite of software designed to support
  "distant reading" against the corpus called the Early English Books
  Online - Text Creation Partnership corpus. Using the Browser it is
  possible to: 1) search a "catalog" of the corpus's metadata, 2) create a
  list of identifiers representing a subset of content for study, 3) feed
  the identifiers to a set of files which will mirror the content locally,
  index it, and do some rudimentary analysis outputting as set of HTML
  files, structured data, and graphs. The reader is then expected to
  examine the output more "closely" (all puns intended) using their
  favorite Web browser, text editor, spreadsheet, database, or statistical
  application. The purpose and functionality of this suite is very similar
  to the purpose and functionality of HathiTrust Research Center Workset
  Browser.

[1] EEBO-TCP Workset Browser - 
https://github.com/ericleasemorgan/EEBO-TCP-Workset-Browser

—
Eric Lease Morgan, Librarian
University of Notre Dame


Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Eric Lease Morgan
On Jun 18, 2015, at 12:02 PM, Matt Sherman  wrote:

> I am working with colleague on a side project which involves some scanned
> bibliographies and making them more web searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects we
> need, but I am at a bit of a loss on how to automate the process of putting
> the bibliography in a more structured format so that we can avoid going
> through hundreds of pages by hand.  I am pretty sure regular expressions
> are needed, but I have not had an instance where I need to automate
> extracting data from one file type (PDF OCR or text extracted to Word doc)
> and place it into another (either a database or an XML file) with some
> enrichment.  I would appreciate any suggestions for approaches or tools to
> look into.  Thanks for any help/thoughts people can give.


If I understand your question correctly, then you have two problems to address: 
1) converting PDF, Word, etc. files into plain text, and 2) marking up the 
result (which is a bibliography) into structured data. Correct?

If so, and if your PDF documents have already been OCRed or you have other 
files, then you can probably feed them to TIKA to quickly and easily extract 
the underlying plain text. [1] I wrote a brain-dead shell script to run TIKA in 
server mode and then convert Word (.docx) files. [2]
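
For what it is worth, the same trick can be done from Python; a minimal sketch,
assuming the requests library is installed and a Tika server is already
listening on its default port (the file name is only an example):

  # a hedged sketch; assumes requests and a running Tika server
  # (java -jar tika-server.jar, which listens on localhost:9998)
  import requests

  with open('bibliography.docx', 'rb') as handle:
      response = requests.put(
          'http://localhost:9998/tika',   # the plain-text extraction endpoint
          data=handle,
          headers={'Accept': 'text/plain'})

  print(response.text)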

When it comes to marking up the result into structured data, well, good luck. I 
think such an application is something Library Land has sought for a long time. 
“Can you say Holy Grail?"

[1] Tika - https://tika.apache.org
[2] brain-dead script - 
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff

— 
Eric


[CODE4LIB] eebo-tcp "browser"

2015-06-11 Thread Eric Lease Morgan
Much like my HathiTrust Research Center Workset Browser, I have been able to 
create a (fledgling) “browser” against the EEBO-TCP content:

  I have begun creating a “browser” against content from EEBO-TCP
  in the same way I have created a browser against worksets from
  the HathiTrust. The goal is to provide “distant reading” services
  against subsets of the Early English poetry and prose. You can
  see these fledgling efforts against a complete set of Richard
  Baxter’s works. Baxter was an English Puritan church leader,
  poet, and hymn-writer. [1, 2, 3]...

  The EEBO-TCP Workset Browser is not as mature as my HathiTrust
  Workset Browser, but it is coming along. [15] Next steps include:
  calculating an integer denoting the number of pages in an item,
  implementing a Web-based search interface to a subset’s full text
  as well as metadata, putting the source code (written in Python
  and Bash) on GitHub. After that I need to: identify more robust
  ways to create subsets from the whole of EEBO, provide links to
  the raw TEI/XML as well as HTML versions of items, implement
  quite a number of cosmetic enhancements, and most importantly,
  support the means to compare & contrast items of interest in each
  subset. Wish me luck?

   1. Richard Baxter (the person) – http://en.wikipedia.org/wiki/Richard_Baxter
   2. Richard Baxter (works) – http://bit.ly/ebbo-browser-baxter-works
   3. Richard Baxter (analysis of works) – 
http://bit.ly/eebo-browser-baxter-analysis
  15. HathiTrust Workset Browser – 
https://github.com/ericleasemorgan/HTRC-Workset-Browser


For more detail, please see the blog posting — 
http://bit.ly/emorgan-eebo-browser

Fun with well-structured data, open access content, and the definition of 
librarianship?

—
Eric Lease Morgan
University of Notre Dame


[CODE4LIB] hathitrust research center user group meeting [tomorrow (thursday)]

2015-06-10 Thread Eric Lease Morgan
Consider participating in a conference call (tomorrow, Thursday) on the topic 
of the HathiTrust Research Center.

A HathiTrust Research Center User’s Group Meeting is scheduled for tomorrow 
(Thursday), June 11 from 3-4 o’clock-ish:

   Who - anybody and everybody
  What - a discussion of all things HathiTrust Research Center
  When - Thursday, June 11 from 3-4:00 Eastern Time
 Where - via the telephone: (812) 856-3600 or (317) 278-7008 with PIN 803140#
   Why - because both you and they have something to offer librarianship

More specifically, Thursday's conference call is about at least two things: 1) 
your concerns regarding the Center, and 2) a discussion of my fledgling 
"Workset Browser". [1, 2] This is an opportunity for you to learn the why's & 
wherefore's of the Center, as well as influence the direction of programming 
initiatives. For example, you can learn more about the Center's authorization 
and copyright restrictions. You can also discuss how you think the Center can 
provide support for the digital humanities and text mining. 

[1] HathiTrust Research Center - http://hathitrust.org/htrc
[2] blog posting describing the "Browser" -  
http://blogs.nd.edu/emorgan/2015/05/htrc-workset-browser/

—
Eric Lease Morgan
University of Notre Dame


Re: [CODE4LIB] eebo [perfect texts]

2015-06-08 Thread Eric Lease Morgan
On Jun 8, 2015, at 7:32 AM, Owen Stephens  wrote:

> I’ve just seen another interesting take based (mainly) on data in the 
> TCP-EEBO release:
> 
>   
> https://scalablereading.northwestern.edu/2015/06/07/shakespeare-his-contemporaries-shc-released/
> 
> It includes mention of MorphAdorner[1] which does some clever stuff around 
> tagging parts of speech, spelling variations, lemmata etc. and another tool 
> which I hadn’t come across before AnnoLex[2] "for the correction and 
> annotation of lexical data in Early Modern texts”.
> 
> This paper[3] from Alistair Baron and Andrew Hardie at the University of 
> Lancaster in the UK about preparing EEBO-TCP texts for corpus-based analysis 
> may also be of interest, and the team at Lancaster have developed a tool 
> called VARD which supports pre-processing texts[4]
> 
> [1] http://morphadorner.northwestern.edu
> [2] http://annolex.at.northwestern.edu
> [3] http://eprints.lancs.ac.uk/60272/1/Baron_Hardie.pdf
> [4] http://ucrel.lancs.ac.uk/vard/about/


All of this is really very interesting. Really. At the same time, there seems 
to be a WHOLE lot of effort spent on cleaning and normalizing data, and very 
little done to actually analyze it beyond “close reading”. The final goal of 
all these interfaces seems to be refined search. Frankly, I don’t need search. 
And the only community who will want this level of search will be the scholarly 
scholar. “What about the undergraduate student? What about the just more than 
casual reader? What about the engineer?” Most people don’t know how or why 
parts-of-speech are important let alone what a lemma is. Nor do they care. I 
can find plenty of things. I need (want) analysis. Let’s assume the data is 
clean — or rather, accept the fact that there is dirty data akin to the dirty 
data created through OCR and there is nothing a person can do about it — let’s 
see some automated comparisons between texts. Examples might include:

  * this one is longer
  * this one is shorter
  * this one includes more action
  * this one discusses such & such theme more than this one
  * so & so theme came and went during a particular time period
  * the meaning of this phrase changed over time
  * the author’s message of this text is…
  * this given play asserts the following facts
  * here is a map illustrating where the protagonist went when
  * a summary of this text includes…
  * this work is fiction
  * this work is non-fiction
  * this work was probably influenced by…

We don’t need perfect texts before analysis can be done. Sure, perfect texts 
help, but they are not necessary. Observations and generalizations can be made 
even without perfectly transcribed texts. 
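
Even the simplest of the comparisons above requires nothing more than a word
count; a minimal sketch, assuming two plain-text transcriptions (the file names
are only examples):

  # a hedged sketch; dirty OCR changes the counts a little, not the conclusion
  a = open('baxter-sermon.txt').read().split()
  b = open('baxter-treatise.txt').read().split()

  longer = 'first' if len(a) > len(b) else 'second'
  print('the %s text is longer (%d versus %d words)' % (longer, len(a), len(b)))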

—
ELM


Re: [CODE4LIB] eebo [developments]

2015-06-07 Thread Eric Lease Morgan
Here are some developments from my playing with the EEBO data. 

I used the repository on Box to get my content, and I mirrored it locally. [1, 
2] I then looped through the content using XPath to extract rudimentary 
metadata, thus creating a “catalog” (index). Along the way I calculated the 
number of words in each document and saved that as a field of each "record". 
Being a tab-delimited file, it is trivial to import the catalog into my 
favorite spreadsheet, database, editor, or statistics program. This allowed me 
to browse the collection. I then used grep to search my catalog, and save the 
results to a file. [5] I searched for Richard Baxter. [6, 7, 8]. I then used an 
R script to graph the numeric data of my search results. Currently, there are 
only two types: 1) dates, and 2) number of words. [9, 10, 11, 12] From these 
graphs I can tell that Baxter wrote a lot of relatively short things, and I can 
easily see when he published many of his works. (He published a lot around 1680 
but little in 1665.) I then transformed the search result!
 s into a browsable HTML table. [13] The table has hidden features. (Can you 
say, “Usability?”) For example, you can click on table headers to sort. This is 
cool because I want sort things by number of words. (Number of pages doesn’t 
really tell me anything about length.) There is also a hidden link to the left 
of each record. Upon clicking on the blank space you can see subjects, 
publisher, language, and a link to the raw XML. 

For a good time, I then repeated the process for things Shakespeare and things 
astronomy. [14, 15] Baxter took me about twelve hours worth of work, not 
counting the caching of the data. Combined, Shakespeare and astronomy took me 
less than five minutes. I then got tired.
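
For the record, the catalog-building step boils down to a loop of XPath
expressions; a minimal sketch, assuming the lxml library, and the element paths
shown are typical of TEI P5 headers rather than anything authoritative:

  # a hedged sketch; assumes lxml and the usual TEI P5 header layout
  from lxml import etree

  NS = {'tei': 'http://www.tei-c.org/ns/1.0'}

  def first(tree, xpath):
      hits = tree.xpath(xpath, namespaces=NS)
      return ' '.join(hits[0].split()) if hits else ''

  tree   = etree.parse('A06567.xml')   # one record from the local mirror
  title  = first(tree, '//tei:fileDesc/tei:titleStmt/tei:title[1]//text()')
  author = first(tree, '//tei:fileDesc/tei:titleStmt/tei:author[1]//text()')
  date   = first(tree, '//tei:sourceDesc//tei:date[1]//text()')

  # the number of words, saved as a field of the "record"
  words = len(' '.join(tree.xpath('//tei:text//text()', namespaces=NS)).split())

  print('\t'.join([title, author, date, str(words)]))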

My next steps are multi-faceted and presented in the following incomplete 
unordered list:

  * create browsable lists - the TEI metadata is clean and
consistent. The authors and subjects lend themselves very well to
the creation of browsable lists.

  * CGI interface - The ability to search via Web interface is
imperative, and indexing is a prerequisite.

  * transform into HTML - TEI/XML is cool, but…

  * create sets - The collection as a whole is very interesting,
but many scholars will want sub-sets of the collection. I will do
this sort of work, akin to my work with the HathiTrust. [16]

  * do text analysis - This is really the whole point. Given the
full text combined with the inherent functionality of a computer,
additional analysis and interpretation can be done against the
corpus or its subsets. This analysis can be based on the counting of
words, the association of themes, parts-of-speech, etc. For
example, I plan to give each item in the collection colors,
“big” names, and “great” ideas coefficients. These are scores
denoting the use of researcher-defined “themes”. [17, 18, 19] You
can see how these themes play out against the complete writings
of “Dead White Men With Three Names”. [20, 21, 22]

Fun with TEI/XML, text mining, and the definition of librarianship.


 [1] Box - http://bit.ly/1QcvxLP
 [2] mirror - http://dh.crc.nd.edu/sandbox/eebo-tcp/xml/
 [3] xpath script - http://dh.crc.nd.edu/sandbox/eebo-tcp/bin/xml2tab.pl
 [4] catalog (index) - http://dh.crc.nd.edu/sandbox/eebo-tcp/catalog.txt
 [5] search results - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.txt
 [6] Baxter at VIAF - http://viaf.org/viaf/54178741
 [7] Baxter at WorldCat - http://www.worldcat.org/wcidentities/lccn-n50-5510
 [8] Baxter at Wikipedia - http://en.wikipedia.org/wiki/Richard_Baxter
 [9] box plot of dates - 
http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-dates.png
[10] box plot of words - 
http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-words.png
[11] histogram of dates - 
http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-dates.png
[12] histogram of words - 
http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-words.png
[13] HTML - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.html
[14] Shakespeare - http://dh.crc.nd.edu/sandbox/eebo-tcp/shakespeare/
[15] astronomy - http://dh.crc.nd.edu/sandbox/eebo-tcp/astronomy/
[16] HathiTrust work - http://blogs.nd.edu/emorgan/2015/06/browser-on-github/
[17] colors - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-colors.txt
[18] “big” names - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-names.txt
[19] “great” ideas - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-ideas.txt
[20] Thoreau - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/thoreau/about.html
[21] Emerson - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/emerson/about.html
[22] Channing - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/channing/about.html


—
Eric Lease Morgan, Librarian
University of Notre Dame


Re: [CODE4LIB] eebo [resolved and coolness!!]

2015-06-05 Thread Eric Lease Morgan
On Jun 5, 2015, at 8:10 AM, Eric Lease Morgan  wrote:

> Does anybody here have experience reading the SGML/XML files representing the 
> content of EEBO? 

I ultimately found the EEBO files in the form of TEI, and then I was able to 
transform one of them into VERY functional HTML5. Coolness! Here’s the recipe:

 1. download P5 from Box [1]
 2. download stylesheets from GitHub [2]
 3. transform using Saxon [3]
 4. save output to HTTP server 
 5. open in browser [4]
 6. read results AND get scanned image

Nice clean data + fully functional stylesheets = really cool output

[1] P5 - http://bit.ly/1QcvxLP
[2] stylesheets - https://github.com/TEIC/Stylesheets
[3] transform - java -cp saxon9he.jar net.sf.saxon.Transform -t 
-s:/var/www/html/sandbox/eebo-tcp/xml/A0/A06567.xml 
-xsl:/var/www/html/sandbox/eebo-tcp/style/html5/html5.xsl > 
/var/www/html/tmp/eebo.html
[4] output - http://dh.crc.nd.edu/tmp/eebo.html

—
ELM


Re: [CODE4LIB] eebo

2015-06-05 Thread Eric Lease Morgan
On Jun 5, 2015, at 8:20 AM, Ethan Gruber  wrote:

>> Does anybody here have experience reading the SGML/XML files representing
>> the content of EEBO?
> 
> Are these in TEI? Back when I worked for the University of Virginia
> Library, I did a lot of clean up work and migration of Chadwyck-Healey
> stuff into TEI-P4 compliant XML (thousands of files), but unfortunately all
> of the Perl scripts to migrate old garbage SGML into XML are probably gone.
> 
> How many of these things are really worth keeping, i.e., were not digitized
> by any other organization that has freely published them online?


The data I have comes in two flavors: 1) some flavor of SGML, and 2) some 
flavor of XML which is TEI-like, but not TEI. All of the files are worth 
keeping because I get the basic bibliographic information (id, author, title, 
date, keywords/subjects), as well as transcribed text. (No images.) Given such 
data, I think I can provide interesting, cool, and “kewl” services. Given the 
id number, I may then be able to link to the scanned image. Wish me luck. —ELM


[CODE4LIB] eebo

2015-06-05 Thread Eric Lease Morgan
Does anybody here have experience reading the SGML/XML files representing the 
content of EEBO? 

I’ve gotten my hands on approximately 24 GB of SGML/XML files representing the 
content of EEBO (Early English Books Online). This data does not include page 
images. Instead it includes metadata of various ilks as well as the transcribed 
full text. I desire to reverse engineer the SGML/XML in order to: 1) provide an 
alternative search/browse interface to the collection, and 2) support various 
types of text mining services. 

While I am making progress against the data, it would be nice to learn of other 
people’s experience so I do not re-invent the wheel (too many times). ‘Got 
ideas?

—
Eric Lease Morgan
University Of Notre Dame


[CODE4LIB] hathitrust research center user group meeting [rescheduled]

2015-06-04 Thread Eric Lease Morgan
The HathiTrust Research Center User’s Group Meeting (conference call) has been 
rescheduled for next Thursday, June 11:

   Who - anybody and everybody
  What - a discussion of all things HathiTrust Research Center
  When - Thursday, June 11 from 3-4:00 Eastern Time
 Where - via the telephone: (812) 856-3600 or (317) 278-7008 with PIN 803140
   Why - because both you and they have something to offer librarianship

More specifically, next Thursday's conference call is about at least two 
things: 1) your concerns regarding the Center, and 2) a discussion of my 
fledgling "Workset Browser". [1, 2] This is an opportunity for you to learn the 
why's & wherefore's of the Center, as well as influence the direction of 
programming initiatives. For example, you can learn more about their 
authorization and copyright restrictions. You can also discuss how you think 
the Center can provide support for the digital humanities and text mining. 

[1] HathiTrust Research Center - http://hathitrust.org/htrc
[2] blog posting describing the "Browser" -  
http://blogs.nd.edu/emorgan/2015/05/htrc-workset-browser/

—
Eric Lease Morgan
University of Notre Dame


Re: [CODE4LIB] hathitrust research center workset browser [github]

2015-06-02 Thread Eric Lease Morgan
I believe I have created a repository of my HTRC Workset Browser code (shell 
and Python scripts) on GitHub. [1] From the Quick Start section of the README:

  1. Download the software putting the bin and etc directories in the same 
directory.
  2. Change to the directory where the bin and etc directories have been saved.
  3. Build a collection by issuing the following command:

   ./bin/build-corpus.sh thoreau etc/rsync-thoreau.sh

  If all goes well, the Browser will create a new directory named thoreau,
  rsync a bunch o' JSON files from the HathiTrust to your computer, index
  the JSON files, do some textual analysis against the corpus, create a
  simple database ("catalog"), and create a few more reports. You can then
  peruse the files in the newly created thoreau directory. If this worked,
  then repeat the process for the other rsync files found in the etc
  directory.

Probably the first issue people will have is the path to their version of 
Python. (Sigh.)

[1] repository - https://github.com/ericleasemorgan/HTRC-Workset-Browser

—
Eric “Git Ignorant” Morgan


[CODE4LIB] hathitrust research center user group meeting

2015-06-01 Thread Eric Lease Morgan
Consider participating in Thursday's HathiTrust Research Center User Group 
Meeting:

   Who - anybody and everybody
  What - a discussion of all things HathiTrust Research Center
  When - this Thursday, June 4 from 3-4:00 Eastern Time
 Where - via the telephone: (812) 856-3600 or (317) 278-7008 with PIN 803140
   Why - because both you and they have something to offer librarianship

More specifically, Thursday's conference call is about at least two things: 1) 
your concerns regarding the Center, and 2) a discussion of my fledgling 
"Workset Browser". [1, 2] This is an opportunity for you to learn the why's & 
wherefore's of the Center, as well as influence the direction of programming 
initiatives. For example, you can learn more about their authorization and 
copyright restrictions. You can also discuss how you think the Center can 
provide support for the digital humanities and text mining. 

[1] HathiTrust Research Center - http://hathitrust.org/htrc
[2] blog posting describing the "Browser" - http://ntrda.me/1FUGP2g

—
Eric Lease Morgan
University of Notre Dame


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Eric Lease Morgan
On Jun 1, 2015, at 10:58 AM, davesgonechina  wrote:

> They just informed me I need a .edu address. Having trouble understanding
> the use of the term "public domain" here.

  Gung fhpx, naq fbhaqf ernyyl fbeg bs fghcvq!! --RYZ


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Eric Lease Morgan
On Jun 1, 2015, at 4:33 AM, davesgonechina  wrote:

> If your *institutional* email address is not on their whitelist (not sure
> if it is limited to subscribing ones, they don't say) you cannot register
> using the signup form, instead you can only request an account by briefly
> explaining why you want one. Weird, because they'd have potentially learned
> more about me if they just let me put my gmail address in the signup form.
> 
> I don't get it - can all users download public domain content? If they give
> me an account, will I be indistinguishable from a subscribing institution?
> If not, why the extra hoops?


Dave, you are the second person to bring this “white listing” issue to my 
attention. Bummer! Yes, apparently, unless your email address is part of a 
wider something-or-another, you need to be authorized to use the Research 
Center. Weird! In my opinion, while the Research Center’s tools work, I believe 
the site suffers from usability issues.

In any event, I have enhanced the auto-generated reports created by my 
“Browser”, and while they are very textual, I also believe they are insightful. 
For example, the complete works of:

  * William Ellery Channing - http://bit.ly/browser-channing-about
  * Jane Austen - http://bit.ly/browser-austen-about
  * Ralph Waldo Emerson - http://bit.ly/browser-emerson-about
  * Henry David Thoreau - http://bit.ly/browser-thoreau-about

—
Eric “Beginning To Suffer From ‘Creeping Featuritis’” Morgan


Re: [CODE4LIB] hathitrust research center workset browser

2015-05-28 Thread Eric Lease Morgan
On May 27, 2015, at 6:33 PM, Karen Coyle  wrote:

>> In my copious spare time I have hacked together a thing I’m calling the 
>> HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
>> “distant reading” against corpora from the HathiTrust. [0, 1] ...
>> 
>> 'Want to give it a try? For a limited period of time, go to the HathiTrust 
>> Research Center Portal, create (refine or identify) a collection of personal 
>> interest, use the Algorithms tool to export the collection's rsync file, and 
>> send the file to me. I will feed the rsync file to the Browser, and then 
>> send you the URL pointing to the results.
>> 
>> [0] introduction in a blog posting - http://ntrda.me/1FUGP2g
>> [1] HTRC Workset Browser - http://bit.ly/workset-browser
> 
> Eric, what happens if you access this from a non-HT institution? When I go to 
> HT I am often unable to download public domain titles because they aren't 
> available to members of the general public.


The short answer is, “Nothing”.

The long answer is… longer. The HathiTrust proper is accessible to anybody, but 
the downloading of public domain content is only available to subscribing 
institutions.

On the other hand, the “Workset Browser” is designed to work off the HathiTrust 
Research Center Portal, not the HathiTrust proper. The Portal is located at 
http://sharc.hathitrust.org. From there anybody can search the collection of 
public domain content, create collections, and apply various algorithms against 
collections. One of the algorithms is “create RSYNC file” which, in turn, 
allows you to download bunches o’ metadata describing the items in your 
collection. (There is also a “download as MARC” algorithm.) This rsync file is 
the root of the Workset Browser. Feed the Browser a rsync file, and the Browser 
will mirror content locally, index it, and generate reports describing the 
collection. 

Thank you for asking. Many people do not know there is a HathiTrust Research 
Center.

—
Eric Morgan


Re: [CODE4LIB] hathitrust research center workset browser [call for worksets]

2015-05-27 Thread Eric Lease Morgan
On May 26, 2015, at 11:30 AM, Eric Lease Morgan  wrote:

> In my copious spare time I have hacked together a thing I’m calling the 
> HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
> “distant reading” against corpora from the HathiTrust. [0]
> 
>   [0] introductory Workset Browser blog posting - http://ntrda.me/1FUGP2g


Help me put the my fledgling Browser through some paces; this is a call for 
HathiTrust Research Center worksets.

For a limited period of time, go to the HathiTrust Research Center Portal, 
create (refine or identify) a collection of personal interest, use the 
Algorithms tool to export the collection's rsync file, and send the file to me. 
[1] I will feed the rsync file to the Browser, and then send you the URL 
pointing to the results. Let’s see what happens?

[1] HathiTrust Research Center Portal - https://sharc.hathitrust.org

—
Eric Morgan


[CODE4LIB] hathitrust research center workset browser

2015-05-26 Thread Eric Lease Morgan
In my copious spare time I have hacked together a thing I’m calling the 
HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
“distant reading” against corpora from the HathiTrust. [1]

The idea is to: 1) create, refine, or identify a HathiTrust Research Center 
workset of interest — your corpus, 2) feed the workset’s rsync file to the 
Browser, 3) have the Browser download, index, and analyze the corpus, and 4) 
enable the reader to search, browse, and interact with the result of the 
analysis. With varying success, I have done this with a number of worksets 
on topics ranging from literature and philosophy to Rome and cookery. The best 
working examples are the ones from Thoreau and Austen. [2, 3] The others are 
still buggy.

As a further example, the Browser can/will create reports describing the corpus 
as a whole. This analysis includes the size of a corpus measured in pages as 
well as words, date ranges, word frequencies, and selected items of interest 
based on pre-set “themes” — usage of color words, names of “great” authors, and 
a set of timeless ideas. [4] This report is based on more fundamental reports 
such as frequency tables, a “catalog”, and lists of unique words. [5, 6, 7, 8] 
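
For the curious, the theme "coefficients" amount to little more than counting; a
minimal sketch, assuming a plain-text rendition of an item (walden.txt is only
an example) and a one-word-per-line theme file like the theme-colors.txt
mentioned elsewhere in these postings:

  # a hedged sketch; file names are examples, not part of the Browser proper
  def coefficient(text, theme):
      # theme-word occurrences per one hundred words
      words = text.lower().split()
      hits  = sum(words.count(word) for word in theme)
      return 100.0 * hits / len(words) if words else 0.0

  colors = [line.strip() for line in open('etc/theme-colors.txt') if line.strip()]
  walden = open('walden.txt').read()

  print(round(coefficient(walden, colors), 3))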

The whole thing is written in a combination of shell and Python scripts. It 
should run on just about any out-of-the-box Linux or Macintosh computer. Take a 
look at the code. [9] No special libraries needed. (“Famous last words.”) In 
its current state, it is very Unix-y. Everything is done from the command line. 
Lot’s of plain text files and the exploitation of STDIN and STDOUT. Like a 
Renaissance cartoon, the Browser, in its current state, is only a sketch. Only 
later will a more full-bodied, Web-based interface be created. 

The next steps are numerous and listed in no priority order: putting the whole 
thing on GitHub, outputting the reports in generic formats so other things can 
easily read them, improving the terminal-based search interface, implementing a 
Web-based search interface, writing advanced programs in R that chart and graph 
analysis, providing a means for comparing & contrasting two or more items from a 
corpus, indexing the corpus with a (real) indexer such as Solr, writing a 
“cookbook” describing how to use the browser to do “kewl” things, making the 
metadata of corpora available as Linked Data, etc.

'Want to give it a try? For a limited period of time, go to the HathiTrust 
Research Center Portal, create (refine or identify) a collection of personal 
interest, use the Algorithms tool to export the collection's rsync file, and 
send the file to me. I will feed the rsync file to the Browser, and then send 
you the URL pointing to the results. [10] Let’s see what happens.

Fun with public domain content, text mining, and the definition of 
librarianship.

Links

   [1] HTRC Workset Browser - http://bit.ly/workset-browser
   [2] Thoreau - http://bit.ly/browser-thoreau
   [3] Austen - http://bit.ly/browser-austen
   [4] Thoreau report - http://ntrda.me/1LD3xds
   [5] Thoreau dictionary (frequency list) - http://bit.ly/thoreau-dictionary
   [6] usage of color words in Thoreau — http://bit.ly/thoreau-colors
   [7] unique words in the corpus - http://bit.ly/thoreau-unique
   [8] Thoreau “catalog” — http://bit.ly/thoreau-catalog
   [9] source code - http://ntrda.me/1Q8pPoI
  [10] HathiTrust Research Center - https://sharc.hathitrust.org

— 
Eric Lease Morgan, Librarian
University of Notre Dame


Re: [CODE4LIB] is python s l o o o w ? [resolved]

2015-05-18 Thread Eric Lease Morgan
On May 18, 2015, at 9:23 PM, Galen Charlton  wrote:

>> I have two scripts, attached. They do EXACTLY the same thing
>> in almost EXACTLY the same manner, but the Python script is
>> almost 25 times slower than the Perl script:
> 
> I'm no Python expert, but I think that the difference is much more
> likely due to which JSON processor is being used.  I suspect your Perl
> environment has the JSON::XS module, which is written in C, is fast,
> and is automatically invoked (if present) by "use JSON;".
> 
> In contrast, I believe that the Python "json" library is written in
> Python itself.  I tried swapping in cjson and UltraJSON [1] in place
> of "json" in your Python script, and in both cases it ran rather
> faster.
> 
> [1] https://github.com/esnme/ultrajson


Thank you. After using the Python module ujson instead of json, the speed of my 
two scripts is now all but equal. Whew! —Eric
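
In concrete terms, the fix amounts to a one-line swap; a minimal sketch,
assuming ujson has been installed (pip install ujson):

  # a hedged sketch; ujson exposes the same load()/loads() calls used below
  try:
      import ujson as json   # C implementation, much faster
  except ImportError:
      import json            # fall back to the standard library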


[CODE4LIB] is python s l o o o w ?

2015-05-18 Thread Eric Lease Morgan

Is it just me, or is Python  s l o o o w  when compared to Perl?

I have two scripts, attached. They do EXACTLY the same thing in almost EXACTLY 
the same manner, but the Python script is almost 25 times slower than the Perl 
script:

  $ time bin/json2catalog.py sample/ > sample.db 2>/dev/null
  real 0m10.344s
  user 0m10.281s
  sys 0m0.059s

  $ time bin/json2catalog.pl sample/ > sample.db 2>/dev/null
  real 0m0.364s
  user 0m0.314s
  sys 0m0.048s

When I started learning Python, and specifically learning Python’s Natural 
Language Toolkit (NLTK), I thought this slowness was due to the large NLTK 
library, but now I’m not so sure. Is it just me, or is Python really  s l o o o 
w ? Is there anything I can do to improve/optimize my Python code?

—
Eric Lease Morgan

#!/usr/bin/env python2

# json2catalog.py - create a "catalog" from a set of HathiTrust json files

# Eric Lease Morgan 
# May 18, 2015 - first cut; see https://sharc.hathitrust.org/features


# configure
HEADER   = "id\ttitle\tpublication date\tpage count\tHathiTrust URL\tlanguage\tMARC (JSON) URL\tWorldCat URL"
WORLDCAT = 'http://worldcat.org/oclc/'

# require
import glob
import json
import sys
import os

# sanity check
if len( sys.argv ) != 2 :
	print "Usage:", sys.argv[ 0 ], ''
	quit()

# get input
directory = sys.argv[ 1 ]

# initialize
print( HEADER )

# process each json file in the given directory
for filename in glob.glob( directory + '*.json' ):

	# open and read the file
	with open( filename ) as data: metadata = json.load( data )
		
	# parse
	id   = metadata[ 'id' ]
	title= metadata[ 'metadata' ]['title' ]
	date_created = metadata[ 'metadata' ][ 'dateCreated' ]
	page_count   = metadata[ 'features' ][ 'pageCount' ]
	handle   = metadata[ 'metadata' ][ 'handleUrl' ]
	language = metadata[ 'metadata' ][ 'language' ]
	marc = metadata[ 'metadata' ][ 'htBibUrl' ]
	worldcat = WORLDCAT + metadata[ 'metadata' ][ 'oclc' ]

	# create a list and print it
	metadata = [ id, title, date_created, page_count, handle, language, marc, worldcat ]
	print( '\t'.join( map( str, metadata ) ) )
	
# done
quit()

#!/usr/bin/perl

# json2catalog.pl - create a "catalog" from a set of HathiTrust json files

# Eric Lease Morgan 
# May 15, 2015 - first cut; see https://sharc.hathitrust.org/features


# configure
use constant DEBUG=> 0;
use constant WORLDCAT => 'http://worldcat.org/oclc/';
use constant HEADER   => "id\ttitle\tpublication date\tpage count\tHathiTrust URL\tlanguage\tMARC (JSON) URL\tWorldCat URL\n";

# require
use Data::Dumper;
use JSON;
use strict;

# get input; sanity check
my $directory = $ARGV[ 0 ];
if ( ! $directory ) {

	print "Usage: $0 \n";
	exit;
	
}

# initialize
$| = 1;
binmode( STDOUT, ':utf8' );
print HEADER;

# process each file in the given directory
opendir DIRECTORY, $directory or die "Error in opening $directory: $!\n";
while ( my $filename = readdir( DIRECTORY ) ) {

	# only .json files
	next if ( $filename !~ /json$/ );

	# convert the json file to a hash
	my $json = decode_json &slurp( "$directory$filename" );
	if ( DEBUG ) { print Dumper( $json ) }

	# parse
	my $id= $$json{ 'id' };
	my $title = $$json{ 'metadata' }{ 'title' };
	my $date  = $$json{ 'metadata' }{ 'pubDate' };
	my $pagecount = $$json{ 'features' }{ 'pageCount' };
	my $handle= $$json{ 'metadata' }{ 'handleUrl' };
	my $language  = $$json{ 'metadata' }{ 'language' };
	my $marc  = $$json{ 'metadata' }{ 'htBibUrl' };
	my $worldcat  = WORLDCAT . $$json{ 'metadata' }{ 'oclc' };

	# dump
	print "$id\t$title\t$date\t$pagecount\t$handle\t$language\t$marc\t$worldcat\n";
	 
}

# clean up and done
closedir(DIRECTORY);
exit;


# read and return the contents of a file
sub slurp {
 
	my $f = shift;
	open ( F, $f ) or die "Can't open $f: $!\n";
	my $r = do { local $/; <F> };
	close F;
	return $r;
 
}



Re: [CODE4LIB] Protagonists

2015-04-14 Thread Eric Lease Morgan
If a person could denote the characteristics of both the main (female) character 
and the protagonist, then bits of natural language processing (text 
mining) might be able to address this problem. —Eric “When You Have A Hammer, 
Everything Begins To Look Like a Nail” Morgan


[CODE4LIB] 3,082

2015-03-04 Thread Eric Lease Morgan
  Code4Lib is now 3,082 subscribers strong. Yeah! Almost time to do some 
analysis. —ELM


Re: [CODE4LIB] linked data question

2015-02-26 Thread Eric Lease Morgan
On Feb 26, 2015, at 9:48 AM, Owen Stephens  wrote:

> I highly recommend Chapter 6 of the Linked Data book which details different 
> design approaches for Linked Data applications - sections 6.3  
> (http://linkeddatabook.com/editions/1.0/#htoc84) summarises the approaches as:
> 
>   1. Crawling Pattern
>   2. On-the-fly dereferencing pattern
>   3. Query federation pattern
> 
> Generally my view would be that (1) and (2) are viable approaches for 
> different applications, but that (3) is generally a bad idea (having been 
> through federated search before!)


And at the risk of sounding like a broken record, owen++ because the "Linked 
Data book” is a REALLY good read!! [0] While it is computer science-y, it is 
also authoritative, easy-to-read, full of examples, and just plain makes a 
whole lot of sense. 

[0] linked data book - http://linkeddatabook.com/

—
Eric M.


Re: [CODE4LIB] linked data question

2015-02-26 Thread Eric Lease Morgan
On Feb 25, 2015, at 3:12 PM, Sarah Weissman  wrote:

> I am kind of new to this linked data thing, but it seems like the real
> power of it is not full-text search, but linking through the use of shared
> vocabularies. So if you have data about Jane Austen in your database and
> you are using the same URI as other databases to represent Jane Austen in
> your data (say http://dbpedia.org/resource/Jane_Austen), then you (or
> rather, your software) can do an exact search on that URI in remote
> resources vs. a fuzzy text search. In other words, linked data is really
> supposed to be linked by machines and discoverable through URIs. If you
> visit the URL: http://dbpedia.org/page/Jane_Austen you can see a
> human-interpretable representation of the data a SPARQL endpoint would
> return for a query for triples {http://dbpedia.org/page/Jane_Austen ?p ?o}.
> This is essentially asking the database for all subject-predicate-object
> facts it contains where Jane Austen is the subject.


Again, seweissman++  The implementation of linked data is VERY much like the 
implementation of a relational database over HTTP, and in such a scenario, the 
URIs are the database keys. —ELM
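
To make the "URIs as keys" idea concrete, a minimal sketch, assuming the
third-party SPARQLWrapper library is installed; it asks DBpedia for a few of
the triples whose subject is Jane Austen's URI:

  # a hedged sketch; assumes SPARQLWrapper (pip install sparqlwrapper)
  from SPARQLWrapper import SPARQLWrapper, JSON

  endpoint = SPARQLWrapper('http://dbpedia.org/sparql')
  endpoint.setQuery('''
      SELECT ?p ?o
      WHERE { <http://dbpedia.org/resource/Jane_Austen> ?p ?o }
      LIMIT 10
  ''')
  endpoint.setReturnFormat(JSON)

  # the URI, not a text string, is the "key" being looked up
  for result in endpoint.query().convert()['results']['bindings']:
      print(result['p']['value'], result['o']['value'])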


Re: [CODE4LIB] linked data question

2015-02-26 Thread Eric Lease Morgan
On Feb 25, 2015, at 2:48 PM, Esmé Cowles  wrote:

>> In the non-techie library world, linked data is being talked about (perhaps 
>> only in listserv traffic) as if the data (bibliographic data, for instance) 
>> will reside on remote sites (as a SPARQL endpoint??? We don't know the 
>> technical implications of that), and be displayed by > centralized inter-national catalog> by calling data from that remote site. 
>> But the original question was how the data on those remote sites would be 
>>  - how can I start my search by searching for that remote 
>> content?  I assume there has to be a database implementation that visits 
>> that data and pre-indexes it for it to be searchable, and therefore the 
>> index has to be local (or global a la Google or OCLC or its 
>> bibliographic-linked-data equivalent). 
> 
> I think there are several options for how this works, and different 
> applications may take different approaches.  The most basic approach would be 
> to just include the URIs in your local system and retrieve them any time you 
> wanted to work with them.  But the performance of that would be terrible, and 
> your application would stop working if it couldn't retrieve the URIs.
> 
> So there are lots of different approaches (which could be combined):
> 
> - Retrieve the URIs the first time, and then cache them locally.
> - Download an entire data dump of the remote vocabulary and host it locally.
> - Add text fields in parallel to the URIs, so you at least have a label for 
> it.
> - Index the data in Solr, Elasticsearch, etc. and use that most of the time, 
> esp. for read-only operations.


Yes, exactly. I believe Esmé has articulated the possible solutions well. 
escowles++  —ELM
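
A minimal sketch of the first approach, retrieve once and then cache locally,
assuming the requests library; the cache here is nothing more than a directory
of files, one per dereferenced URI:

  # a hedged sketch; assumes requests, and that the remote site (DBpedia in
  # this example) serves RDF via content negotiation
  import os
  import requests

  def dereference(uri, cache='cache'):
      os.makedirs(cache, exist_ok=True)
      filename = os.path.join(cache, uri.replace('/', '_').replace(':', '_'))
      if not os.path.exists(filename):
          # first time only; afterwards the local copy is used
          response = requests.get(uri, headers={'Accept': 'text/turtle'})
          with open(filename, 'w', encoding='utf-8') as handle:
              handle.write(response.text)
      with open(filename, encoding='utf-8') as handle:
          return handle.read()

  print(dereference('http://dbpedia.org/resource/Jane_Austen')[:500])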


[CODE4LIB] koha and ebsco

2015-02-20 Thread Eric Lease Morgan
Through the grapevine I learned of the following announcement regarding Koha 
and EBSCO:

  Rome --February 11, 2015. Koha is an ILS created by librarians
  for librarians. EBSCO Information Services is a family-owned
  company dedicated to solutions that bring real improvements to
  libraries. Together, the two are working to provide a viable
  web-based, open source ILS for libraries of all types. Koha
  libraries reached out to EBSCO for support of some important
  projects, and EBSCO agreed to partner to accomplish the following
  for Koha:

* Strategic upgrade of Koha's core full-text search engine
  technology to ElasticSearch; ensuring long-term viability

* Increased functionality and accuracy of facets

* Development of a browse function (author, title, subject,
  call number)

* MARC to RDF crosswalk enhancing capability of linking to
  online data repositories (linked data)

* Greater flexibility in ingesting metadata schemes beyond MARC21

* Improved speed

* Enhanced patron functionality API access

  The financial support from EBSCO will be provided via the Koha
  Gruppo Italiano founded by the American Academy in Rome, American
  University of Rome, and the Pontificia Università della Santa
  Croce which will be assisted in this development and integration
  by key Koha contributors ByWater Solutions, Catalyst IT, and
  Cineca. In keeping with open source tradition, these enhancements
  to Koha are truly open source and will be available for others to
  use, modify, and re-distribute.

  http://librarytechnology.org/ltg-displaytext.pl?RC=20347

Last year I had the wonderful opportunity of working with the Italian-based 
Koha community, and I respect their forethought and insight. Kudos. [1]

[1] Koha Italy - https://www.facebook.com/KohaGruppoItaliano

—
Eric Lease Morgan


Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-20 Thread Eric Lease Morgan
On Feb 16, 2015, at 4:54 PM, Levy, Michael  wrote:

> I think you can accomplish what you want by using ICUFoldingFilterFactory
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
> 
> which should simply perform ICU (cf http://site.icu-project.org/) based 
> character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html)
> 
> In schema.xml I generally have in both index and query:
> 
> 
> 


For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but 
nonetheless, my interface works as expected. I was able to do this through a 
combination of things. First, I needed to tell the indexer my content was 
Spanish, and after doing so, Solr parses things correctly. Second, I needed to 
explicitly tell my Web browser that the search form and returned content were 
using UTF-8. This was done via the HTTP content-type header, the HTML meta tag, 
and even in the HTML form. Geesh! Through this whole process I also learned about 
Solr’s edismax (extended dismax) handler. Edismax supports free form queries as 
well as Boolean logic.  solr++  But also solr+- because Solr is getting more 
and more and more complicated. —Eric “Lost In Chicago” Morgan
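
For the record, the UTF-8 declarations amount to three small things; a minimal
sketch as a Perl CGI script, where the form field names and action are
hypothetical:

  #!/usr/bin/perl

  # declare UTF-8 in the HTTP header, the HTML meta element, and the form itself
  use strict;
  binmode( STDOUT, ':utf8' );
  print "Content-Type: text/html; charset=UTF-8\n\n";
  print '<html><head><meta charset="UTF-8"/></head><body>';
  print '<form method="get" action="./search.cgi" accept-charset="UTF-8">';
  print '<input type="text" name="query"/> <input type="submit" value="Search"/>';
  print '</form></body></html>';

Without all three, a query like “ejecución” can arrive at Solr in the wrong
encoding even when the index itself is fine.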


Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-16 Thread Eric Lease Morgan
I know the documents I’m indexing are written in Spanish, and adding the 
following filters to my field definition, I believe I have resolved my problem:

  [filter elements not preserved in the list archive]

In other words, my searchable content is defined thus:

  [field definition not preserved in the list archive]

And “text_general” is defined to include the filters in both the index and 
query sections:

  [fieldType definition not preserved in the list archive; a hedged
  reconstruction follows below]

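For what it is worth, here is a sketch of what such a configuration might look
like, assuming the two added filters were ASCII folding plus a light Spanish
stemmer and that the field is named “text”; the exact filters and attribute
values in the original message are unknown:

  <field name="text" type="text_general" indexed="true" stored="true"/>

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- the two added filters (an assumption) -->
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.SpanishLightStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.SpanishLightStemFilterFactory"/>
    </analyzer>
  </fieldType>

ASCIIFoldingFilterFactory is what lets a search for “ejecucion” match
“ejecución”; the Spanish stemmer is optional but typical for Spanish-language
text.
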

Re: [CODE4LIB] Job Posting [assistant manager, vancouver public library]

2015-02-13 Thread Eric Lease Morgan
[The following announcement is being passed on by request. —ELM]


Assistant Manager – Websites and Online Engagement Digital Services

Vancouver Public Library (VPL) is seeking a dynamic, strategic, and creative 
Assistant Manager to join the Digital Services department. Reporting to the 
Manager, Digital Services, and leading a team of 4 full-time equivalent (FTE) 
staff, the successful candidate will be responsible for ensuring that VPL’s 
online presence is engaging, relevant, and responsive to patrons’ needs. 

The Vancouver Public Library website is our flagship communication channel, 
with over 5 million visits a year.  In 2015 we are planning a substantial 
refresh of the site to increase digital engagement with our services and 
collections, and ensure that we provide a seamless and positive user experience 
for our patrons. We also maintain a strong social media presence which aims to 
educate and excite patrons and showcase the full range of resources the library 
offers.


The Position

In consultation with the public, library staff, and other stakeholders, you 
will be responsible for guiding the future direction of our websites. You will 
develop metrics and evaluation tools for assessing the success of www.vpl.ca 
and informing future changes to content, design, navigation, and architecture. 
Together with your team and our Library Systems (IT) department you will 
monitor emerging technologies and trends and pilot and implement creative, 
innovative ways of delivering digital services and collections and making our 
overall web presence more dynamic, engaging, and effective. 

You will be a champion for best practices in content development, both on the 
website itself and through our social media channels. Through training, 
coaching, and development of guidelines and procedures, you will help staff 
across the VPL system understand how their contributions to our online 
engagement platforms build connections with patrons and enhance the services we 
offer. You will maintain a strong awareness of emerging engagement tools and 
identify opportunities for us to expand our presence into new channels. You 
will ensure that user experience is the central focus for all of our web 
initiatives.

Your team includes two Web Librarians, a Web Technician, and a Web Graphics 
Technician. As a part of the Digital Services leadership team, you will also 
work collaboratively within the department to support the public in their use 
of all web-based library services including electronic resources, eBooks, and 
digital collections. You will participate in ensuring that our work is focused 
on achieving VPL’s strategic priorities, both within Digital Services and 
across the library system.


Qualifications and Experience

This position requires excellent collaboration and communication skills, 
thorough knowledge of current trends and best practices in the use of 
technology in delivering web-based public library services, significant 
experience in the planning, design, development, promotion and maintenance of 
websites, and demonstrated supervisory or leadership experience. A background 
in project management, web management, and user-centred design is essential. We 
are looking for an innovative, flexible individual with a demonstrated ability 
to develop positive working relationships, lead and develop staff teams, manage 
multiple projects and competing priorities, and assist staff in participating 
in and being open to change.

Qualifications include an MLS/MLIS degree from an ALA accredited post-secondary 
institution and a minimum of 2 years of recent relevant experience, including 
project management, website management, and supervisory experience.


The Workplace

Vancouver Public Library is the third-largest library system in Canada and 
offers exceptional collections, services and technology at 21 branch libraries 
and a superb virtual library with over 5 million web visitors per year and an 
extensive collection of digital resources. If you would like to make a 
meaningful contribution to the City of Vancouver through this exciting, 
forward-looking position, we would like to hear from you.

This position is within the library’s bargaining unit, CUPE 391. The salary 
range begins at $64,482 with annual increments rising to a maximum of $76,112. 
The library offers a comprehensive benefits package including MSP, extended 
health, dental, pension, and annual vacation of 22 days for professional 
positions. 

Expressions of interest accompanied by a résumé should be submitted by 5:00 pm 
on Friday, March 6, 2015 by ONE of the following methods:

  Mail: Human Resources Department
  Vancouver Public Library
  350 West Georgia Street
  Vancouver, BC V6B 6B1
  OR Email: care...@vpl.ca

Please quote the competition # in the subject line when applying electronically 
and upload your cover letter and resume / CV as one attachment. Ensure your 
application has one of the following file extensions: .pdf

Re: [CODE4LIB] indexing word documents using solr [diacritics]

2015-02-12 Thread Eric Lease Morgan
How do I retain diacritics in a Solr index, and how to I search for words 
containing them?

I have extracted the plain text out of a set of Word documents. I have then used 
a Perl interface (WebService::Solr) to add the plain text to a Solr index using 
a field type called text_general:

  [fieldType definition not preserved in the list archive; see the sketch below]

It seems as if I am unable to search for words like ejecución because the 
diacritic gets in the way. What am I doing wrong?

— 
Eric
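
For reference, the stock text_general field type shipped with Solr at the time
looked roughly like the following; the snippet quoted in the original message
presumably matched it, give or take attribute values:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Note that nothing in this chain folds accented characters, which is consistent
with the problem described: “ejecución” and “ejecucion” are indexed as two
different tokens.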


Re: [CODE4LIB] indexing word documents using solr

2015-02-11 Thread Eric Lease Morgan
On Feb 10, 2015, at 11:46 AM, Erik Hatcher  wrote:

> bin/post -c collection_name /path/to/file.doc

The almost trivial command to index a Word document in Solr, above, is most 
certainly appealing, but I’m wondering about the underlying index’s schema.

Tika makes every effort to extract as much metadata from Word documents as 
possible. This metadata includes dates, titles, authors, names of applications, 
last edit, etc. Some of this data can be very useful. The metadata can be 
packaged up as an XML file/stream and then sent to Solr for indexing. "Tastes 
great. Less filling.” But my question is, “To what degree does Solr know what 
to do with the metadata when the (kewl) command, above, is seemingly so 
generic? Does one need to create a Solr schema to specifically accommodate the 
Tika-created metadata, or do such things also come for ‘free’?”

— 
Eric Morgan
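
For context, Solr’s ExtractingRequestHandler (“Solr Cell”) exposes parameters
for steering Tika-extracted metadata into schema fields, so a purpose-built
schema helps but is not strictly required. A sketch, in which the core name,
document id, field names, and file name are all hypothetical:

  curl 'http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=ignored_&fmap.content=text&commit=true' \
       -F 'myfile=@report.doc'

Here uprefix shunts any Tika field the schema does not recognize under an
ignorable prefix, and fmap.* renames individual Tika fields (such as content)
to existing schema fields.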


Re: [CODE4LIB] indexing word documents using solr

2015-02-10 Thread Eric Lease Morgan
On Feb 10, 2015, at 11:46 AM, Erik Hatcher  wrote:

> First, with Solr 5, it’s this easy:

  Where can I download Solr 5? None of the other versions seem to be 
complete. —ELM


[CODE4LIB] indexing word documents using solr

2015-02-10 Thread Eric Lease Morgan
Can somebody point me to a good tutorial on how to index Word documents using 
Solr?

I have a few hundred Microsoft Word documents I want to search. Through the use 
of the Tika library it seems as if I ought to be able to index my Word 
documents directly into Solr, but none of the tutorials I have found on the Web 
are complete. Missing directories. Missing files. Documentation for versions 
unreleased. Etc.

Put another way, Tika can create a (nice) XHTML file complete with some useful 
metadata that can all be fed to Solr for indexing, but I can barely get out of 
the starting gate. Have you indexed Word documents using Solr, and if so, then 
how? 

—
Eric Morgan
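
For the Tika half of the pipeline, the command-line “app” build of Tika will
emit either plain text or XHTML-plus-metadata; a sketch, with a hypothetical
file name:

  java -jar tika-app.jar --xml  report.doc > report.xhtml
  java -jar tika-app.jar --text report.doc > report.txt

The XHTML output carries the extracted metadata in its head element, which is
what would get mapped into Solr fields.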


[CODE4LIB] joy

2015-01-27 Thread Eric Lease Morgan
  It is a joy to manage this mailing list, and I say that with all sincerity. 
—Eric Morgan


Re: [CODE4LIB] circulation statistics

2015-01-15 Thread Eric Lease Morgan
  The replies received have all been very helpful. Thank you! —Eric M. 


[CODE4LIB] circulation statistics

2015-01-13 Thread Eric Lease Morgan
Does anybody here know how to extract circulation statistics from a library 
catalog? Specifically, given a date range, are you able to create a list of the 
most frequently borrowed books ordered by the number of times they’ve been 
circulated?

I have a colleague who wants to digitize sets of modern literature and then do 
text analysis against the result. In an effort to do the analysis against 
popular literature, he wants to create a list of… popular titles. Getting a 
list of such a thing from library circulation statistics sounds like a logical 
option to me. 

Does somebody here know how to do this? If you know how to do it against Ex 
Libris’s Aleph, then that is a bonus. 

—
Eric Morgan
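
No Aleph-specific recipe here, but in the abstract the report is a simple
aggregation over a loan table; a sketch against a hypothetical schema (table
and column names are invented for illustration, not Aleph’s actual Oracle
tables):

  SELECT   bib_id, COUNT(*) AS times_borrowed
  FROM     loans
  WHERE    loan_date BETWEEN DATE '2014-01-01' AND DATE '2014-12-31'
  GROUP BY bib_id
  ORDER BY times_borrowed DESC;

Join bib_id back to the bibliographic table for titles, and take as much of the
top of the list as needed.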


Re: [CODE4LIB] PBCore RDF Ontology Hackathon Wiki page

2015-01-05 Thread Eric Lease Morgan
On Jan 5, 2015, at 1:35 PM, Karen Coyle  wrote:

> 1) Everyone should read at least the first chapters of the Allemang book, 
> Semantic Web for the Working Ontologist:
> http://www.worldcat.org/title/semantic-web-for-the-working-ontologist-effective-modeling-in-rdfs-and-owl/oclc/73393667

+2 because it is a very good book


> 2) Everyone should understand the RDF meaning of classes, properties, domain 
> and range before beginning. (cf: 
> http://kcoyle.blogspot.com/2014/11/classes-in-rdf.html)

+1 for knowing the distinctions between these things, yes


> 3) Don't lean too heavily on Protege. Protege is very OWL-oriented and can 
> lead one far astray. It's easy to click on check boxes without knowing what 
> they really mean. Do as much development as you can without using Protege, 
> and do your development in RDFS not OWL. Later you can use Protege to check 
> your work, or to complete the code.

+1 but at the same time workshops are good places to see how things get done in 
a limited period of time.


> 4) Develop in ntriples or turtle but NOT rdf/xml. RDF differs from XML in 
> some fundamental ways that are not obvious, and developing in rdf/xml masks 
> these differences and often leads to the development of not very good 
> ontologies.

+1 & -1 because each of the RDF serializations has its own advantages and 
disadvantages
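
To make the serialization point concrete, here is one made-up statement, first
as ntriples/turtle:

  <http://example.org/work/1> <http://purl.org/dc/terms/title> "Moby Dick" .

and then the same statement as rdf/xml:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dcterms="http://purl.org/dc/terms/">
    <rdf:Description rdf:about="http://example.org/work/1">
      <dcterms:title>Moby Dick</dcterms:title>
    </rdf:Description>
  </rdf:RDF>

The triple is visible at a glance in the first form and buried in markup in the
second, which is the usual argument for doing one’s modeling in turtle.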


—
Eric Morgan


Re: [CODE4LIB] lita

2015-01-05 Thread Eric Lease Morgan
On Jan 5, 2015, at 11:25 AM, Sylvain Machefert  
wrote:

>> Interesting and thank you. Code4Lib only needs fifty more subscribers to 
>> equal LITA’s size. I think this just goes to show, with the advent of the 
>> Internet, centralized authorities are not as necessary/useful as they once 
>> used to be. —ELM
> 
> For a list created more than 10 years ago, can we trust the number of 
> subscribers figure ? How many dead addresses ? (not saying that number of 
> members of an association == active members, sure).


There are zero dead mailing list addresses because the LISTSERV software prunes 
such things on a daily basis. Yes, we can trust the number of subscribers, but 
that does not mean all of the subscribers actively participate in the 
community. —ELM


Re: [CODE4LIB] lita

2015-01-05 Thread Eric Lease Morgan
>> I’m curious, how large is LITA (Library and Information Technology
>> Association)? [0] How many members does it have?
> 
> Apparently it has around 3000 members this year. I found this on the ALA
> membership statistics page:
> 
> http://www.ala.org/membership/membershipstats_files/divisionstats#lita


Interesting and thank you. Code4Lib only needs fifty more subscribers to equal 
LITA’s size. I think this just goes to show, with the advent of the Internet, 
centralized authorities are not as necessary/useful as they once used to be. 
—ELM


[CODE4LIB] lita

2015-01-05 Thread Eric Lease Morgan
I’m curious, how large is LITA (Library and Information Technology 
Association)? [0] How many members does it have? 

[0] LITA - http://www.ala.org/lita/

—
ELM


Re: [CODE4LIB] NEC4L

2014-12-24 Thread Eric Lease Morgan
  It is so cool that we have “franchises”. —Eric Morgan


[CODE4LIB] linked data and open access

2014-12-19 Thread Eric Lease Morgan
I don’t know about y’all, but it seems to me that things like linked data and 
open access are larger trends in Europe than here in the United States. Is 
there a larger commitment to sharing in Europe when compared to the United 
States? If so, is this a factor based on the nonexistence of a national library 
in the United States? Is this your perception too? —Eric Morgan


Re: [CODE4LIB] Starting with Virtuoso - tutorial etc.

2014-12-17 Thread Eric Lease Morgan
On Dec 17, 2014, at 10:10 AM, Mixter,Jeff  wrote:

> If you want to test out a bare-bones triple store, I would suggest 4Store 
> (http://4store.org/). It has pre-compiled installs for Unix and Unix-like 
> systems (although not Windows). It supports SPARQL 1.1 and is relatively easy 
> to tweak/configure.


Regarding 4Store, I concur. 4Store is my SPARQL endpoint for RDF created from 
archival (EAD and MARC) materials. Regarding the “futurities” of Virtuoso, I 
also agree. —Eric
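
For the curious, querying such an endpoint is plain SPARQL; a trivial example,
where the predicate is illustrative and not necessarily what the EAD/MARC-derived
triples actually use:

  PREFIX dcterms: <http://purl.org/dc/terms/>
  SELECT ?resource ?title
  WHERE  { ?resource dcterms:title ?title }
  LIMIT  10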


Re: [CODE4LIB] Starting with Virtuoso - tutorial etc.

2014-12-17 Thread Eric Lease Morgan
On Dec 17, 2014, at 9:52 AM, Nicola Carboni  wrote:

> I am collecting some resources (beginner level) in order to start using 
> Virtuoso (OpenSource Edition) for a project I am working with. I would like 
> to use it both for hosting triples and for its sponger (CSV to RDF). I 
> sincerely never used it, but I would like to give it try. Do you have some 
> recommendations, like nice books or tutorial (even video) about it?

I have not used Virtuoso extensively, but I have compiled and installed it. It 
was a big but painless compiling process. It seems to me as if Virtuoso is the 
most feature-rich (open source) triple store available. Please consider sharing 
with the group any of your future experiences with it. —Eric Morgan


Re: [CODE4LIB] Scanned PDF to text

2014-12-09 Thread Eric Lease Morgan
On Dec 9, 2014, at 8:25 AM, Kyle Banerjee  wrote:

> I've just started a project that involves harvesting large numbers of
> scanned PDF's and extracting information from the text from the OCR output.
> The process I've started with -- use imagemagick to convert to tiff and
> tesseract to pull out the OCR -- is more system intensive than I hoped it
> would be.

I’m not quite sure if I understand the question, but if all you want to do is 
pull the text out of an OCR’ed PDF file, then I have found both Tika and 
PDFtotext to be useful tools. [1, 2] Here’s a Perl script that takes a PDF as 
input and uses Tika to output the OCR’ed text:

  #!/usr/bin/perl

  # configure
  use constant TIKA => 'java -jar tika.jar -T ';

  # require
  use strict;

  # initialize; needs sanity checking
  my $cmd = TIKA . $ARGV[ 0 ];

  # do the work; backticks capture Tika's output so only the text is printed
  print `$cmd`;

  # done
  exit;

Tika can run in a server mode making it more efficient for extracting the text 
from multiple files. 
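
A sketch of that server mode, assuming the separately distributed tika-server
jar and its default port; the file name is hypothetical:

  # start the server once
  java -jar tika-server.jar &

  # then extract text from any number of files without per-file JVM start-up
  curl -T ocr.pdf -H 'Accept: text/plain' http://localhost:9998/tika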

On the other hand, if you need to do the OCR itself, then employing Tesseract 
is probably the way to go. 

[1] Tika - http://tika.apache.org
[2] PDFtoText - http://www.foolabs.com/xpdf/download.html

—
ELM


Re: [CODE4LIB] Registration for Code4Lib 2015 in Portland Oregon is NOW OPEN! [airbnb]

2014-12-08 Thread Eric Lease Morgan
On Dec 8, 2014, at 12:57 PM, Dana Jemison  wrote:

> Looks like the recommended hotel is already filled up.  Are there any other 
> options close by?

Mine is an unsolicited comment/endorsement for AirBnB as an additional source of 
accommodations, if it does not hurt the conference planning process. [1] With 
AirBnB I believe you can get quite a nice place to stay that is larger, more 
hospitable, and less expensive than a hotel.

[1] AirBnB - http://airbnb.com

—
Eric

