[CODE4LIB] Tool for Named-Entity Recognition

2013-02-25 Thread Seth van Hooland
Dear colleagues,

Do you want to automate the discovery of people, place names and events within 
a large corpus of unstructured documents or metadata (e.g. a description 
field)? Then you might want to use the Named-Entity Recognition (NER) extension 
for OpenRefine that has been developed by Multimedia Lab (ELIS - Ghent 
University / iMinds) and MaSTIC (Université Libre de Bruxelles).

At http://freeyourmetadata.org/named-entity-extraction/ you will find all the 
information necessary to start experimenting with NER on your own. The 
extension was developed specifically in the context of a research paper 
entitled "Named-Entity Recognition: A Gateway Drug for Cultural Heritage 
Collections to the Linked Data Cloud?". A preprint of this paper can be found 
at http://freeyourmetadata.org/publications/named-entity-recognition.pdf. The 
paper also aims to foster a discussion within the Digital Library community 
regarding the quality of concepts described in knowledge bases (e.g. Freebase 
versus DBpedia) and the current struggle between schemes (e.g. schema.org 
versus Open Graph protocol).

We will be presenting our work in North and Latin America in March (Boston), 
April (New York and Philadelphia), May (Quito) and June (New York and 
Montreal), so if you are located in one of those cities or areas and are 
interested in collaborating or hosting a workshop on this topic, don't 
hesitate to get in touch.

Kind regards, 

Seth van Hooland
Président du Master en Sciences et Technologies de l'Information et de la 
Communication (MaSTIC)
Université Libre de Bruxelles
Av. F.D. Roosevelt, 50 CP 123  | 1050 Bruxelles
http://homepages.ulb.ac.be/~svhoolan/
http://twitter.com/#!/sethvanhooland
http://mastic.ulb.ac.be
0032 2 650 4765
Office: DC11.102


Re: [CODE4LIB] Tool for Named-Entity Recognition

2013-02-25 Thread Eric Lease Morgan
On Feb 25, 2013, at 8:12 AM, Seth van Hooland svhoo...@ulb.ac.be wrote:

> You want to automate the discovery of people, place names and events within a 
> large corpus of unstructured documents or metadata (e.g. description field)? 
> Then you might want to use the Named-Entity Recognition (NER) extension for 
> OpenRefine that has been developed by Multimedia Lab (ELIS - Ghent University 
> / iMinds) and MaSTIC (Université Libre de Bruxelles).


Yes, named-entity recognition (NER) is fun. 

About a year ago I used a different application to do NER against about 100 
digitized files. From my blog posting [0]:

  named-entity extraction – There was a desire to list the
  underlying names, places, and organizations from each text. These
  things can put a text into a context for the reader. Are there a
  lot of Irish names? Is there a preponderance of place names from
  the United States? To accomplish this task and assist in
  answering these sorts of questions, a Perl script was written
  around the Stanford Named Entity Recognizer. [1] This script
  (txt2ner.pl [2]) extracts the entities, looks them up in DBpedia, and
  saves metadata (abstracts, URLs to images, as well as latitudes and
  longitudes) describing the entities to a locally defined XML file
  for later processing. (See an example. [3]) A CGI script (ner.cgi [4])
  was then written to provide a reader-interface to these files.
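
The extract, look up, and save steps described in that posting can be sketched roughly as follows. This is a minimal illustration in Python, not the txt2ner.pl script itself. It assumes the tagger output has already been parsed into (token, label) pairs with "O" marking non-entity tokens; the sample tokens, the function names, and the XML element names are all hypothetical. The DBpedia step only builds a candidate resource URI and does not check that the resource exists or fetch abstracts, images, or coordinates.

```python
import xml.etree.ElementTree as ET
from itertools import groupby
from urllib.parse import quote

# Hypothetical tagger output: (token, label) pairs, where "O" marks
# tokens that are not part of any entity.
TAGGED = [
    ("Father", "O"), ("Ferri", "PERSON"), ("left", "O"),
    ("New", "LOCATION"), ("Mexico", "LOCATION"), ("for", "O"),
    ("Notre", "ORGANIZATION"), ("Dame", "ORGANIZATION"), (".", "O"),
]


def group_entities(tagged):
    """Merge runs of adjacent tokens that share a non-"O" label."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != "O":
            entities.append((" ".join(token for token, _ in run), label))
    return entities


def candidate_dbpedia_uri(name):
    """Build a candidate DBpedia resource URI (it may not resolve)."""
    return "http://dbpedia.org/resource/" + quote(name.replace(" ", "_"))


def entities_to_xml(entities):
    """Serialize entities to a small, locally defined XML layout."""
    root = ET.Element("entities")
    for name, label in entities:
        node = ET.SubElement(root, "entity", type=label)
        ET.SubElement(node, "name").text = name
        ET.SubElement(node, "uri").text = candidate_dbpedia_uri(name)
    return ET.tostring(root, encoding="unicode")


entities = group_entities(TAGGED)
xml_text = entities_to_xml(entities)
print(xml_text)
```

One known limitation of this grouping: two distinct entities of the same type that sit next to each other, with no other token between them, are merged into one.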

Once I NER'ed the files and saved the corresponding linked data, I was able 
to create a tablet-based interface that lets the reader not only see how the 
words are used in context, but also read a blurb from Wikipedia and map places 
via Google Maps. For an example, see some extracts from a book called An 
Adventure with the Apaches [5], although the data is not as clean as I would 
hope. The whole thing was part of a project we called the Catholic Youth 
Literature Project. [6]
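
Showing how a word is used in context is essentially a keyword-in-context (concordance) listing. The following is a minimal sketch of that idea in Python, not the code behind the interface described above. The function name, the `width` parameter, and the sample sentence are hypothetical.

```python
import re


def keyword_in_context(text, keyword, width=30):
    """Return each occurrence of keyword with up to `width` characters
    of surrounding text on either side (case-insensitive match)."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Flatten line breaks so each snippet prints on one line.
        snippets.append(text[start:end].replace("\n", " "))
    return snippets


# Hypothetical sample text, for illustration only.
SAMPLE = ("The Apaches rode north. Later the scouts saw "
          "the Apaches again near the river.")

hits = keyword_in_context(SAMPLE, "Apaches", width=10)
for hit in hits:
    print(hit)
```

Feeding the entity names found by the recognizer into such a function, one at a time, would give a simple per-entity concordance for each text.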

The ELIS software looks pretty interesting. [7]

Fun with distant reading and NER.


[0] blog posting - http://blogs.nd.edu/emorgan/2012/03/cyl/
[1] Stanford NER - http://nlp.stanford.edu/software/CRF-NER.shtml
[2] txt2ner.pl - http://dh.crc.nd.edu/sandbox/cyl/bin/txt2ner.pl
[3] intermediate XML file - 
http://dh.crc.nd.edu/sandbox/cyl/corpus/advicetoirishgir00cusa.ner
[4] CGI script - http://dh.crc.nd.edu/sandbox/cyl/bin/ner-cgi.pl
[5] Adventure - 
http://dh.crc.nd.edu/sandbox/cyl/catalog/details/adventurewithapa00ferriala.html
[6] Catholic Youth Literature - http://dh.crc.nd.edu/sandbox/cyl/catalog/
[7] ELIS - http://freeyourmetadata.org/named-entity-extraction/

--
Eric Lease Morgan
University of Notre Dame

574/631-8604