Re: [CODE4LIB] Reconciling corporate names?

Jean Roth Mon, 29 Sep 2014 10:57:13 -0700

What is the link to the downloadable LCNAF data?  --  Jean

On Mon, 29 Sep 2014, Kyle Banerjee wrote:


KB> IMO, API isn't the best tool for this job. My inclination would be to just
KB> download the LCNAF data, normalize source and comparison data, and then
KB> compare via hash.
KB> 
KB> That will be easier to write, and you'll be able to do thousands of
KB> comparisons per second.
KB> 
KB> kyle
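A minimal sketch of the normalize-then-hash comparison described above, in Python. The normalization rules (case folding, stripping diacritics and punctuation, collapsing whitespace) and the sample names are assumptions for illustration only; the authorized labels would really come from whatever LCNAF download format is used.

```python
import unicodedata


def normalize(name):
    """Fold case, strip diacritics and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_marks.casefold())
    return " ".join(cleaned.split())


def build_index(authorized_labels):
    """Map each normalized label to its original authorized form (a dict is a hash table)."""
    return {normalize(label): label for label in authorized_labels}


def reconcile(local_names, index):
    """Pair each local name with its authorized label, or None when no match exists."""
    return [(name, index.get(normalize(name))) for name in local_names]


# Hypothetical sample data standing in for the LCNAF labels and the local names.
authorized = ["Library of Congress", "University of Michigan. Library"]
index = build_index(authorized)
results = reconcile(["LIBRARY OF CONGRESS.", "Univ. of Michigan"], index)
```

Each lookup is a single dictionary access, so the cost is dominated by building the index once. Names that differ by more than case, accents, or punctuation (abbreviations, for instance) will not match and would need a fuzzy pass afterwards.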
KB> 
KB> On Mon, Sep 29, 2014 at 8:24 AM, Jonathan Rochkind <[email protected]> wrote:
KB> 
KB> > For yet another data set and API that may or may not meet your needs,
KB> > consider VIAF -- Virtual International Authority File, operated by OCLC.
KB> >
KB> > VIAF's dataset includes the LC NAF as well as other national authority
KB> > files. I'm not sure whether the API can limit matches to the LC NAF,
KB> > since I haven't done much work with it, but I know it has an API.
KB> >
KB> > http://oclc.org/developer/develop/web-services/viaf.en.html
KB> >
KB> >
KB> > On 9/29/14 10:18 AM, Trail, Nate wrote:
KB> >
KB> >> The ID.loc.gov site has a good known label service described here under
KB> >> "known label retrieval" :
KB> >> http://id.loc.gov/techcenter/searching.html
KB> >>
KB> >> Use curl and content negotiation to avoid screen scraping, for example,
KB> >> for LC Name authorities:
KB> >>
KB> >> curl -L -H "Accept: application/rdf+xml" "http://id.loc.gov/authorities/names/label/Library%20of%20Congress"
KB> >>
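For scripting the same known-label request over many names, a rough Python equivalent of the curl call might look like the sketch below. The helper names `label_url` and `lookup` are hypothetical, and treating an HTTP 404 as "no authorized label" is an assumption about how the service reports a miss, so check it against the id.loc.gov documentation before relying on it.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import Request, urlopen

# Base URL taken from the curl example in the thread.
BASE = "http://id.loc.gov/authorities/names/label/"


def label_url(name):
    """Build the known-label URL, percent-encoding the name (spaces become %20)."""
    return BASE + quote(name, safe="")


def lookup(name, timeout=10):
    """Return the URL the service redirects to, or None if the label is not found.

    urlopen follows redirects by default, like `curl -L`.
    Assumption: a label with no match is reported as HTTP 404.
    """
    request = Request(label_url(name), headers={"Accept": "application/rdf+xml"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.geturl()
    except HTTPError as err:
        if err.code == 404:
            return None
        raise
```

Calling `lookup("Library of Congress")` issues one request per name, so for 40,000 names it would be worth adding a pause between calls and caching the results.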
KB> >> Nate
KB> >>
KB> >> ==========================
KB> >> Nate Trail
KB> >> LS/TECH/NDMSO
KB> >> Library of Congress
KB> >> [email protected]
KB> >>
KB> >>
KB> >> -----Original Message-----
KB> >> From: Code for Libraries [mailto:[email protected]] On Behalf Of
KB> >> Simon Brown
KB> >> Sent: Monday, September 29, 2014 9:38 AM
KB> >> To: [email protected]
KB> >> Subject: Re: [CODE4LIB] Reconciling corporate names?
KB> >>
KB> >> You could always web scrape, or download and then search the LCNAF with
KB> >> some script that looks like:
KB> >>
KB> >> # Build the query URL for web scraping (paste0 avoids the spaces paste() inserts)
KB> >> query = paste0("http://id.loc.gov/search/?q=", URLencode("corporate name here"),
KB> >>                "&q=cs%3Ahttp%3A%2F%2Fid.loc.gov%2Fauthorities%2Fnames")
KB> >>
KB> >> # Make the call
KB> >> result = readLines(query)
KB> >>
KB> >> # Find the lines containing "Corporate Name"
KB> >> lines = grep("Corporate Name", result)
KB> >>
KB> >> # Alternatively, use approximate string matching on the downloaded LCNAF data
KB> >> query <- agrep("corporate name here", LCNAF_data_here)
KB> >>
KB> >> #Parse for whatever info you want
KB> >> ...
KB> >>
KB> >> My native programming language is R so I hope the functions like paste,
KB> >> readLines, grep, and URLencode are generic enough for other languages to
KB> >> have some kind of similar thing.  This can just be wrapped up into a for
KB> >> loop:
KB> >> for(i in 1:40000){...}
KB> >>
KB> >> Web scraping the results of one name at a time would be SLOW and
KB> >> obviously using an API is the way to go but it didn't look like the OCLC
KB> >> LCNAF API handled Corporate Name.  However, it sounds like in the previous
KB> >> message someone found a work around.  Best of luck! -Simon
KB> >>
KB> >>
KB> >>
KB> >>
KB> >>
KB> >>
KB> >> On Mon, Sep 29, 2014 at 8:45 AM, Matt Carruthers <[email protected]>
KB> >> wrote:
KB> >>
KB> >>  Hi Patrick,
KB> >>>
KB> >>> Over the last few weeks I've been doing something very similar.  I was
KB> >>> able to figure out a process that works using OpenRefine.  It works by
KB> >>> searching the VIAF API first, limiting results to anything that is a
KB> >>> corporate name and has an LC source authority.  OpenRefine then
KB> >>> extracts the LCCN and puts that through the LCNAF API that OCLC has to
KB> >>> get the name.  I had to use VIAF for the initial name search because
KB> >>> for some reason the LCNAF API doesn't really handle corporate names as
KB> >>> search terms very well, but works with the LCCN just fine (there is
KB> >>> the possibility that I'm just doing something wrong, and if that's the
KB> >>> case, anyone on the list can feel free to correct me).  In the end,
KB> >>> you get the LC name authority that corresponds to your search term and
KB> >>> a link to the authority on the LC Authorities website.
KB> >>>
KB> >>> Anyway, the process is fairly simple to run (just prepare an Excel
KB> >>> spreadsheet and paste JSON commands into OpenRefine).  The only
KB> >>> reservation is that I don't think it will run all 40,000 of your names
KB> >>> at once.  I've been using it to run 300-400 names at a time.  That
KB> >>> said, I'd be happy to share what I did with you if you'd like to try
KB> >>> it out.  I have some instructions written up in a Word doc, and the
KB> >>> JSON script is in a text file, so just email me off list and I can send
KB> >>> them to you.
KB> >>>
KB> >>> Matt
KB> >>>
KB> >>> Matt Carruthers
KB> >>> Metadata Projects Librarian
KB> >>> University of Michigan
KB> >>> 734-615-5047
KB> >>> [email protected]
KB> >>>
KB> >>> On Fri, Sep 26, 2014 at 7:03 PM, Karen Hanson
KB> >>> <[email protected]>
KB> >>> wrote:
KB> >>>
KB> >>>> I found the WorldCat Identities API useful for an institution name
KB> >>>> disambiguation project that I worked on a few years ago, though my
KB> >>>> goal wasn't to confirm whether names mapped to LCNAF.  The API
KB> >>>> response includes an LCCN, and you can set it to fuzzy or exact
KB> >>>> matching, but you would need to write a script to pass each term in
KB> >>>> and process the results:
KB> >>>>
KB> >>>> http://oclc.org/developer/develop/web-services/worldcat-identities.en.html
KB> >>>>
KB> >>>> I also can't speak to whether all LC Name Authorities are
KB> >>>> represented, so there may be a chance of some false negatives.
KB> >>>>
KB> >>>> OCLC has another API, but not sure if it covers corporate names:
KB> >>>> https://platform.worldcat.org/api-explorer/LCNAF
KB> >>>>
KB> >>>> I suspect there are others on the list that know more about the
KB> >>>> inner workings of these APIs if this might be an option for you...
KB> >>>> :)
KB> >>>>
KB> >>>> Karen
KB> >>>>
KB> >>>> -----Original Message-----
KB> >>>> From: Code for Libraries [mailto:[email protected]] On Behalf
KB> >>>> Of Ethan Gruber
KB> >>>> Sent: Friday, September 26, 2014 3:54 PM
KB> >>>> To: [email protected]
KB> >>>> Subject: Re: [CODE4LIB] Reconciling corporate names?
KB> >>>>
KB> >>>> I would check with the developers of SNAC (
KB> >>>> http://socialarchive.iath.virginia.edu/), as they've spent a lot of
KB> >>>> time developing named entity recognition scripts for personal and
KB> >>>> corporate names. They might have something you can reuse.
KB> >>>>
KB> >>>> Ethan
KB> >>>>
KB> >>>> On Fri, Sep 26, 2014 at 3:47 PM, Galligan, Patrick <[email protected]>
KB> >>>> wrote:
KB> >>>>
KB> >>>>  I'm looking to reconcile about 40,000 corporate names against
KB> >>>>> LCNAF to see whether they are authorized strings or not, but I'm
KB> >>>>> drawing a blank about how to get it done.
KB> >>>>>
KB> >>>>> I've used http://freeyourmetadata.org/ for reconciling subject
KB> >>>>> headings before, but I can't get it to work for LCNAF. Has anyone
KB> >>>>> had any experience in a project like this? I'd love to hear some
KB> >>>>> ideas for automatically dealing with a large data set like this
KB> >>>>> that we did not create and do not know how the names were created.
KB> >>>>>
KB> >>>>> Thanks!
KB> >>>>>
KB> >>>>> -Patrick Galligan
KB> >>>>>
KB> >>>>>
KB> >>>>
KB> >>>
KB> >>
KB> >>
KB> >> --
KB> >> Simon Brown
KB> >> [email protected]
KB> >> simoncharlesbrown (Skype)
KB> >> 831.440.7466 (Phone)
KB> >>
KB> >> *Following our will and wind we may just go where no one's been -- MJK*
KB> >>
KB> >>
KB> >>
KB>
