What is the link to the downloadable LCNAF data?

-- Jean

On Mon, 29 Sep 2014, Kyle Banerjee wrote:
KB> IMO, API isn't the best tool for this job. My inclination would be to just
KB> download the LCNAF data, normalize source and comparison data, and then
KB> compare via hash.
KB>
KB> That will be easier to write, and you'll be able to do thousands of
KB> comparisons per second.
KB>
KB> kyle
KB>
KB> On Mon, Sep 29, 2014 at 8:24 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
KB>
KB> > For yet another data set and API that may or may not meet your needs,
KB> > consider VIAF -- Virtual International Authority File, operated by OCLC.
KB> >
KB> > The VIAF's dataset includes the LC NAF as well as other national authority
KB> > files, I'm not sure if the API is suitable to limiting matches to the LC
KB> > NAF, I haven't done much work with it, but I know it has an API.
KB> >
KB> > http://oclc.org/developer/develop/web-services/viaf.en.html
KB> >
KB> >
KB> > On 9/29/14 10:18 AM, Trail, Nate wrote:
KB> >
KB> >> The ID.loc.gov site has a good known label service described here under
KB> >> "known label retrieval" :
KB> >> http://id.loc.gov/techcenter/searching.html
KB> >>
KB> >> Use Curl and content negotiation to avoid screen scraping, for example,
KB> >> for LC Name authorities:
KB> >>
KB> >> curl -L -H "Accept: application/rdf+xml" "http://id.loc.gov/authorities/names/label/Library%20of%20Congress"
KB> >>
KB> >> Nate
KB> >>
KB> >> ==========================
KB> >> Nate Trail
KB> >> LS/TECH/NDMSO
KB> >> Library of Congress
KB> >> n...@loc.gov
KB> >>
KB> >>
KB> >> -----Original Message-----
KB> >> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
KB> >> Simon Brown
KB> >> Sent: Monday, September 29, 2014 9:38 AM
KB> >> To: CODE4LIB@LISTSERV.ND.EDU
KB> >> Subject: Re: [CODE4LIB] Reconciling corporate names?
KB> >>
KB> >> You could always web scrape, or download and then search the LCNAF with
KB> >> some script that looks like:
KB> >>
KB> >> #Build query for webscraping (sep = "" so paste does not insert
KB> >> #spaces into the URL; reserved = TRUE so "&" etc. in a name are escaped)
KB> >> query = paste("http://id.loc.gov/search/?q=",
KB> >>               URLencode("corporate name here", reserved = TRUE),
KB> >>               "&q=cs%3Ahttp%3A%2F%2Fid.loc.gov%2Fauthorities%2Fnames",
KB> >>               sep = "")
KB> >>
KB> >> #Make the call
KB> >> result = readLines(query)
KB> >>
KB> >> #Find the lines containing "Corporate Name"
KB> >> lines = grep("Corporate Name", result)
KB> >>
KB> >> #Alternatively use approximate string matching on the downloaded LCNAF
KB> >> #data
KB> >> query <- agrep("corporate name here", LCNAF_data_here)
KB> >>
KB> >> #Parse for whatever info you want
KB> >> ...
KB> >>
KB> >> My native programming language is R so I hope the functions like paste,
KB> >> readLines, grep, and URLencode are generic enough for other languages to
KB> >> have some kind of similar thing. This can just be wrapped up into a for
KB> >> loop:
KB> >> for(i in 1:40000){...}
KB> >>
KB> >> Web scraping the results of one name at a time would be SLOW and
KB> >> obviously using an API is the way to go but it didn't look like the OCLC
KB> >> LCNAF API handled Corporate Name. However, it sounds like in the previous
KB> >> message someone found a work around. Best of luck! -Simon
KB> >>
KB> >> On Mon, Sep 29, 2014 at 8:45 AM, Matt Carruthers <mcarr...@umich.edu>
KB> >> wrote:
KB> >>
KB> >> Hi Patrick,
KB> >>>
KB> >>> Over the last few weeks I've been doing something very similar. I was
KB> >>> able to figure out a process that works using OpenRefine. It works by
KB> >>> searching the VIAF API first, limiting results to anything that is a
KB> >>> corporate name and has an LC source authority. OpenRefine then
KB> >>> extracts the LCCN and puts that through the LCNAF API that OCLC has to
KB> >>> get the name.
KB> >>> I had to use VIAF for the initial name search because
KB> >>> for some reason the LCNAF API doesn't really handle corporate names as
KB> >>> search terms very well, but works with the LCCN just fine (there is
KB> >>> the possibility that I'm just doing something wrong, and if that's the
KB> >>> case, anyone on the list can feel free to correct me). In the end,
KB> >>> you get the LC name authority that corresponds to your search term and
KB> >>> a link to the authority on the LC Authorities website.
KB> >>>
KB> >>> Anyway, the process is fairly simple to run (just prepare an Excel
KB> >>> spreadsheet and paste JSON commands into OpenRefine). The only
KB> >>> reservation is that I don't think it will run all 40,000 of your names
KB> >>> at once. I've been using it to run 300-400 names at a time. That
KB> >>> said, I'd be happy to share what I did with you if you'd like to try
KB> >>> it out. I have some instructions written up in a Word doc, and the
KB> >>> JSON script is in a text file, so just email me off list and I can send
KB> >>> them to you.
KB> >>>
KB> >>> Matt
KB> >>>
KB> >>> Matt Carruthers
KB> >>> Metadata Projects Librarian
KB> >>> University of Michigan
KB> >>> 734-615-5047
KB> >>> mcarr...@umich.edu
KB> >>>
KB> >>> On Fri, Sep 26, 2014 at 7:03 PM, Karen Hanson <karen.han...@ithaka.org>
KB> >>> wrote:
KB> >>>
KB> >>>> I found the WorldCat Identities API useful for an institution name
KB> >>>> disambiguation project that I worked on a few years ago, though my
KB> >>>> goal wasn't to confirm whether names mapped to LCNAF. The API
KB> >>>> response includes
KB> >>>> a LCCN, and you can set it to fuzzy or exact matching, but you would
KB> >>>> need to write a script to pass each term in and process the results:
KB> >>>>
KB> >>>> http://oclc.org/developer/develop/web-services/worldcat-identities.en.html
KB> >>>>
KB> >>>> I also can't speak to whether all LC Name Authorities are
KB> >>>> represented, so there may be a chance of some false negatives.
KB> >>>>
KB> >>>> OCLC has another API, but not sure if it covers corporate names:
KB> >>>> https://platform.worldcat.org/api-explorer/LCNAF
KB> >>>>
KB> >>>> I suspect there are others on the list that know more about the
KB> >>>> inner workings of these APIs if this might be an option for you...
KB> >>>> :)
KB> >>>>
KB> >>>> Karen
KB> >>>>
KB> >>>> -----Original Message-----
KB> >>>> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
KB> >>>> Of Ethan Gruber
KB> >>>> Sent: Friday, September 26, 2014 3:54 PM
KB> >>>> To: CODE4LIB@LISTSERV.ND.EDU
KB> >>>> Subject: Re: [CODE4LIB] Reconciling corporate names?
KB> >>>>
KB> >>>> I would check with the developers of SNAC
KB> >>>> (http://socialarchive.iath.virginia.edu/), as they've spent a lot of
KB> >>>> time developing named entity recognition scripts for personal and
KB> >>>> corporate names. They might have something you can reuse.
KB> >>>>
KB> >>>> Ethan
KB> >>>>
KB> >>>> On Fri, Sep 26, 2014 at 3:47 PM, Galligan, Patrick
KB> >>>> <pgalli...@rockarch.org> wrote:
KB> >>>>
KB> >>>>> I'm looking to reconcile about 40,000 corporate names against
KB> >>>>> LCNAF to see whether they are authorized strings or not, but I'm
KB> >>>>> drawing a blank about how to get it done.
KB> >>>>>
KB> >>>>> I've used http://freeyourmetadata.org/ for reconciling subject
KB> >>>>> headings before, but I can't get it to work for LCNAF. Has anyone
KB> >>>>> had any experience in a project like this? I'd love to hear some
KB> >>>>> ideas for automatically dealing with a large data set like this
KB> >>>>> that we did not create and do not know how the names were created.
KB> >>>>>
KB> >>>>> Thanks!
KB> >>>>>
KB> >>>>> -Patrick Galligan
KB> >>>>>
KB> >>
KB> >> --
KB> >> Simon Brown
KB> >> simoncbr...@gmail.com
KB> >> simoncharlesbrown (Skype)
KB> >> 831.440.7466 (Phone)
KB> >>
KB> >> *Following our will and wind we may just go where no one's been -- MJK*
KB> >>
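
For anyone who wants to try the approach Kyle describes at the top of the thread (download the data, normalize both sides, compare via hash), here is a rough sketch in Python rather than R. It assumes the authorized labels have already been pulled out of the downloaded LCNAF file into a plain list of strings; how to do that depends on the download format and is not shown. The function names (normalize, build_index, reconcile) and the normalization rules are only illustrative, not the official NACO normalization, and the sample names are made up. The "hash" here is simply a Python dict keyed on the normalized string.

```python
import unicodedata


def normalize(name):
    """Reduce a heading to a comparison key (illustrative rules, not NACO)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold, replace anything that is not a letter or digit with a space,
    # and collapse runs of whitespace.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in stripped.casefold())
    return " ".join(cleaned.split())


def build_index(authorized_labels):
    """Map each normalized key to the authorized form (dict = hash table)."""
    return {normalize(label): label for label in authorized_labels}


def reconcile(local_names, index):
    """Return {local name: authorized label, or None if there is no match}."""
    return {name: index.get(normalize(name)) for name in local_names}


if __name__ == "__main__":
    # Stand-in for labels extracted from the downloaded LCNAF data.
    index = build_index(["Library of Congress", "Université de Montréal"])
    names = ["LIBRARY OF CONGRESS.", "Universite de Montreal", "Acme Corp"]
    for name, match in reconcile(names, index).items():
        print(name, "->", match)
```

Each lookup is a single dictionary access, so 40,000 names should take very little time once the index is built. Names that come back as None are the ones worth a second pass with fuzzy matching or one of the APIs mentioned above.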