Thank you, Roy and Simon, for the info. Simon, as for your second point, I suppose one advantage of using the WorldCat API at this experimental stage is that the returned bib records are already FRBR-ized.
Ross - Thanks for the link to the Open Library data dump. The WorldCat
collection is two orders of magnitude larger than Open Library, which makes
a significant difference considering the skewness and sparsity of bib
records classified according to library taxonomies, e.g., DDC and LCC (for
more info, see:
http://cdm15003.contentdm.oclc.org/cdm/singleitem/collection/p267701coll27/id/277/rec/28).

Thanks,
Arash

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero
Sent: 22 May 2012 19:47
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set

Arash - you might not want to use a straight dump of WorldCat catalog
records, at least not without the associated holdings information.*

There are a lot of quasi-duplicate records that are sufficiently broken
that the WorldCat de-duplication algorithm refuses to merge them. These
records will usually only be used by a handful of institutions; the better
records will tend to have more associated holdings. The holdings count
should be used to weight the strength of association between class numbers
and features.

Also, since classification/categorization is something that is usually
considered to be a property of works, rather than manifestations, one might
get better results by using Work sets for training.

I would suggest, er, contacting Thom Hickey.

Simon

* Well, not precisely holdings - you just need the number of distinct
institutions with at least one copy. I call them 'hasings'.

On Sat, May 19, 2012 at 8:42 PM, Roy Tennant <roytenn...@gmail.com> wrote:

> Arash,
> Yes, we have made WorldCat available to researchers under a special
> license agreement. I suggest contacting Thom Hickey<hic...@oclc.org>
> about such an arrangement.
> Thanks,
> Roy
>
> On Fri, May 18, 2012 at 3:46 AM, Arash.Joorabchi <arash.joorab...@ul.ie> wrote:
> > Dear Karen,
> >
> > I am conducting a research experiment on automatic text classification
> > and I am trying to retrieve top matching bib records (which include DDC
> > fields) for a set of keyphrases extracted from a given document. So, I
> > suppose this is a rather exceptional use case. In fact, the right
> > approach for this experiment is to process the full dump of the WorldCat
> > database directly rather than sending a limited number of queries via
> > the API.
> >
> > I read here:
> > http://dltj.org/article/worldcat-lld-may-become-available-under-odc-by/
> > that WorldCat might become available as open linked data in future,
> > which would solve my problem and help similar text mining projects.
> > However, I wonder if it is currently available to researchers under a
> > research/non-commercial use license agreement.
> >
> > Regards,
> > Arash
> >
> > -----Original Message-----
> > From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen Coombs
> > Sent: 17 May 2012 08:37
> > To: CODE4LIB@LISTSERV.ND.EDU
> > Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
> >
> > I forwarded this thread to the Product Manager for the WorldCat Search
> > API. She responded back that unfortunately this query is not possible
> > using the API at this time.
> >
> > FYI, the SRU interface to WorldCat Search API doesn't currently
> > support any scan type searches either.
> >
> > Is there a particular use case you're trying to support? Knowing that
> > would help us document this as a possible enhancement.
> >
> > Karen
> >
> > Karen Coombs
> > Senior Product Analyst
> > Web Services
> > OCLC
> > coom...@oclc.org
> >
> > On Wed, May 16, 2012 at 9:49 PM, Arash.Joorabchi <arash.joorab...@ul.ie> wrote:
> >> Hi Andy,
> >>
> >> I am a SRU newbie myself, so I don't know how this could be achieved
> >> using scan operations and could not find much info on SRU website
> >> (http://www.loc.gov/standards/sru/).
> >>
> >> As for the wildcards, according to this guide:
> >> http://www.oclc.org/support/documentation/worldcat/searching/refcard/searchworldcatquickreference.pdf
> >> the symbols should be preceded by at least 3 characters, and
> >> therefore clauses like:
> >>
> >> ... AND srw.dd=*
> >> ... AND srw.dd=?.*
> >> ... AND srw/dd=###.*
> >> ... AND srw/dd=?3.*
> >>
> >> do not work and result in the following error:
> >>
> >> Diagnostics
> >> Identifier: info:srw/diagnostic/1/9
> >> Meaning:
> >> Details:
> >> Message: Not enough chars in truncated term:Truncated words too short(9)
> >>
> >> Thanks,
> >> Arash
> >>
> >> ________________________________
> >>
> >> From: Houghton,Andrew [mailto:hough...@oclc.org]
> >> Sent: 16 May 2012 11:58
> >> To: Arash.Joorabchi
> >> Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
> >>
> >> I'm not an SRU guru, but is it possible to do a scan and look for a
> >> postings of zero?
> >>
> >> Andy.
> >>
> >> On May 16, 2012, at 6:39, "Arash.Joorabchi" <arash.joorab...@ul.ie> wrote:
> >>
> >> Hi mark,
> >>
> >> Srw.dd=* does not work either:
> >>
> >> Identifier: info:srw/diagnostic/1/27
> >> Meaning:
> >> Details: srw.dd
> >> Message: The index [srw.dd] did not include a searchable value
> >>
> >> I suppose the only option left is to retrieve everything and
> >> filter the results on the client side.
> >>
> >> Thanks for your quick reply.
> >> Arash
> >>
> >> -----Original Message-----
> >> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Mike Taylor
> >> Sent: 16 May 2012 10:43
> >> To: CODE4LIB@LISTSERV.ND.EDU
> >> Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
> >>
> >> There is no standard way in CQL to express "field X is not empty".
> >> Depending on implementations, NOT srw.dd="" might work (but evidently
> >> doesn't in this case). Another possibility is srw.dd=*, but again
> >> that may or may not work, and might be appallingly inefficient if it
> >> does. NOT srw.dd=null will definitely not work: "null" is not a
> >> special word in CQL.
> >>
> >> -- Mike.
> >>
> >> On 16 May 2012 10:32, Arash.Joorabchi <arash.joorab...@ul.ie> wrote:
> >> > Hi all,
> >> >
> >> > I am sending SRU queries to WorldCat in the following form:
> >> >
> >> > String host = "http://worldcat.org/webservices/catalog/search/";
> >> > String query = "sru?query=srw.kw=\"" + keyword + "\""
> >> >         + " AND srw.ln exact \"eng\""
> >> >         + " AND srw.mt all \"bks\""
> >> >         + " AND srw.nt=\"" + keyword + "\""
> >> >         + "&servicelevel=full"
> >> >         + "&maximumRecords=100"
> >> >         + "&sortKeys=relevance,,0"
> >> >         + "&wskey=[wskey]";
> >> >
> >> > This is working fine; however, I'd like to limit the results to those
> >> > records that have a DDC number assigned to them, but I don't know the
> >> > right way to specify this limit in the query.
> >> >
> >> > NOT srw.dd=""
> >> > NOT srw.dd=null
> >> >
> >> > Neither of the above works.
> >> >
> >> > Thanks,
> >> > Arash
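
For anyone following the "retrieve everything and filter the results on the client side" route mentioned in the thread, here is a minimal sketch in Java. It assumes the SRU response carries MARCXML records (namespace http://www.loc.gov/MARC21/slim) and that the DDC number sits in MARC field 082, subfield a. The class name DdcFilter, its method names, and the wrapper elements in the sample are made up for illustration and are not part of any OCLC API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: keeps only the records that carry a DDC number.
final class DdcFilter {
    // MARCXML namespace; assumed to be the record schema of the response.
    static final String MARC_NS = "http://www.loc.gov/MARC21/slim";

    /** Returns the first 082 $a value of every record that has one. */
    static List<String> ddcNumbers(String responseXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(responseXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse response XML", e);
        }
        List<String> result = new ArrayList<>();
        NodeList records = doc.getElementsByTagNameNS(MARC_NS, "record");
        for (int i = 0; i < records.getLength(); i++) {
            String ddc = firstDdc((Element) records.item(i));
            if (ddc != null) {
                result.add(ddc); // records without a DDC number are skipped
            }
        }
        return result;
    }

    /** Returns the first non-empty 082 $a in the record, or null if none. */
    static String firstDdc(Element record) {
        NodeList fields = record.getElementsByTagNameNS(MARC_NS, "datafield");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            if (!"082".equals(field.getAttribute("tag"))) {
                continue;
            }
            NodeList subfields = field.getElementsByTagNameNS(MARC_NS, "subfield");
            for (int j = 0; j < subfields.getLength(); j++) {
                Element subfield = (Element) subfields.item(j);
                String value = subfield.getTextContent().trim();
                if ("a".equals(subfield.getAttribute("code")) && !value.isEmpty()) {
                    return value;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Tiny made-up sample: one record with an 082 field, one without.
        String sample = "<records>"
                + "<record xmlns=\"" + MARC_NS + "\">"
                + "<datafield tag=\"082\" ind1=\"0\" ind2=\"4\">"
                + "<subfield code=\"a\">025.04</subfield></datafield></record>"
                + "<record xmlns=\"" + MARC_NS + "\">"
                + "<datafield tag=\"245\" ind1=\"1\" ind2=\"0\">"
                + "<subfield code=\"a\">Untitled</subfield></datafield></record>"
                + "</records>";
        System.out.println(ddcNumbers(sample));
    }
}
```

Because the filtering happens after retrieval, a page of maximumRecords=100 may yield far fewer usable records, so a caller would probably need to page through more results to collect enough classified ones.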