Re: [CODE4LIB] dealing with Summon
> So, to solve the conundrum: only PublicationDate_xml and
> PublicationDate are of interest. If the former is given, use it and
> print (if available) its .month, .day, and .year fields. Else, if the
> latter is given, just print it.
> Ignore all other date-related fields. Ignore PublicationDate_xml.text.
> Ignore if there's more than one date field - use the first one.

Like I said, I don't claim that this is the one and only correct way to handle the data... but you've correctly described what we're doing in VuFind, and so far nobody has complained about it!

- Demian
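For reference, here is a minimal Python sketch of the rule summarized above. It is not VuFind's template code (that lives in the Smarty template linked elsewhere in this thread); the function name `display_date`, the month/day/year output format, and the fall-through to PublicationDate when a PublicationDate_xml entry has none of the three fields are all assumptions made for illustration. The record shape (every field holding an array) follows the JSON excerpt from the Summon documentation quoted in this thread.

```python
def display_date(record):
    """Return a printable publication date for one Summon record, or None."""
    xml_dates = record.get("PublicationDate_xml") or []
    if xml_dates:
        first = xml_dates[0]  # more than one entry: use the first one
        # Print month, day, and year where available; .text is ignored.
        parts = [first.get(key) for key in ("month", "day", "year")]
        parts = [p for p in parts if p]
        if parts:
            return "/".join(parts)
    # Assumption: fall back to the raw date if the parsed one is unusable.
    raw_dates = record.get("PublicationDate") or []
    if raw_dates:
        return raw_dates[0]
    return None
```

All other PublicationDate* fields are never read, which matches the "ignore all other date-related fields" part of the rule.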
Re: [CODE4LIB] dealing with Summon
On Wed, Mar 2, 2011 at 11:54 AM, Demian Katz wrote:
>> These are the questions I'm seeking answers to; I know that those of
>> you who have coded their own Summon front-ends must have faced the
>> same questions when implementing their record displays.
>
> Feel free to refer to VuFind's Summon template for reference if that is
> helpful:
>
> https://vufind.svn.sourceforge.net/svnroot/vufind/trunk/web/interface/themes/default/Summon/record.tpl
>
> Andrew wrote this originally, and I've tweaked it in a few places to address
> problems as they arose. I don't claim that this offers the definitive answer
> to your questions... but it's working reasonably well for us so far.
>

Ah, thanks. As they say, a piece of code speaks a thousand words!

So, to solve the conundrum: only PublicationDate_xml and PublicationDate are of interest. If the former is given, use it and print (if available) its .month, .day, and .year fields. Else, if the latter is given, just print it.
Ignore all other date-related fields. Ignore PublicationDate_xml.text.
Ignore if there's more than one date field - use the first one.

This knowledge will also help me avoid sending unnecessary data to the LibX client. As you know, Summon requires a proxy that talks to the actual service, and cutting out redundant and derived fields at the proxy could save a fair amount of bandwidth (though I'll have to check if it also shaves off latency). A typical search response (raw JSON, with 20 hits) is over 500KB long, so investing computing time at the proxy in cutting this down may be promising.

- Godmar
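The proxy-side trimming mentioned above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the top-level "documents" key is assumed from the documentation URL cited in this thread, the whitelist contains only field names that come up in this thread (a real client would also keep titles, authors, and so on), and the function names are hypothetical.

```python
import json

# Assumed whitelist: only field names discussed in this thread.
KEEP_FIELDS = frozenset(
    {"PublicationDate_xml", "PublicationDate", "url", "URL", "URI"}
)


def trim_document(doc, keep=KEEP_FIELDS):
    """Drop every field of one record that the client will not display."""
    trimmed = {k: v for k, v in doc.items() if k in keep}
    # PublicationDate_xml.text is ignored by the display rule, so drop it too.
    if "PublicationDate_xml" in trimmed:
        trimmed["PublicationDate_xml"] = [
            {k: v for k, v in entry.items() if k != "text"}
            for entry in trimmed["PublicationDate_xml"]
        ]
    return trimmed


def trim_response(raw_json, keep=KEEP_FIELDS):
    """Re-serialize a search response with only the whitelisted fields."""
    response = json.loads(raw_json)
    docs = response.get("documents", [])  # assumed top-level key
    slim = {"documents": [trim_document(d, keep) for d in docs]}
    # Compact separators avoid spending bytes on whitespace.
    return json.dumps(slim, separators=(",", ":"))
```

Whether this also helps latency depends on how the parse and re-serialize cost at the proxy compares with the transfer time saved, which would have to be measured.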
Re: [CODE4LIB] dealing with Summon
Yes, the draft version of SRU 2.0 does include support for facets. The functionality is based on the SOLR documentation of facets with perhaps some slight simplification. None of the editors of the standard are active facet users, so comments on that feature in our draft would be appreciated. (I'm afraid I'm responsible for that work. Personally, I found the SOLR functionality massively over-engineered and hope someone will recommend simplification.)

All the current draft documentation is available at http://www.loc.gov/standards/sru/oasis/.

Ralph

> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen Coombs
> Sent: Wednesday, March 02, 2011 12:45 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] dealing with Summon
>
> I believe that there has been discussion of adding facets to SRU
> responses in the past. It may even be part of the standard now; I'm not
> sure.
>
> Facets in an SRU and/or Atom response would certainly be of interest
> to OCLC. Another area where it might be nice to consider collaborating
> is on a format for these records that is non-library developer
> friendly but rich enough to provide the appropriate metadata. If
> you've used WorldCat Search API you'll know that as a developer you're
> caught between the complexity of MARC and the simplicity (but lack of
> richness) of Dublin Core/Atom/RSS.
>
> Is there a middle ground metadata format that developers would prefer
> to see output?
>
> Karen
>
> On Tue, Mar 1, 2011 at 1:36 PM, Andrew Nagy wrote:
> > Hi Godmar - to help answer some of your questions about the fields - I can
> > help address those directly. Though it would be interesting to hear
> > experiences from others who are working from APIs to search systems such as
> > Summon or others.
Re: [CODE4LIB] dealing with Summon
On Mar 2, 2011, at 12:22 PM, Ed Summers wrote:
> Oh, and I think it's great to see this thread on code4lib, where other
> people have been known to create an API or three. So thanks Godmar,
> for asking here...

I concur. I hope others more or less feel comfortable discussing product-specific issues on Code4Lib. Such discussions have more things in common than differences.

--
Eric Morgan
University of Notre Dame
Re: [CODE4LIB] dealing with Summon
I believe that there has been discussion of adding facets to SRU responses in the past. It may even be part of the standard now; I'm not sure.

Facets in an SRU and/or Atom response would certainly be of interest to OCLC. Another area where it might be nice to consider collaborating is on a format for these records that is non-library developer friendly but rich enough to provide the appropriate metadata. If you've used WorldCat Search API you'll know that as a developer you're caught between the complexity of MARC and the simplicity (but lack of richness) of Dublin Core/Atom/RSS.

Is there a middle ground metadata format that developers would prefer to see output?

Karen

On Tue, Mar 1, 2011 at 1:36 PM, Andrew Nagy wrote:
> Hi Godmar - to help answer some of your questions about the fields - I can
> help address those directly. Though it would be interesting to hear
> experiences from others who are working from APIs to search systems such as
> Summon or others.
>
> In regards to the publication date - the Summon API has the "raw date"
> (which comes directly from the content provider), but we also provide a
> field with a microformat containing the parsed and cleaned date that Summon
> has generated. We advise you to use our parsed and cleaned date rather
> than the raw date. The URL and URI fields are similar: the URL is the link
> that we have generated - the URI is what is provided by the content
> provider. In your case, you appear to be referring to OPAC records, so the
> URI is the ToC that came from the 856$u field in your MARC records. The URL
> is a link to the record in the OPAC.
>
> If you need more assistance around the fields that are available via Summon,
> I'd be happy to take this conversation off-list.
>
> I think an interesting conversation for the Code4Lib community would be
> around a standardized approach for an API that meets both the needs of the
> library developer and the product vendor.
Re: [CODE4LIB] dealing with Summon
Sorry, it wasn't my intention to derail the conversation, or anything. Just wanted to find out -- for my own purposes -- if there is also a Summon listserv. I'll go Google that, though.

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Godmar Back [god...@gmail.com]
Sent: Wednesday, March 02, 2011 8:38 AM
To: Code for Libraries
Cc: Walker, David
Subject: Re: [CODE4LIB] dealing with Summon

On Wed, Mar 2, 2011 at 11:36 AM, Walker, David wrote:
> Just out of curiosity, is there a Summon (API) developer listserv? Should
> there be?

Yes, there is - I'm waiting for my subscription there to be approved.

Like I said at the beginning of this thread, this is only tangentially a Code4Lib issue, and certainly the details aren't. But perhaps the general problem is (?)

- Godmar
Re: [CODE4LIB] dealing with Summon
On Wed, Mar 2, 2011 at 11:38 AM, Godmar Back wrote:
> Like I said at the beginning of this thread, this is only tangentially
> a Code4Lib issue, and certainly the details aren't. But perhaps the
> general problem is (?)

More than anything this seems like a documentation issue. From my seat in the peanut gallery it seems like Godmar should be able to answer these sorts of questions by looking at the Summon Search API Documentation [1] for responses (which is quite nice btw).

Oh, and I think it's great to see this thread on code4lib, where other people have been known to create an API or three. So thanks Godmar, for asking here...

//Ed

[1] http://api.summon.serialssolutions.com/help/api/search/response
Re: [CODE4LIB] dealing with Summon
> These are the questions I'm seeking answers to; I know that those of
> you who have coded their own Summon front-ends must have faced the
> same questions when implementing their record displays.

Feel free to refer to VuFind's Summon template for reference if that is helpful:

https://vufind.svn.sourceforge.net/svnroot/vufind/trunk/web/interface/themes/default/Summon/record.tpl

Andrew wrote this originally, and I've tweaked it in a few places to address problems as they arose. I don't claim that this offers the definitive answer to your questions... but it's working reasonably well for us so far.

- Demian
Re: [CODE4LIB] dealing with Summon
Just out of curiosity, is there a Summon (API) developer listserv? Should there be?

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu
Re: [CODE4LIB] dealing with Summon
On Wed, Mar 2, 2011 at 11:36 AM, Walker, David wrote:
> Just out of curiosity, is there a Summon (API) developer listserv? Should
> there be?

Yes, there is - I'm waiting for my subscription there to be approved.

Like I said at the beginning of this thread, this is only tangentially a Code4Lib issue, and certainly the details aren't. But perhaps the general problem is (?)

- Godmar
Re: [CODE4LIB] dealing with Summon
On Wed, Mar 2, 2011 at 11:12 AM, Roy Tennant wrote:
> Godmar,
> I'm surprised you're asking this. Most of the questions you want
> answered could be answered by a basic programming construct: an
> if-then-else statement and a simple decision about what you want to
> use in your specific application (for example, do you prefer "text"
> with the period, or not?). About the only question that such a
> solution wouldn't deal with is "which fields are derived from which
> others", which strikes me as superfluous to your application if you
> know a hierarchy of preference. But perhaps I'm missing something
> here.

I'm not asking how to code it, I'm asking for the algorithm I should use, given the fact that I'm not familiar with the provenance and status of the data Summon returns (which, I understand, is a mixture of original, harvested data, and "cleaned-up", processed data).

Can you suggest such an algorithm, given the fact that each of the 8 elements I showed in the example (PublicationDateYear, PublicationDateDecade, PublicationDate, PublicationDateCentury, PublicationDate_xml.text, PublicationDate_xml.day, PublicationDate_xml.month, PublicationDate_xml.year) is optional? But wait, I think I've also seen records where there is a PublicationDateMonth, and records where some values have arrays of length > 1.

Can you suggest, or at least outline, such an algorithm?

It would be helpful to know, for instance, if the presence of a PublicationDate_xml field supplants any other PublicationDate* fields (does it?) If a PublicationDate_xml field is absent, which field would I want to look at next? Is PublicationDate more reliable than a combination of PublicationDateYear and PublicationDateMonth (and perhaps PublicationDateDay if it exists)?

If the PublicationDate_xml is present, then: should I prefer the .text option? What's the significance of that dot? Is it spurious, like the identifier you mentioned you find in raw MARC records? If not, what, if anything, is known about the presence of the other fields? What if multiple fields are given in an array? Is the ordering significant (e.g., the first one is more trustworthy)? Or should I sort them based on a heuristic (e.g., if "20100523" and "201005" are given, prefer the former)? What if the data is contradictory?

These are the questions I'm seeking answers to; I know that those of you who have coded their own Summon front-ends must have faced the same questions when implementing their record displays.

- Godmar
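To make the "sort them based on a heuristic" idea concrete, here is one possible Python sketch. It is purely a hypothetical illustration of the question being asked, not anything Summon or VuFind is known to do: it assumes compact all-digit dates of the form YYYY, YYYYMM, or YYYYMMDD, prefers the most specific one, and flags a pair as contradictory when the two disagree on the part they both specify. The function names are invented for this example.

```python
def _compact(dates):
    """Keep only all-digit strings shaped like YYYY, YYYYMM or YYYYMMDD."""
    return [d for d in dates if d.isdigit() and len(d) in (4, 6, 8)]


def most_specific(dates):
    """Prefer the longest compact date; otherwise fall back to the first value."""
    candidates = _compact(dates)
    if not candidates:
        return dates[0] if dates else None
    # max() returns the first maximal element, so ties keep the original order.
    return max(candidates, key=len)


def is_contradictory(dates):
    """True if two compact dates disagree on the prefix they both specify."""
    compact = _compact(dates)
    for i, a in enumerate(compact):
        for b in compact[i + 1:]:
            shared = min(len(a), len(b))
            if a[:shared] != b[:shared]:
                return True
    return False
```

With the values from the question, `most_specific(["201005", "20100523"])` picks "20100523", and the pair is not flagged as contradictory because both agree on 2010-05.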
Re: [CODE4LIB] dealing with Summon
Godmar,
I'm surprised you're asking this. Most of the questions you want answered could be answered by a basic programming construct: an if-then-else statement and a simple decision about what you want to use in your specific application (for example, do you prefer "text" with the period, or not?). About the only question that such a solution wouldn't deal with is "which fields are derived from which others", which strikes me as superfluous to your application if you know a hierarchy of preference. But perhaps I'm missing something here.

Roy
Re: [CODE4LIB] dealing with Summon
On Tue, Mar 1, 2011 at 11:14 PM, Roy Tennant wrote:
>> On Tue, Mar 1, 2011 at 2:14 PM, Godmar Back wrote:
>>
>> Similarly, the date associated with a record can come in a variety of
>> formats. Some are single-field (20080901), some are abbreviated
>> (200811), some are separated into year, month, date, etc. Some
>> records have a mixture of those.
>
> In this world of MARC (s/MARC/hurt) I call that an embarrassment of
> riches. I've spent some bit of time parsing MARC, especially lately,
> and just the fact that Summon provides a normalized date element is
> HUGE.

That's great to hear - but how do I know which elements to use?

For instance, look at the JSON excerpt at
http://api.summon.serialssolutions.com/help/api/search/response/documents

    "PublicationDateCentury": [
      "1900"
    ],
    "PublicationDateDecade": [
      "1970"
    ],
    "PublicationDateYear": [
      "1979"
    ],
    "PublicationDate": [
      "1979."
    ],
    "PublicationDate_xml": [
      {
        "day": "01",
        "month": "01",
        "text": "1979.",
        "year": "1979"
      }
    ],

Which one is the cleaned up date, and in which order shall I be looking for the date field in the record when some or all of this information is missing in a particular record?

Andrew responded that, if given, PublicationDate_xml is the preferred one - but this raises the question of which field in PublicationDate_xml to use: .text, .day, or .year? What if some are missing?
What if PublicationDate_xml is missing: do I then use or look for PublicationDate? Or is PublicationDateYear/Month/Decade preferred to PublicationDate? Which fields are derived from which others?

These are the types of questions I'm looking to answer.

- Godmar
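On the "which fields are derived from which others" question, one observation can be checked against the excerpt itself: the decade and century values shown are what you get by rounding the year down. The Python sketch below only demonstrates that arithmetic. Whether Summon actually derives those fields this way is not stated anywhere in this thread, so treat it as an assumption; the function name is hypothetical.

```python
def decade_and_century(year):
    """Round a four-digit year string down to its decade and century."""
    y = int(year)
    return str(y // 10 * 10), str(y // 100 * 100)
```

For the excerpt's year "1979" this gives ("1970", "1900"), matching PublicationDateDecade and PublicationDateCentury, which suggests those two fields carry no information beyond the year.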
Re: [CODE4LIB] dealing with Summon
> On Tue, Mar 1, 2011 at 2:14 PM, Godmar Back wrote:
>
> Similarly, the date associated with a record can come in a variety of
> formats. Some are single-field (20080901), some are abbreviated
> (200811), some are separated into year, month, date, etc. Some
> records have a mixture of those.

In this world of MARC (s/MARC/hurt) I call that an embarrassment of riches. I've spent some bit of time parsing MARC, especially lately, and just the fact that Summon provides a normalized date element is HUGE. That potentially takes that load off of your application, should it be forced to absorb native MARC.

This is the old Garbage In/Garbage Out (GIGO) issue. Case in point: just today I discovered that we have at least 1,600 856 fields in WorldCat that have the pipe symbol "|" in the second indicator position instead of the numeral one. Right, that means there are rocket scientists who thought the documentation was indicating a pipe symbol in that position. We then have granularity issues, punctuation issues, and variance in practice. And that's just for starters.

Huge props to Summon for trying to tackle some of these things, as we are attempting to do as well.

Roy
Re: [CODE4LIB] dealing with Summon
Hi Godmar - to help answer some of your questions about the fields - I can help address those directly. Though it would be interesting to hear experiences from others who are working from APIs to search systems such as Summon or others.

In regards to the publication date - the Summon API has the "raw date" (which comes directly from the content provider), but we also provide a field with a microformat containing the parsed and cleaned date that Summon has generated. We advise you to use our parsed and cleaned date rather than the raw date. The URL and URI fields are similar: the URL is the link that we have generated - the URI is what is provided by the content provider. In your case, you appear to be referring to OPAC records, so the URI is the ToC that came from the 856$u field in your MARC records. The URL is a link to the record in the OPAC.

If you need more assistance around the fields that are available via Summon, I'd be happy to take this conversation off-list.

I think an interesting conversation for the Code4Lib community would be around a standardized approach for an API that meets both the needs of the library developer and the product vendor. I recall a brief chat I had with Annette about this same topic at a NISO conference in Boston a while back. For example, we have SRU/W, but that does not provide support for all of the features that a search engine would need (i.e. facets, spelling corrections, recommendations, etc.). Maybe a new standard is needed - or maybe extending an existing one would solve this need? I'm all ears if you have any ideas.

Andrew

On Tue, Mar 1, 2011 at 2:14 PM, Godmar Back wrote:
> Hi -
>
> this is a comment/question about a particular discovery system
> (Summon), but perhaps of more general interest. It's not intended as
> flamebait or criticism of the vendor or people associated with it.
>
> When integrating Summon into LibX (which works quite nicely btw,
> gratuitous screenshot is attached to this email) I found myself amazed
> by the multitude of possible fields and combinations returned in the
> resulting records. For instance, some records contain fields 'url'
> (lower case), and/or 'URL' (upper case), and/or 'URI' (upper case).
> Which one to display, and how? For instance, some records contain an
> OPAC URL in the 'url' field, and a ToC link in the URI field. Why?
>
> Similarly, the date associated with a record can come in a variety of
> formats. Some are single-field (20080901), some are abbreviated
> (200811), some are separated into year, month, date, etc. Some
> records have a mixture of those.
>
> My question is how do other adopters of Summon, or of emerging
> discovery systems that provide direct access to their records in
> general, deal with the roughness of the records being returned? Are
> there best practices in how to extract information from them, and in
> how to prioritize relevant and weed out irrelevant or redundant
> information?
>
> - Godmar
>
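Going by the URL/URI distinction described above, a client could label the two kinds of links rather than guess which one to show. The Python sketch below is an illustration only: it assumes the URL and URI fields hold arrays of strings, like the date fields in the documented JSON, and the function name and label texts are invented. The lower-case 'url' field is left out because the explanation above does not cover it.

```python
# Labels paraphrase the explanation above; the wording is an assumption.
LINK_FIELDS = (
    ("URL", "Summon-generated link"),
    ("URI", "Link from content provider"),
)


def labelled_links(record):
    """Return (label, link) pairs for the URL and URI fields of one record."""
    links = []
    for field, label in LINK_FIELDS:
        for value in record.get(field) or []:
            links.append((label, value))
    return links
```

For an OPAC record this would yield the link to the catalog record first, followed by the ToC link taken from the 856$u field.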