Tom,

I am the guy who wrote sushi.py around this time last year. My apologies for 
the shabbiness of the code. It was meant to be primarily a proof of concept. 
It's definitely incomplete. I only completed the DB3 and JR1 report logic up to 
this point, but it would be easy enough to add other report types. You're also 
right that sushi.py doesn't do anything to dedupe data, but it would be very 
simple to write a script that reads through the SQL records and deletes dupes. 
You could also use the built-in UNIQUE flag in MySQL when creating your table 
so that duplicate records just don't get saved. If you use the CSV export 
functionality of sushi.py, Excel has some built-in dedupe features that would 
help as well.

Let me know if you'd like some help modifying sushi.py. I sort of gave up on it 
last spring. SUSHI implementation among vendors is still pretty shabby, and 
there are still some weaknesses in the SUSHI standard (I wrote about them in 
the Nov 2012 issue of Computers in Libraries). The productivity gains I was 
seeing from using SUSHI ended up being pretty low.

Josh Welker


-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Tom 
Keays
Sent: Friday, January 25, 2013 8:40 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Anyone have a SUSHI client?

I've been looking briefly at sushi.py, as a way to orient myself to collecting 
stats this way. I'm not intending to single out sushi.py, but looking at it 
(mainly the data structure at this point, and not the code itself), raises some 
questions about the best approach for collecting SUSHI data.

sushi.py seems to have a small number of routines; mainly to retrieve the XML 
file from a vendor and ingest the data in that file into a MySQL database. 
There are only MySQL tables for COUNTER JR1, DR1, DR2, and DR2 reports and they 
mirror, to a degree, the structure of the item records returned in the SUSHI 
xml. Here are the skeletons of 2 of the sushi.py SQL
tables:

counter_jr1
  id int,
  print_issn varchar,
  online_issn varchar,
  platform varchar,
  item_name text,
  data_type varchar,
  date_begin datetime,
  date_end datetime,
  ft_pdf int,
  ft_html int,
  ft_total varchar

counter_db3
  id int,
  platform varchar,
  item_name text,
  data_type varchar,
  date_begin datetime,
  date_end datetime,
  searches int,
  sessions int

On the face of it, this seems like a pretty good data structure (although I 
have a couple of concerns, that I will get to) but my main question is whether 
there is any agreement about how to collect this data? If I were to dig into 
some of the other SUSHI packages mentioned in this thread, what would I find 
there? Excel-formatted COUNTER reports are simply a table of columns, 
representing various fields, such as title (for JR1), platform, publisher (for 
JR1), ISSN (for JR1), etc., followed by columns for up to 12 months of the 
collected year, and then summary data.  JR1 reports have fulltext HTML, PDF, 
and Total columns. DR1 has two rows, one for searches and one for sesssions, 
with YTD totals in the final column. Similar data structures exist for other 
COUNTER reports. They rely on the user to interpret them and probably ought not 
to inform a decision for structuring the data in a database. Is there been any 
best practice for how COUNTER data is modeled in a database?

There are other COUNTER reports besides those four. For instance, some journal 
vendors do indeed report searches and sessions using the DR3 report, but others 
use the equivalent JR4 report, so I would have expected sushi.py to have a 
mechanism to collect these. Does SUSHI only deliver JR1, DR1, DR2, and DR2 
reports, or is this a problem with sushi.py?

Now, one of the selling points for SUSHI is that if a vendor ever advises that 
you should re-collect data for a given time period, the xml you receive is 
structured such that the act of collecting OUGHT TO update, rather than 
duplicate, data previously collected. However in sushi.py's SQL structure, 
which gives every row a unique (auto-incremented) ID number, there would have 
to be logic applied during the ingest to prevent multiple instances of data 
collected from the same vendor for the same time period.
So, that's a concern.

I'm also concerned about what is represented in the ft_pdf, ft_html, and 
ft_total fields. In the Excel COUNTER reports, the ft_pdf, ft_html, and 
ft_total columns simply tabulate the YTD totals and the only way you would be 
able to derive a monthly breakdown would be to collect 12 monthly reports and 
analyze the differences from month to month -- something that most libraries 
don't do. I have to go back and confirm this, but I don't think the SUSHI 
reports are giving a month-only breakdown for those fields, so I wonder about 
their inclusion in that table. I guess my question is what is returned in the 
SUSHI xml report: monthly or yearly figures for the ft_pdf, ft_html, and 
ft_total fields?

Tom

Reply via email to