Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-12-06 Thread Julie Hardesty
Hi Kelley - I conducted that usability test on Scherzo and wrote that
report so I can answer your questions!  I think a work-focused approach can
work for users, but we had to scale back on what we assumed users would
understand on the search results page.  After this test of the system, we
changed the search results interface to identify within the works list how
many scores and recordings contained that work, so the works list looked
more like a facet.  The works list then wasn't just a list of titles, but
was tied more directly to the recordings/scores result list (which is
directly below the works list on the search results page).

I do think that some of the testing results we saw reflected how users are
used to searching for music in traditional catalogs.  While the work is a
key concept for musicians, they may have gotten used to the fact that
searching for or scanning a results list for a work title often isn't easy
(or even possible) in a library catalog so either the title of the album or
a person's name is the real key to finding stuff.  I think that also might
have been part of what threw people off seeing the works listed in the
search results.  They didn't believe they were seeing titles of songs -
they thought they were seeing titles of albums or something that was some
sort of physical item.  They weren't really sure what it was and so they
just skipped that list of things.  So adding the info that, for example, a
work title is found on 5 recordings/scores really helped to identify the
works list as such.

Music is kind of unique within FRBR since several works can be involved in
a single manifestation (recording or score) and a single work can have many
different expressions (different performances by different people of the
same work).  Other types of resources like books and movies don't often
line up with the FRBR model the same way.  I can't say for sure whether or
not the interface we arrived at after this testing (
http://vfrbr.info/scherzo/) could be used for other work-based resources
with a works list serving as a facet to narrow down results, but it seems
to be a good use of the FRBR model.

Here's an example of a search that I think brings out the strength of what
this type of works list can do.  Searching in Scherzo for something like
"symphony no. 5" as Keyword results in several works with that same (or
similar) title and lots of recordings and scores that contain expressions
of all of the different "symphony no. 5" works.  The facet nature of
showing how many recordings/scores contain that work can help to
distinguish which work is the "symphony no. 5" you actually want and helps
identify that works list as a list of "symphony no. 5" works by different
composers.
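As a sketch of the aggregation behind that display (the titles and data shapes below are invented for illustration; the real system works from FRBRized MARC data, not Python dicts):

```python
from collections import Counter

# Invented toy data: each manifestation (a recording or a score) lists
# the works it contains; works and manifestations are many-to-many.
manifestations = [
    {"title": "Beethoven: Symphonies 5 & 6", "type": "recording",
     "works": ["Symphony no. 5 (Beethoven)", "Symphony no. 6 (Beethoven)"]},
    {"title": "Mahler: Symphony no. 5", "type": "recording",
     "works": ["Symphony no. 5 (Mahler)"]},
    {"title": "Symphony no. 5 in C minor (score)", "type": "score",
     "works": ["Symphony no. 5 (Beethoven)"]},
]

def works_facet(manifestations):
    """Count how many recordings/scores contain each work, so the works
    list can be shown as a facet over the recordings/scores results."""
    counts = Counter()
    for m in manifestations:
        for work in m["works"]:
            counts[work] += 1
    return counts.most_common()

for work, n in works_facet(manifestations):
    print(f"{work}: found on {n} recording(s)/score(s)")
```

The count is what signals to users that each entry is a work rather than an album or a physical item.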

I hope this is helpful - it was an interesting project to test these
FRBRized search concepts and it would be great to see further experiments
with this idea, specifically with non-music resources to see if it can be
applied or not.  Let me know if you have any more questions about what we
did with the Scherzo interface and best of luck on your project!

Julie Hardesty
Metadata Analyst
Metadata Resources & Systems
Library Technologies
Indiana University




Re: [CODE4LIB] Looking for two coders to help with discoverability of videos - FRBR

2013-12-06 Thread Notess, Mark
 Does this mean that a work-focused approach is not actually what users
 want or need? Does it mean that the work-centered approach needs to be
 implemented differently in the user interface? Are these results somehow
 specific to music? Do they reflect users' familiarity with the typical
 library catalog and the strategies they've become accustomed to using?

FRBR is a wonderful model of our corner of reality. But users aren't
model-oriented; they are task-oriented. They are trying to get stuff done.
So the user interface has to make the translation from how systems like to
think about the world to how users think about their work. And yes, how
users think about their work is shaped by the systems and concepts they've
interacted with previously, setting their expectations. But not entirely.

To some extent, the Scherzo interface represents an acknowledgement of
this after what we learned in the Variations project when trying to make a
stepwise FRBRish disambiguation search interface. Here's our paper
describing that earlier effort:
http://www.dlib.indiana.edu/~jenlrile/publications/ecdl2004/ecdl.pdf

Mark
--
Mark Notess
Head, User Experience and Digital Media Services
Library Technologies
Indiana University Bloomington Libraries
+1.812.856.0494
mnot...@iu.edu 




Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-12-03 Thread Dunn, Jon William Butcher
Hi Kelley,

If you haven't already, you might want to look at the music score and sound 
recording FRBRization work done on the Variations-FRBR project here at Indiana 
University. I'm not sure how directly useful this would be for your work with 
moving images, but there may be some useful mapping ideas:

FRBR XML schemas: 
http://www.dlib.indiana.edu/projects/vfrbr/schemas/1.1/index.shtml 

MARC-FRBR mapping specifications: 
http://www.dlib.indiana.edu/projects/vfrbr/projectDoc/metadata/mappings/spring2010/vfrbrSpring2010mappings.shtml
 

Java FRBRization code and documentation: 
http://www.dlib.indiana.edu/projects/vfrbr/projectDoc/index.shtml 

Jon

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kelley 
McGrath
Sent: Tuesday, December 03, 2013 12:35 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Looking for two coders to help with discoverability of 
videos

Robert,

Your work also sounds very interesting and definitely overlaps with some of 
what we want to do. It seems like a lot of people are trying to get useful 
format information out of MARC records and it's unfortunate that it is so 
complicated. I would be very interested to see your logic for determining 
format and dealing with self-contradictory records. Runtime from the 008 is, as 
you say, pretty straightforward, but it is not always filled out, and it is 
useless if the resource is longer than 999 minutes.

It's interesting that you mention identifying directors. We have also been 
working on a similar, although more generalized, process. We're trying to 
identify all of the personal and organizational names mentioned in video 
records and, where possible, their roles. Our existing process is pretty 
accurate for personal names and for roles in English. It tends to struggle with 
credits involving multiple corporate bodies and we're working on building a 
lexicon of non-English terms for common roles. We're also trying to get people 
to hand-annotate credits to build a corpus to help us improve our process. 
(Help us out at http://olac-annotator.org/. And if you're willing to be on call 
to help with translating non-English credits, email me with the language(s) 
you'd be able to help out with. We also just started a mailing list at 
https://lists.uoregon.edu/mailman/listinfo/olac-credits)
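As a sketch of how such a lexicon might be applied once built (the terms and mappings below are an invented stub, not our actual lexicon):

```python
# Invented stub of a non-English role lexicon; the terms and the
# normalization here are illustrative assumptions only.
ROLE_LEXICON = {
    "directed by": "director",
    "réalisateur": "director",   # French
    "regie": "director",         # German
    "produced by": "producer",
    "producteur": "producer",    # French
}

def find_roles(credit):
    """Return the normalized roles whose lexicon terms appear in a credit."""
    text = credit.lower()
    return sorted({role for term, role in ROLE_LEXICON.items() if term in text})

print(find_roles("Réalisateur: Jean Renoir ; producteur, Pierre Braunberger"))
# -> ['director', 'producer']
```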

Matching MARC records for moving images with external data sources is also on 
our radar. Most feature film type material can probably be identified by the 
attributes you mention: title, original date and director (probably 2 out of 3 
would work in most cases). We are also hoping to use these attributes (and 
possibly others) to cluster records for the same FRBR work.
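The 2-out-of-3 idea is simple to sketch. Here is a naive greedy version with invented sample records; real work clustering would also need title normalization, blocking to scale, and handling of missing attributes:

```python
def same_work(a, b):
    """Hypothetical 2-out-of-3 match on title, original date, director."""
    return sum([
        a["title"].lower() == b["title"].lower(),
        a["date"] == b["date"],
        a["director"].lower() == b["director"].lower(),
    ]) >= 2

def cluster(records):
    """Greedy single pass: attach each record to the first cluster whose
    first member matches it on at least two attributes."""
    clusters = []
    for rec in records:
        for members in clusters:
            if same_work(members[0], rec):
                members.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

# Invented sample records for the same film under variant titles:
records = [
    {"title": "Seven Samurai", "date": "1954", "director": "Kurosawa"},
    {"title": "Shichinin no samurai", "date": "1954", "director": "Kurosawa"},
    {"title": "Seven Samurai", "date": "1954", "director": "Kurosawa"},
]
print(len(cluster(records)))  # -> 1
```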

It would be great to talk with you more about this off-list.

Kelley
kell...@uoregon.edu


Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-12-03 Thread Kelley McGrath
Thanks, Jon. I have seen the Variations work and also talked to Jenn Riley 
about it. It has definitely influenced me, although we are going in a slightly 
different direction and moving images have some different needs from music.

One thing about Variations that struck me is this paragraph from the usability 
testing report 
(http://www.dlib.indiana.edu/projects/vfrbr/projectDoc/usability/usabilityTest/ScherzoUTestReport.pdf):

There was an assumption among the development team that works would be a 
window for organizing and narrowing results in a way that users searching for 
scores and recordings would find useful. One of the main ideas behind FRBR is 
that the work, or the intellectual entity that is produced by people and is 
packaged in many forms, is the core information – Scherzo’s interface reflected 
that organization. (See Appendix E, Fig. 14 for Scherzo’s search results 
page.) But the participants tended to latch on to a person’s name and search 
for that name in a particular role. The reasons for this are not completely 
clear and further discussion follows, but it is worth bearing this finding in 
mind. Additionally, from the search results page, work results were clicked 
only 14 times in comparison to items in recordings & scores, which were 
clicked 65 times. Regardless of how the FRBRized data is organized on the back 
end, the interface needs to reflect the way users want to search, and that 
might not mean with search results organized by work.

Does this mean that a work-focused approach is not actually what users want or 
need? Does it mean that the work-centered approach needs to be implemented 
differently in the user interface? Are these results somehow specific to music? 
Do they reflect users' familiarity with the typical library catalog and the 
strategies they've become accustomed to using?

It does suggest to me that there should be more studies on how users interact 
with FRBRized data (and not just the clustering that so many discovery 
interfaces do now, but real FRBR-based data) and how FRBRized data is best 
presented.

Kelley

On Tue, Dec 3, 2013 at 11:35 AM, Dunn, Jon William Butcher 
j...@iu.edu wrote:
Hi Kelley,

If you haven't already, you might want to look at the music score and sound 
recording FRBRization work done on the Variations-FRBR project here at Indiana 
University. I'm not sure how directly useful this would be for your work with 
moving images, but there may be some useful mapping ideas:

FRBR XML schemas: 
http://www.dlib.indiana.edu/projects/vfrbr/schemas/1.1/index.shtml

MARC-FRBR mapping specifications: 
http://www.dlib.indiana.edu/projects/vfrbr/projectDoc/metadata/mappings/spring2010/vfrbrSpring2010mappings.shtml

Java FRBRization code and documentation: 
http://www.dlib.indiana.edu/projects/vfrbr/projectDoc/index.shtml

Jon


Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-12-02 Thread Alexander Duryee
Is it out of the question to extract technical metadata from the
audiovisual materials themselves (via MediaInfo et al)?  It would minimize
the amount of MARC that needs to be processed and give more
accurate/complete data than relying on old cataloging records.
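For files in hand, something like the following could work. This sketch parses the JSON that `mediainfo --Output=JSON` can emit; the sample below is invented and abbreviated, so the flag and field names should be checked against your MediaInfo version:

```python
import json

# Invented, abbreviated stand-in for `mediainfo --Output=JSON file.mov`
# output; the track/field structure is an assumption to verify.
sample = json.loads("""
{"media": {"track": [
  {"@type": "General", "Format": "MPEG-4", "Duration": "5400.000"},
  {"@type": "Video", "Width": "720", "Height": "480",
   "DisplayAspectRatio": "1.333"}
]}}
""")

def summarize(info):
    """Pull runtime, container format, and aspect ratio out of the tracks."""
    tracks = {t["@type"]: t for t in info["media"]["track"]}
    general = tracks.get("General", {})
    video = tracks.get("Video", {})
    return {
        "runtime_minutes": round(float(general.get("Duration", 0)) / 60),
        "format": general.get("Format"),
        "aspect_ratio": video.get("DisplayAspectRatio"),
    }

print(summarize(sample))
```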


On Mon, Dec 2, 2013 at 12:37 AM, Kelley McGrath kell...@uoregon.edu wrote:

 I wanted to follow up on my previous post with a couple points.

 1. This is probably too late for anybody thinking about applying, but I
 thought there may be some general interest. I have put up some more
 detailed specifications about what I am hoping to do at
 http://pages.uoregon.edu/kelleym/miw/. Data extraction overview.doc is
 the general overview and the other files contain supporting documents.

 2. I replied some time ago to Heather's offer below about her website that
 will connect researchers with volunteer software developers. I have to
 admit that looking for volunteer software developers had not really
 occurred to me. However, I do have additional things that I would like to
 do for which I currently have no funding so if you would be interested in
 volunteering in the future, let me know.

 Kelley
 kell...@uoregon.edu


 On Tue, Nov 12, 2013 at 6:33 PM, Heather Claxton claxt...@gmail.com wrote:
 Hi Kelley,

 I might be able to help in your search.   I'm in the process of starting a
 website that connects academic researchers with volunteer software
 developers.  I'm looking for people to post programming projects on the
 website once it's launched in late January.   I realize that may be a
 little late for you, but perhaps the project you mentioned in your PS
  (clustering based on title, name, date etc.) would be perfect?  The
 one caveat is that the website is targeting software developers who wish to
 volunteer.   Anyway, if you're interested in posting, please send me an
  e-mail at sciencesolved2...@gmail.com.
I would greatly appreciate it.
 Oh and of course it would be free to post  :)  Best of luck in your
 hiring process,

 Heather Claxton-Douglas


 On Mon, Nov 11, 2013 at 9:58 PM, Kelley McGrath kell...@uoregon.edu wrote:

  I have a small amount of money to work with and am looking for two people
  to help with extracting data from MARC records as described below. This
 is
  part of a larger project to develop a FRBR-based data store and discovery
  interface for moving images. Our previous work includes a consideration
 of
  the feasibility of the project from a cataloging perspective (
  http://www.olacinc.org/drupal/?q=node/27), a prototype end-user
 interface
  (https://blazing-sunset-24.heroku.com/,
  https://blazing-sunset-24.heroku.com/page/about) and a web form to
  crowdsource the parsing of movie credits (
  http://olac-annotator.org/#/about).
  Planned work period: six months beginning around the second week of
  December (I can be somewhat flexible on the dates if you want to wait and
  start after the New Year)
  Payment: flat sum of $2500 upon completion of the work
 
  Required skills and knowledge:
 
*   Familiarity with the MARC 21 bibliographic format
*   Familiarity with Natural Language Processing concepts (or
  willingness to learn)
*   Experience with Java, Python, and/or Ruby programming languages
 
  Description of work: Use language and text processing tools and provided
  strategies to write code to extract and normalize data in existing MARC
  bibliographic records for moving images. Refine code based on feedback
 from
  analysis of results obtained with a sample dataset.
 
  Data to be extracted:
  Tasks for Position 1:
  Titles (including the main title of the video, uniform titles, variant
  titles, series titles, television program titles and titles of contents)
  Authors and titles of related works on which an adaptation is based
  Duration
  Color
  Sound vs. silent
  Tasks for Position 2:
  Format (DVD, VHS, film, online, etc.)
  Original language
  Country of production
  Aspect ratio
  Flag for whether a record represents multiple works or not
  We have already done some work with dates, names and roles and have a
  framework to work in. I have the basic logic for the data extraction
  processes, but expect to need some iteration to refine these strategies.
 
  To apply please send me an email at kelleym@uoregon explaining why you
  are interested in this project, what relevant experience you would bring
  and any other reasons why I should hire you. If you have a preference for
  position 1 or 2, let me know (it's not necessary to have a preference).
 The
  deadline for applications is Monday, December 2, 2013. Let me know if you
  have any questions.
 
  Thank you for your consideration.
 
  Kelley
 
  PS In the near future, I will also be looking for someone to help with
  work clustering based on title, name, date and identifier data from MARC
  records. This will not involve any direct 

Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-12-02 Thread Kyle Banerjee
 Is it out of the question to extract technical metadata from the
 audiovisual materials themselves (via MediaInfo et al)?


One of the things that absolutely blows my mind is the widespread practice
of hand typing this stuff into records. Aside from an obvious opportunity
to introduce errors/inconsistencies, many libraries record details for the
archival versions rather than the access versions actually provided. So
patrons see a description for what they're not getting...

Just for the heck of it, sometime last year I scanned thousands of objects
and their descriptions to see how close they were. Like an idiot, I didn't
write up what I learned because I was just trying to satisfy my own
curiosity. However, the takeaway I got from the exercise was that the
embedded info is so much better than the hand keyed stuff that you'd be
nuts to consider the latter as authoritative. Curiously, I did find cases
where the embedded info was clearly incorrect. I can only guess that was
manually edited.

kyle


Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-12-02 Thread Roy Tennant
I would have to agree with this where the data exists. The data captured by
digital cameras these days can be incredibly extensive and thorough. Given
this, I recently started exposing this data for all of the 8,000 photos I
now have on my photo website, http://FreeLargePhotos.com/ . The page for an
individual photo now has a link a user can click to pull the data
dynamically from the image file and display it in plain text. Here is a
random example:

http://freelargephotos.com/photos/003171/exif.txt
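A dump in that form is easy to consume downstream. A minimal sketch, assuming "Key : Value" lines of the kind exiftool-style text output prints (the sample values below are invented):

```python
# Invented sample of a plain-text EXIF dump ("Key : Value" lines):
sample_dump = """\
Make                            : Canon
Model                           : Canon EOS 5D
Create Date                     : 2013:06:14 10:22:31
Exposure Time                   : 1/250
"""

def parse_exif_text(dump):
    """Parse key/value lines into a dict; splits on the first colon so
    colon-delimited EXIF dates stay intact."""
    fields = {}
    for line in dump.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

exif = parse_exif_text(sample_dump)
print(exif["Model"], "|", exif["Create Date"])
```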

The tricky bit is of course where the photo is actually scanned from a
slide, which plays havoc with items such as the creation date. So
depending on the exact situation your mileage may vary, but the basic
principle stands -- if you can allow a machine to capture the metadata,
then by all means let it.
Roy




Re: [CODE4LIB] Looking for two coders to help with discoverability of videos - Embedded Metadata

2013-12-02 Thread Kari R Smith
I've been working with embedded metadata for some years and there are great 
tools out there for embedding, extracting and reusing metadata (technical, 
administrative, and descriptive).  The tools allow for batch data entry and use 
metadata schemas or standards.  As a digital archivist whose job is to take in 
lots of this digitized content that generally has no context or that context is 
lost or misplaced, I wholly advocate for embedding metadata.  There are 
consumer products that can then expose this metadata so that it doesn't have to 
be retyped again and again.

What gets my goat is when I hear folks belabor the effort but don't talk about 
the rewards and opportunities that embedding metadata can bring.  Use cases are 
forthcoming from The Royal Library in Denmark about mass digitization and 
embedding metadata, as well as about using the Exif / IPTC Extension for 
describing the content in image files.  There's also work being done with 
video, audio, and CAD files.

Check out these resources on Embedded Metadata from the VRA Embedded Metadata 
Working Group (Greg Reser, Chair):
About Embedded Metadata:  
http://metadatadeluxe.pbworks.com/w/page/62407805/Concepts
http://metadatadeluxe.pbworks.com/w/page/20792256/Other%20Organizations
Case Studies:  http://metadatadeluxe.pbworks.com/w/page/62407826/Communities

Okay, I'll step off my soap box now...
Kari



Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-12-02 Thread Robert Haschart

Kelley,

The work you are proposing is interesting and overlaps somewhat both 
with work I have already done and with a new project I'm looking into 
here at UVa.
I have been the primary contributor to the Marc4j java project for the 
past several years and am the creator of the project SolrMarc which 
extracts data from Marc records based on a customizable specification, 
to build Solr index records to facilitate rich discovery.


Much of my work on creating and improving these projects has been in 
service of my actual job of creating and maintaining the Solr Index 
behind our Blacklight-based discovery interface.   As a part of that 
work I have created custom SolrMarc routines that extract the format of 
items similar to what is described in Example 3, including looking in 
the leader, 006, 007 and 008 to determine the format as-coded but 
further looking in the 245 h, 300 and 538 fields to heuristically 
determine when the format as-coded is incorrect and ought to be 
overridden.   Most of the heuristic determination is targeted towards 
Video material, and was initiated when I found an item that due to a 
coding error was listed as a Video in Braille format.
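A simplified sketch of that approach (not the actual SolrMarc routine; the leader mapping and carrier terms below are abbreviated assumptions):

```python
# Abbreviated leader/06 type-of-record mapping (assumption; the real
# routine consults the leader, 006, 007, and 008 in combination).
LEADER_TYPE = {"a": "Book", "g": "Video", "j": "Musical Recording"}

def detect_format(leader, f007="", f245h="", f300="", f538=""):
    """Format as-coded from the leader/007, overridden when the 245$h,
    300, or 538 contains an explicit carrier term that contradicts it."""
    coded = LEADER_TYPE.get(leader[6], "Unknown")
    if coded == "Unknown" and f007[:1] == "v":
        coded = "Video"
    evidence = " ".join([f245h, f300, f538]).lower()
    if "videodisc" in evidence or "dvd" in evidence:
        return "Video (DVD)"
    if "videocassette" in evidence or "vhs" in evidence:
        return "Video (VHS)"
    return coded

# A record miscoded as a book in the leader, but whose 300 field
# plainly describes a videodisc:
print(detect_format("00000cam a2200000 a 4500",
                    f300="1 videodisc (120 min.) : sd., col. ; 4 3/4 in."))
# -> Video (DVD)
```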


Further I have developed a set of custom routines that look more closely 
at Video items, one of which already extracts the runtime from the 
008[18-20] field. To modify it from its current form, which returns the 
runtime in minutes, to instead return it as HH:MM as specified in your 
xls file, and to further handle the edge case of 008[18-20] = 000 
(returning "over 16:39"), would literally take about 15 minutes.
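That change might look something like this sketch (a hedged approximation in Python, not the actual SolrMarc code; the sample 008 string is invented):

```python
def runtime_from_008(field_008):
    """Render 008 bytes 18-20 (running time in minutes for visual
    materials) as HH:MM; '000' codes a runtime of 1000 minutes or more."""
    raw = field_008[18:21]
    if raw == "000":
        return "over 16:39"          # 999 minutes == 16 hours 39 minutes
    if not raw.strip().isdigit():
        return None                  # 'nnn', '---', '|||', or blank
    minutes = int(raw)
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

# Invented sample 008 with running time '114' in bytes 18-20:
sample_008 = "920102s1992    xxu114            vleng d"
print(runtime_from_008(sample_008))  # -> 01:54
```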


Another of these custom routines, one that is more fully formed, is code 
for extracting the Director of a video from the Marc record.  It examines 
the contents of the fields 245c, 508a, 500a, 505a, 505t, employing 
heuristics and targeted natural language processing techniques, to 
attempt to correctly extract the Director.   At this point I believe 
it achieves better results than a careful cataloger would achieve, even 
one who specializes in film and video.
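A toy version of that heuristic (the real routine is far more involved, and the sample field values below are invented):

```python
import re

# Match "directed by" / "direction by" followed by a capitalized name.
DIRECTOR_RE = re.compile(
    r"direct(?:ed|ion)\s+by\s+([A-Z][\w.'-]+(?:\s+[A-Z][\w.'-]+)*)",
    re.IGNORECASE)

def extract_director(*fields):
    """Return the first director name matched in the given MARC field
    strings (e.g. 245$c, 508$a), or None."""
    for text in fields:
        m = DIRECTOR_RE.search(text)
        if m:
            return m.group(1).rstrip(" .")
    return None

# Invented sample field values; note "Director of photography" is
# correctly skipped because it lacks a "by" phrase:
f245c = "produced by Saul Zaentz ; directed by Milos Forman."
f508a = "Director of photography, Haskell Wexler."
print(extract_director(f245c, f508a))  # -> Milos Forman
```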


The other project I have just started investigating is an effort to 
create and/or flesh out Marc records for video items based on heuristic 
matching of title and director and date with data returned from 
publicly-accessible movie information sites.


This more recent work may not be relevant to your needs, but the custom 
extraction routines seem directly applicable to your goals and may also 
provide a template that makes your other goals more easily achievable.


-Robert Haschart

part of a larger project to develop a FRBR-based data store and discovery
interface for moving images. Our previous work includes a consideration of
the feasibility of the project from a cataloging perspective (
http://www.olacinc.org/drupal/?q=node/27), a prototype end-user interface
(https://blazing-sunset-24.heroku.com/,

Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-12-02 Thread Kelley McGrath
Well, that would be much easier, but most of what I am working with is records 
for physical items (DVD, VHS, film) or licensed streaming video. The sample 
records are also not all UO records, so I don't necessarily even have access to 
the source material (our goal is to build a general-purpose tool). So I think I 
am stuck with extracting from MARC.

We should be able to get data for some resources by matching the MARC up with 
external data sources. That won't work for everything, though, so we want to 
make the process of extracting data from MARC as effective as possible.

Kelley


On Mon, Dec 2, 2013 at 7:03 AM, Alexander Duryee 
alexanderdur...@gmail.com wrote:
Is it out of the question to extract technical metadata from the
audiovisual materials themselves (via MediaInfo et al)?  It would minimize
the amount of MARC that needs to be processed and give more
accurate/complete data than relying on old cataloging records.


Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-12-02 Thread Kelley McGrath
Robert,

Your work also sounds very interesting and definitely overlaps with some of 
what we want to do. It seems like a lot of people are trying to get useful 
format information out of MARC records and it's unfortunate that it is so 
complicated. I would be very interested to see your logic for determining 
format and dealing with self-contradictory records. Runtime from the 008 is, as 
you say, pretty straightforward, but it is not always filled in, and it is 
useless if the resource is longer than 999 minutes.

It's interesting that you mention identifying directors. We have also been 
working on a similar, although more generalized, process. We're trying to 
identify all of the personal and organizational names mentioned in video 
records and, where possible, their roles. Our existing process is pretty 
accurate for personal names and for roles in English. It tends to struggle with 
credits involving multiple corporate bodies and we're working on building a 
lexicon of non-English terms for common roles. We're also trying to get people 
to hand-annotate credits to build a corpus to help us improve our process. 
(Help us out at http://olac-annotator.org/. And if you're willing to be on call 
to help with translating non-English credits, email me with the language(s) 
you'd be able to help out with. We also just started a mailing list at 
https://lists.uoregon.edu/mailman/listinfo/olac-credits)
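As a small illustration of the role-lexicon idea above, a normalization table might look like the following Python sketch (the term list and the normalized labels here are my own examples, not the actual OLAC lexicon):

```python
# Toy role lexicon: map role terms found in credit strings, including
# non-English ones, onto normalized English role labels. A real lexicon
# would cover many more terms and languages.
ROLE_LEXICON = {
    "director": "director",
    "réalisateur": "director",    # French
    "regie": "director",          # German
    "producer": "producer",
    "produzent": "producer",      # German
    "screenplay": "screenwriter",
    "scénario": "screenwriter",   # French
}

def normalize_role(term):
    """Return a normalized role label, or 'unknown' if not in the lexicon."""
    return ROLE_LEXICON.get(term.strip().lower(), "unknown")
```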

Matching MARC records for moving images with external data sources is also on 
our radar. Most feature film type material can probably be identified by the 
attributes you mention: title, original date and director (probably 2 out of 3 
would work in most cases). We are also hoping to use these attributes (and 
possibly others) to cluster records for the same FRBR work.
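The "2 out of 3" idea could be sketched roughly like this (illustrative Python over made-up record dicts; real clustering would need much fuzzier title comparison than the crude normalization used here):

```python
import re

def _norm(s):
    """Crude normalization: lowercase and strip punctuation."""
    return re.sub(r"[^a-z0-9 ]", "", (s or "").lower()).strip()

def same_work(rec_a, rec_b, threshold=2):
    """Treat two records as candidates for the same work when at least
    `threshold` of {title, date, director} agree (and are non-empty)."""
    score = 0
    if _norm(rec_a.get("title")) and _norm(rec_a.get("title")) == _norm(rec_b.get("title")):
        score += 1
    if rec_a.get("date") and rec_a.get("date") == rec_b.get("date"):
        score += 1
    if _norm(rec_a.get("director")) and _norm(rec_a.get("director")) == _norm(rec_b.get("director")):
        score += 1
    return score >= threshold
```

With this rule, a record whose title varies (e.g. with or without an initial article) still clusters if date and director agree.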

It would be great to talk with you more about this off-list.

Kelley
kell...@uoregon.edu

From: Robert Haschart [rh...@virginia.edu]
Sent: Monday, December 02, 2013 10:49 AM
To: Code for Libraries
Cc: Kelley McGrath
Subject: Re: [CODE4LIB] Looking for two coders to help with discoverability of 
videos

Kelley,

The work you are proposing is interesting and overlaps somewhat both
with work I have already done and with a new project I'm looking into
here at UVa.
I have been the primary contributor to the Marc4j java project for the
past several years and am the creator of the project SolrMarc which
extracts data from Marc records based on a customizable specification,
to build Solr index records to facilitate rich discovery.

Much of my work on creating and improving these projects has been in
service of my actual job of creating and maintaining the Solr Index
behind our Blacklight-based discovery interface. As part of that work I
have created custom SolrMarc routines that extract the format of items,
similar to what is described in Example 3: they look in the leader and
the 006, 007 and 008 fields to determine the format as coded, but then
look further at the 245 $h, 300 and 538 fields to heuristically
determine when the format as coded is incorrect and ought to be
overridden. Most of the heuristic determination is targeted at video
material, and was initiated when I found an item that, due to a coding
error, was listed as a video in Braille format.
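That kind of override logic might look roughly like the following (a Python sketch of the approach as described, not the actual SolrMarc routine, which is Java; field values are passed in as plain strings):

```python
# Sketch of the format-override heuristic: trust the coded fields
# (leader/008) first, but let explicit physical-description evidence
# (245 $h GMD, 300, 538) override an implausible coding.

def coded_format(leader, f008):
    """Format as coded: leader position 06 gives the record type."""
    rec_type = leader[6] if len(leader) > 6 else " "
    if rec_type == "g":          # projected medium (video, film)
        return "Video"
    if rec_type == "a":          # language material
        return "Book"
    return "Unknown"

def heuristic_format(leader, f008, gmd_245h="", f300="", f538=""):
    fmt = coded_format(leader, f008)
    evidence = " ".join([gmd_245h, f300, f538]).lower()
    # Override: the coded format is not Video, but the descriptive
    # fields clearly describe a video carrier.
    video_terms = ("videorecording", "videodisc", "videocassette", "dvd", "vhs")
    if fmt != "Video" and any(t in evidence for t in video_terms):
        return "Video"
    return fmt
```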

Further, I have developed a set of custom routines that look more closely
at video items, one of which already extracts the runtime from the
008[18-20] field. Modifying it from its current form, which returns the
runtime in minutes, so that it instead returns HH:MM as specified in your
.xls file, and further handling the edge case of 008[18-20] = 000 by
returning "over 16:39", would literally take about 15 minutes.
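The conversion described here is simple enough to sketch (illustrative Python; treating non-numeric values as unknown and returning None is my assumption):

```python
# Runtime from 008/18-20 for visual materials: a three-digit number of
# minutes, where "000" means the runtime exceeds 999 minutes (16:39).

def runtime_hhmm(f008):
    raw = f008[18:21] if len(f008) >= 21 else ""
    if raw == "000":
        return "over 16:39"      # 999 minutes == 16 h 39 min
    if not raw.isdigit():
        return None              # unknown / not applicable (e.g. "nnn", blanks)
    minutes = int(raw)
    return "%d:%02d" % (minutes // 60, minutes % 60)
```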

Another of these custom routines, one that is more fully formed, is code
for extracting the director of a video from the MARC record. It examines
the contents of fields 245 $c, 508 $a, 500 $a, 505 $a and 505 $t,
employing heuristics and targeted natural language processing techniques
to attempt to correctly extract the director. At this point I believe it
achieves better results than a careful cataloger would, even one who
specializes in film and video.
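A much-simplified sketch of this kind of extraction, handling only the easy "directed by X" phrasings with a regular expression (the real routine's heuristics and NLP go well beyond this):

```python
import re

# Match "directed by <Name>" or "director, <Name>" where the name is a
# run of capitalized tokens. Statement-of-responsibility and note fields
# (245 $c, 508 $a, 500 $a, ...) are passed in as plain strings.
DIRECTOR_PAT = re.compile(
    r"(?:[Dd]irected\s+by|[Dd]irector[,:]?)\s+"
    r"([A-Z][\w.'-]+(?:\s+[A-Z][\w.'-]+)*)")

def extract_directors(*fields):
    """Collect unique director names found across the given field strings."""
    found = []
    for text in fields:
        for m in DIRECTOR_PAT.finditer(text or ""):
            name = m.group(1).strip()
            if name not in found:
                found.append(name)
    return found
```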

The other project I have just started investigating is an effort to
create and/or flesh out MARC records for video items based on
heuristically matching title, director and date against data returned
from publicly-accessible movie information sites.

This more recent work may not be relevant to your needs, but the custom
extraction routines seem directly applicable to your goals and may also
provide a template that makes your other goals more easily achievable.

-Robert Haschart


Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-12-01 Thread Kelley McGrath
I wanted to follow up on my previous post with a couple points.

1. This is probably too late for anybody thinking about applying, but I thought 
there may be some general interest. I have put up some more detailed 
specifications about what I am hoping to do at 
http://pages.uoregon.edu/kelleym/miw/. Data extraction overview.doc is the 
general overview and the other files contain supporting documents.

2. I replied some time ago to Heather's offer below about her website that will 
connect researchers with volunteer software developers. I have to admit that 
looking for volunteer software developers had not really occurred to me. 
However, I do have additional things that I would like to do for which I 
currently have no funding so if you would be interested in volunteering in the 
future, let me know.

Kelley
kell...@uoregon.edu


On Tue, Nov 12, 2013 at 6:33 PM, Heather Claxton 
claxt...@gmail.com wrote:
Hi Kelley,

I might be able to help in your search.   I'm in the process of starting a
website that connects academic researchers with volunteer software
developers.  I'm looking for people to post programming projects on the
website once it's launched in late January.   I realize that may be a
little late for you, but perhaps the project you mentioned in your PS
(clustering based on title, name, date etc.) would be perfect?  The
one caveat is that the website is targeting software developers who wish to
volunteer.   Anyway, if you're interested in posting, please send me an
e-mail at sciencesolved2...@gmail.com. I would greatly appreciate it.
Oh and of course it would be free to post  :)  Best of luck in your
hiring process,

Heather Claxton-Douglas


On Mon, Nov 11, 2013 at 9:58 PM, Kelley McGrath 
kell...@uoregon.edu wrote:

 I have a small amount of money to work with and am looking for two people
 to help with extracting data from MARC records as described below. This is
 part of a larger project to develop a FRBR-based data store and discovery
 interface for moving images. Our previous work includes a consideration of
 the feasibility of the project from a cataloging perspective (
 http://www.olacinc.org/drupal/?q=node/27), a prototype end-user interface
 (https://blazing-sunset-24.heroku.com/,
 https://blazing-sunset-24.heroku.com/page/about) and a web form to
 crowdsource the parsing of movie credits (
 http://olac-annotator.org/#/about).
 Planned work period: six months beginning around the second week of
 December (I can be somewhat flexible on the dates if you want to wait and
 start after the New Year)
 Payment: flat sum of $2500 upon completion of the work

 Required skills and knowledge:

   *   Familiarity with the MARC 21 bibliographic format
   *   Familiarity with Natural Language Processing concepts (or
 willingness to learn)
   *   Experience with Java, Python, and/or Ruby programming languages

 Description of work: Use language and text processing tools and provided
 strategies to write code to extract and normalize data in existing MARC
 bibliographic records for moving images. Refine code based on feedback from
 analysis of results obtained with a sample dataset.

 Data to be extracted:
 Tasks for Position 1:
 Titles (including the main title of the video, uniform titles, variant
 titles, series titles, television program titles and titles of contents)
 Authors and titles of related works on which an adaptation is based
 Duration
 Color
 Sound vs. silent
 Tasks for Position 2:
 Format (DVD, VHS, film, online, etc.)
 Original language
 Country of production
 Aspect ratio
 Flag for whether a record represents multiple works or not
 We have already done some work with dates, names and roles and have a
 framework to work in. I have the basic logic for the data extraction
 processes, but expect to need some iteration to refine these strategies.

 To apply please send me an email at kelleym@uoregon explaining why you
 are interested in this project, what relevant experience you would bring
 and any other reasons why I should hire you. If you have a preference for
 position 1 or 2, let me know (it's not necessary to have a preference). The
 deadline for applications is Monday, December 2, 2013. Let me know if you
 have any questions.

 Thank you for your consideration.

 Kelley

 PS In the near future, I will also be looking for someone to help with
 work clustering based on title, name, date and identifier data from MARC
 records. This will not involve any direct interaction with MARC.


 Kelley McGrath
 Metadata Management Librarian
 University of Oregon Libraries
 541-346-8232
 kell...@uoregon.edu



Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-11-12 Thread Edward Summers
Hi Kelley, 

Thanks for posting this. When I began work on jobs.code4lib.org I was hoping it 
would encourage people to post short-term contracts. The thought was that it 
might be easier for some institutions to find money for projects than for 
full-time staff, and that it could encourage more open source collaboration 
between organizations, similar to what the Hydra Project is doing.

So, I added your post to jobs.code4lib.org [1]. Ordinarily the person who 
publishes a job posting is the only one who can edit it. But if you would like 
to make any changes to it please let me know and I’ll make you the editor.

Incidentally I was curious about your decision to hire two programmers to do 
what appears to be a very similar task. Was your intent to have two 
implementations to compare to see which you liked better? Were the two 
developers supposed to work together or separately?

//Ed

[1] http://jobs.code4lib.org/job/10658/


Re: [CODE4LIB] Looking for two coders to help with discoverability of videos

2013-11-12 Thread Al Matthews
+1 for what I know of Avalon Media service

--
Al Matthews

Software Developer, Digital Services Unit
Atlanta University Center, Robert W. Woodruff Library
email: amatth...@auctr.edu; office: 1 404 978 2057






