Re: [CODE4LIB] Question re: ranking and FRBR

2006-04-11 Thread K.G. Schneider
> Although, at the same time, I think Google has taught us that our result
> set
> order doesn't have to be perfect.  It just has to be 'relatively accurate'
> and present enough information to let the user determine its relevance.

Do users actually "determine relevance" or do they have faith in Google to
provide the best results on the first results page?

Karen G. Schneider
[EMAIL PROTECTED]


Re: [CODE4LIB] Question re: ranking and FRBR

2006-04-11 Thread Ross Singer
Although, at the same time, I think Google has taught us that our result set
order doesn't have to be perfect.  It just has to be 'relatively accurate'
and present enough information to let the user determine its relevance.

I think depending on technology to 'solve this problem' makes things more
complicated than necessary.  Humans tend to be adaptable and (within
reason) fault tolerant.

-Ross.

On 4/11/06, Alexander Johannesen <[EMAIL PROTECTED]> wrote:
>
> On 4/12/06, Jonathan Rochkind <[EMAIL PROTECTED]> wrote:
> > If you are instead using a formula where an increased
> > number of records for a given work increases your ranking, all other
> > things being equal---I'm skeptical.
>
> Ditto; I think the "answer" to this is that there needs to be some
> serious pre-processing and analysis to come up with some real smarts
> for these searches. I don't think there is an easy way out
> once you've gone past the "ooh, shiny" stage of whatever context you
> bring the user, good or bad.
>
>
> Alex
> --
> "Ultimately, all things are known because you want to believe you know."
>  - Frank Herbert
> __ http://shelter.nu/ __
>
>


Re: [CODE4LIB] Question re: ranking and FRBR

2006-04-11 Thread Alexander Johannesen
On 4/12/06, Jonathan Rochkind <[EMAIL PROTECTED]> wrote:
> If you are instead using a formula where an increased
> number of records for a given work increases your ranking, all other
> things being equal---I'm skeptical.

Ditto; I think the "answer" to this is that there needs to be some
serious pre-processing and analysis to come up with some real smarts
for these searches. I don't think there is an easy way out
once you've gone past the "ooh, shiny" stage of whatever context you
bring the user, good or bad.


Alex
--
"Ultimately, all things are known because you want to believe you know."
 - Frank Herbert
__ http://shelter.nu/ __


Re: [CODE4LIB] Question re: ranking and FRBR

2006-04-11 Thread Jonathan Rochkind

To me, the big deal about Google's pagerank is often missed. Sure, to
some extent a link from page A to page B is a 'vote' for page B, and
I suppose a holdings count is roughly equivalent to that. But more
importantly, from my point of view, Google realized that the _link
text_ in the link from A to B was descriptive metadata about B. It
was a vote not just for B being "good", but for B being _about_ the
words contained in the incoming link text.

There's really no way to duplicate that with a library catalog. It's
an artifact of the nature of the web. But it's what Google's real
genius was---certainly Google's algorithm isn't always going to put
the web equivalent of the bible (Google.com itself? :) ) at the top
of all of your queries that contain that page in the result
set---only at the top of the queries whose text matches incoming
link text to that page (among many other things; this is an
oversimplification).  There is a lot going on in Google's relevancy
rankings in addition to just putting 'popular' pages on top, and most
of it is about trying to gauge the relevancy of the page to the
user's query, using techniques that may not be available in a
catalog as opposed to the web.
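
To make the anchor-text point concrete, here is a minimal Python sketch
(with made-up pages and link text, and in no way Google's actual
implementation) of indexing incoming link text as if it were descriptive
metadata about the target page:

# Minimal sketch: index incoming link text ("anchor text") as if it were
# metadata belonging to the *target* page, so a query can match a page
# on words that only appear in links pointing at it.
# (Hypothetical data; not Google's actual implementation.)
from collections import defaultdict

# (source_page, target_page, link_text) triples harvested from a crawl
links = [
    ("blog.example/post1", "libframework.example", "great FRBR toolkit"),
    ("wiki.example/tools", "libframework.example", "FRBR toolkit for catalogs"),
]

# anchor_index[target][term] = how many incoming links use that term
anchor_index = defaultdict(lambda: defaultdict(int))
for _source, target, text in links:
    for term in text.lower().split():
        anchor_index[target][term] += 1

def anchor_score(page, query):
    """Score a page by how often query terms appear in its incoming link text."""
    return sum(anchor_index[page][t] for t in query.lower().split())

print(anchor_score("libframework.example", "FRBR toolkit"))  # -> 4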

So I'd say, be careful what lesson you draw from Google.

It seems to me less than clear that creating a new edition of a work
is a 'vote' for it.  But worse than that, to some extent how many
records exist in our catalog that collate to the same FRBR 'work' is
an artifact of the cataloging rules. A given work might have been in
continuous publication since 1912 and have sold millions of copies,
but only have one record. Another work might have been published only
three years ago and sold tens of thousands of copies, but have
multiple records in the catalog because the publisher changed just
enough in a new 'edition' every year to trigger the creation of a new
record by catalogers attempting (successfully or not!) to follow
standards for when a new edition is 'different enough'  to justify a
new record. (Many college textbooks would end up like this, but most
libraries don't hold college textbooks). WorldCat, of course, can
contain multiple records for the _exact same_ edition, due to
cataloger error. This doesn't matter when you are just summing the
holdings count, as OCLC is, for a ranking---because whether it's one
record with 100 holdings or 10 records with 10 holdings each, your
total count is the same.  The formula "sum of holdings" isn't
affected by the number of records these holdings are distributed
amongst.  If you are instead using a formula where an increased
number of records for a given work increases your ranking, all other
things being equal---I'm skeptical.
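
A tiny illustration, with made-up numbers, of why the sum-of-holdings
score is indifferent to how the holdings are split across records within
a work set, while a record-count-based score is not:

# One work set where a single record carries all the holdings...
work_a = [100]
# ...and another where the same holdings are spread over ten duplicate records.
work_b = [10] * 10

assert sum(work_a) == sum(work_b) == 100  # the "sum of holdings" score is identical

# A record-count-based score, by contrast, would rank work_b ten times higher
# even though nothing about the work itself differs.
print(len(work_a), len(work_b))  # 1 10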

--Jonathan

At 2:48 PM -0400 4/11/06, Keith Jenkins wrote:

A very interesting discussion here... so I'll support its funding with
my own two cents.

I'd argue that search relevance is a product of two factors:
  A. The overall popularity of an item
  B. The appropriateness to a given query

Both are approximate measures with their own difficulties, but a good
search usually needs to focus on both (unless B is so restrictive that
we don't need A).

B is always going to be inhibited, to various degrees, by the limited
nature of the user's input--usually just a couple of words.  If a user
isn't very specific, then it is indeed quite difficult to determine
what would be most relevant to that user.  That's where A can really
help to sort a large number of results (although B can also help
sorting).  I think Thom makes a good point here:

On 4/10/06, Hickey,Thom <[EMAIL PROTECTED]> wrote:

 Actually, though, 'relevancy' ranking based on where terms occur in the
 record and how many times they occur is of minor help compared to some
 sort of popularity score.  WorldCat holdings work fairly well for that,
 as should circulation data.


In fact, it was this sort of "popularity score" logic that originally
enabled Google to provide a search engine far better than what was
possible using just term placement and frequency metrics for each
document.  Word frequency is probably useless for our short
bibliographic records that are often cataloged at differing levels of
completeness.  But I think it could still be useful to give more
weight to the title and primary author of a book.

The basic mechanism of Google's PageRank algorithm is this: a link
from page X to page Y is a vote by X for Y, and the number of votes
for Y determines the power of Y's vote for other pages.  We could
apply this to FRBR records, if we think of every FRBR relationship as
a two-way link.  In this way, all the items link to the
manifestations, which link to the expressions, which link to the
works.  All manner of derivative works would also be linked to the
original works.  So the most highly-related works get ranked the
highest.  (For the algorithmically-minded, I found the article "XRANK:
Ranked Keyword Search over XML Documents" helpful in understanding how
the PageRank algorithm can be applied to other situations:
http://www.cs.cornell.edu/~cbotev/XRank.pdf )  It would be interesting
to see how such an approach compares to a simple tally of "number of
versions".

Re: [CODE4LIB] Question re: ranking and FRBR

2006-04-11 Thread Alexander Johannesen
On 4/12/06, Keith Jenkins <[EMAIL PROTECTED]> wrote:
> I'd argue that search relevance is a product of two factors:
>   A. The overall popularity of an item
>   B. The appropriateness to a given query

I'd argue that people either know what they're after or they don't,
and as such you can prioritize A and B in any order. In a research
library, B might be more important than A, but in a public library
possibly the opposite.

With every search there is more than one topic of interest; clustering
often expands on these topics, but the problem is often that people
either want everything around the topic, or everything inside the
topic. Various ways to define these clusterspaces are actually very
interesting; we're experimenting with having both as distinct cluster
clouds allowing users to drill into the "outward" looking or "inward"
looking clusters of topics.

I think the main thing here is to remember that general searches are
easy and specific ones are hard, and so it is with finding out what
information sets to present to the user.


Alex
--
"Ultimately, all things are known because you want to believe you know."
 - Frank Herbert
__ http://shelter.nu/ __


Re: [CODE4LIB] Question re: ranking and FRBR

2006-04-11 Thread Colleen Whitney

Jonathan Rochkind wrote:


not the right approach. And yet...I wish I could explain why it seems as
though the clustering can tell us something.



Well, what is it you think the clustering can tell you something
_about_?  This is an interesting topic to me.

I'm not sure the clustering can tell you anything about relevance to
the user. I'm not seeing it. I mean, the number of items that are
members of a FRBR work set really just indicates how many 'versions'
(to be imprecise) of that work exist. But the number of 'versions' of
a work that exist doesn't really predict how likely that work (or any
of its versions) is to be of interest to a user, does it?  But maybe
you're thinking of something I'm missing; I'm curious what you're
thinking about.


Yes, that's exactly what I'm stuck on.  If "more important" or "more
popular" works tend to have more manifestations, then there might be
some signal as to probability of relevance in there.  Which could be
factored in (in some *small* way).  But I'm not sure whether/how one
would test that "if".  At the moment you have me convinced that it's a
red herring.


<...>
So many questions. But that's what makes it interesting. I am very
interested in checking out the system you end up with, Colleen, it
sounds interesting. If it's publicly internet accessible, please do
share it with us when there's something interesting to look at.


Here's a URL FWIW at this point in time---with a whole bunch of
caveats... FRBRization experiments are not reflected in there, the UI is
partially baked, etc., and things will still be changing for the
next few months.  When it's closer to baked, I'll send it out again.

http://recommend-dev.cdlib.org/xtf/search?style=melrec&brand=melrec


Re: [CODE4LIB] Question re: ranking and FRBR

2006-04-11 Thread Keith Jenkins
A very interesting discussion here... so I'll support its funding with
my own two cents.

I'd argue that search relevance is a product of two factors:
  A. The overall popularity of an item
  B. The appropriateness to a given query

Both are approximate measures with their own difficulties, but a good
search usually needs to focus on both (unless B is so restrictive that
we don't need A).

B is always going to be inhibited, to various degrees, by the limited
nature of the user's input--usually just a couple of words.  If a user
isn't very specific, then it is indeed quite difficult to determine
what would be most relevant to that user.  That's where A can really
help to sort a large number of results (although B can also help
sorting).  I think Thom makes a good point here:

On 4/10/06, Hickey,Thom <[EMAIL PROTECTED]> wrote:
> Actually, though, 'relevancy' ranking based on where terms occur in the
> record and how many times they occur is of minor help compared to some
> sort of popularity score.  WorldCat holdings work fairly well for that,
> as should circulation data.

In fact, it was this sort of "popularity score" logic that originally
enabled Google to provide a search engine far better than what was
possible using just term placement and frequency metrics for each
document.  Word frequency is probably useless for our short
bibliographic records that are often cataloged at differing levels of
completeness.  But I think it could still be useful to give more
weight to the title and primary author of a book.
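
As a hedged sketch of that field weighting, assuming a record is simply a
dictionary of fields and with boost values chosen arbitrarily for
illustration (they are not recommendations):

# Sketch of weighting matches in the title and primary author fields more
# heavily than matches elsewhere in a short bibliographic record.
# Field boosts are arbitrary placeholders chosen only for illustration.
FIELD_BOOSTS = {"title": 3.0, "author": 2.0, "notes": 1.0, "subjects": 1.0}

def field_weighted_score(record, query):
    """Sum, over query terms, the boost of every field the term appears in."""
    terms = query.lower().split()
    score = 0.0
    for field, boost in FIELD_BOOSTS.items():
        text = record.get(field, "").lower()
        score += boost * sum(text.count(term) for term in terms)
    return score

record = {
    "title": "Relevance ranking for library catalogs",
    "author": "Example, Ann",
    "notes": "Discusses ranking, holdings, and relevance.",
}
print(field_weighted_score(record, "relevance ranking"))  # -> 8.0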

The basic mechanism of Google's PageRank algorithm is this: a link
from page X to page Y is a vote by X for Y, and the number of votes
for Y determines the power of Y's vote for other pages.  We could
apply this to FRBR records, if we think of every FRBR relationship as
a two-way link.  In this way, all the items link to the
manifestations, which link to the expressions, which link to the
works.  All manner of derivative works would also be linked to the
original works.  So the most highly-related works get ranked the
highest.  (For the algorithmically-minded, I found the article "XRANK:
Ranked Keyword Search over XML Documents" helpful in understanding how
the PageRank algorithm can be applied to other situations:
http://www.cs.cornell.edu/~cbotev/XRank.pdf )  It would be interesting
to see how such an approach compares to a simple tally of "number of
versions".

-Keith


[CODE4LIB] METS Navigator release 1.0 beta

2006-04-11 Thread Riley, Jenn
This message is being posted to multiple lists; please excuse any duplicate 
messages you may receive.

The Indiana University Digital Library Program is pleased to announce the 
release of METS Navigator 1.0 Beta, a METS-based system for displaying and 
navigating sets of page images or other multi-part digital objects. More 
information, documentation, and downloads are available at 
.

A METS profile for documents that can be used with METS Navigator is in 
development, and will be released for comment shortly.

Jenn Riley



Jenn Riley
Metadata Librarian
Digital Library Program
Indiana University - Bloomington
Wells Library E170
(812) 856-5759
www.dlib.indiana.edu

Inquiring Librarian blog: www.inquiringlibrarian.blogspot.com


Re: [CODE4LIB] Question re: ranking and FRBR

2006-04-11 Thread Jonathan Rochkind

not the right approach. And yet...I wish I could explain why it seems as
though the clustering can tell us something.


Well, what is it you think the clustering can tell you something
_about_?  This is an interesting topic to me.

I'm not sure the clustering can tell you anything about relevance to
the user. I'm not seeing it. I mean, the number of items that are
members of a FRBR work set really just indicates how many 'versions'
(to be imprecise) of that work exist. But the number of 'versions' of
a work that exist doesn't really predict how likely that work (or any
of its versions) is to be of interest to a user, does it?  But maybe
you're thinking of something I'm missing; I'm curious what you're
thinking about.

I am a big fan of grouping items into FRBR work sets, however, for
other reasons (which may or may not be obvious). But exactly how this
should be done, and under what control of the user, is still an open
question to some extent (one that will only be answered after more
systems try it and experiment with it).

I wonder---if more than one item in the FRBR work set has an
especially high relevancy ranking (I don't know what would qualify as
'especially high')---whether more than one of those items should in
fact be "brought to the top" and highlighted on the first-level
display, instead of making the user "click through" to see them?  But
I'm assuming you hide members of a FRBR work grouping behind a single
heading on the first-level results---you may not do this, you may
just group them adjacently but put all of them on the first-level
result list. There are a million ways to do these things.

And, as Thom mentions, it's also something of an open question as to
what the right way to do relevancy rankings based on bib records is
anyway (or if there even is a right way).

So many questions. But that's what makes it interesting. I am very
interested in checking out the system you end up with, Colleen, it
sounds interesting. If it's publicly internet accessible, please do
share it with us when there's something interesting to look at.

--Jonathan




--Colleen

David Walker wrote:


The only tricky thing about this with WorldCat, though, is that you have
such a large mix of libraries.

In my own searching on WorldCat, I've noticed that a fair amount of
fiction and non-scholarly works appear near the top of results because
the public libraries are skewing the holdings of those titles.

Not a bad thing in itself, if that's what I'm looking for, but our
students are looking for scholarly works (and still learning to
distinguish scholarly from not), so it would be nice in our particular
context to limit only to academic libraries that own the title.

--Dave

=
David Walker
Web Development Librarian
Library, Cal State San Marcos
760-750-4379
http://public.csusm.edu/dwalker
=





-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Hickey,Thom
Sent: Monday, April 10, 2006 12:52 PM
To: CODE4LIB@listserv.nd.edu
Subject: Re: [CODE4LIB] Question re: ranking and FRBR

I'd agree with this.

Actually, though, 'relevancy' ranking based on where terms occur in the
record and how many times they occur is of minor help compared to some
sort of popularity score.  WorldCat holdings work fairly well for that,
as should circulation data.  The primary example of this sort of ranking
is the web search engines where ranking is based primarily on word
proximity and links.

--Th
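
For what it's worth, here is a minimal sketch of the kind of popularity
score described above, built from holdings plus circulation counts; the
log damping and the sample numbers are arbitrary choices for illustration:

import math

def popularity_score(holdings, circulation):
    """Log-damped popularity, so huge counts don't completely swamp the rest."""
    return math.log1p(holdings + circulation)

# Records that all match the query; rank them by popularity rather than by
# where or how often the query terms appear in these short bib records.
matches = [
    ("widely held classic", 5000, 1200),
    ("standard textbook", 800, 300),
    ("obscure conference proceedings", 12, 3),
]
ranked = sorted(matches, key=lambda m: popularity_score(m[1], m[2]), reverse=True)
for title, holdings, circ in ranked:
    print(f"{popularity_score(holdings, circ):5.2f}  {title}")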


-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Monday, April 10, 2006 3:16 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Question re: ranking and FRBR

When you are ranking on number of holdings like OCLC is, a straight
sum makes sense to me---the sum of all libraries holding copies of
any manifestation of the FRBR work is indeed the sum of the holdings
for all the records in the FRBR work set. Of course.

If you're doing relevancy rankings instead though, a straight sum
makes less sense. A relevancy ranking isn't really amenable to being
summed. The sum of the relevancy rankings for various
manifestations/expressions is probably not a valid indicator of
how relevant the work is to the user, right?  And if you did it this
way, it would tend to make the most _voluminous_ work always come out
first as the most 'relevant', which isn't quite right. This isn't
quite the same problem as OCLC's having the Bible come out on
top---since OCLC is ranking by holdings, it's exactly right to have
the Bible come out on top; the Bible is indeed surely one of the
(#1?) most held works, so it's quite right for it to be on top. But
the Bible isn't always going to be the most relevant result for a
user, just because it's the most voluminous!  Summing is going to
mess up your relevancy rankings.
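
A tiny sketch with made-up scores contrasting the two roll-ups (a straight
sum versus the per-work-set maximum that comes up just below):

# Two ways to roll per-record relevancy scores up to the FRBR work level.
# Summing rewards works that simply have many records; taking the maximum
# does not. The scores are invented for illustration.
work_sets = {
    # a work with many so-so matching records
    "voluminous_work": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    # a work with a single strongly matching record
    "focused_work": [0.9],
}

summed = {w: sum(scores) for w, scores in work_sets.items()}
maxed = {w: max(scores) for w, scores in work_sets.items()}

print(summed)  # {'voluminous_work': 3.0, 'focused_work': 0.9} -- sheer volume wins
print(maxed)   # {'voluminous_work': 0.5, 'focused_work': 0.9} -- the best match wins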

Just using the maximum relevancy ranking from the work set seems
acceptable to me--the work's relevancy to the user