On Sat, Aug 1, 2009 at 1:23 AM, Andrzej Bialecki<[email protected]> wrote:
> Grant Ingersoll wrote:
>>
>> OK, so how do we get this started?  Seems like there are a lot of
>> collections out there we could use.  Also, we can crawl.  Seems the tricky
>> part is getting judgments.
>
> I think we should establish first what kind of relevance judgments we want
> to collect:
These look like two different things.
One is deciding what we use to get "a" collection of documents - a
corpus. Creating a heterogeneous collection of documents such as
Wikipedia to kick off ORP seems like a very good idea to me. I guess
we do not need a huge collection of documents to get started, right?
@Grant: I might have missed something, but do we have a list of
available collections on some wiki page? It would be great to have
something like that.
Once we get this project going we can start building various
collections from all kinds of areas. I find it interesting that all
the collections I have seen are built from large documents, but with
the advent of mobile devices, collections could also be built from
"data records" like SMS messages, address records, image metadata and
audio metadata, where the text/document is relatively small. I have
found that searching such "small" documents puts different
requirements on scoring parameters than web search does...

The other thing is what we do with these collections. I rather like
the idea of having something like a webapp that can perform corpus
selection, distance measurement, etc.
I want to extend Andrzej's list and throw out some random thoughts...

- It would be nice to have something like an intermediate
representation of a corpus that can be plugged into a relevance
measurement app / webapp.

- Such a relevance measurement tool should be able to work on top of
custom search applications. There could be an API which gives
applications access to the corpus for indexing and lets them search
the corpus through that same API. I can imagine lots of use cases
where users want to judge their custom search engine against a corpus
and compare the results.
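
Purely as a sketch of what such an API could look like (every name
here is made up for illustration - nothing ORP-specific has been
decided, and I am using Python just to keep the example short):

```python
# Hypothetical sketch: the harness owns the shared corpus, a custom
# search application plugs in behind a tiny interface, and each run is
# stored as a <corpus, impl, query, results> tuple for later review.
from abc import ABC, abstractmethod


class SearchApplication(ABC):
    @abstractmethod
    def index(self, corpus):
        """Receive the shared corpus as {doc_id: text} for indexing."""

    @abstractmethod
    def search(self, query, k):
        """Return up to k doc_ids ranked for the query."""


class NaiveTermMatch(SearchApplication):
    """Toy implementation: rank documents by query-term overlap."""

    def index(self, corpus):
        self.corpus = {doc_id: set(text.lower().split())
                       for doc_id, text in corpus.items()}

    def search(self, query, k):
        terms = set(query.lower().split())
        scored = sorted(self.corpus,
                        key=lambda d: len(terms & self.corpus[d]),
                        reverse=True)
        # Drop documents with no matching term at all.
        return [d for d in scored if terms & self.corpus[d]][:k]


def run_and_record(corpus_name, corpus, impl, query, k=10):
    """Index, search, and store the tuple a review webapp would
    later present for relevance judging."""
    impl.index(corpus)
    results = impl.search(query, k)
    return (corpus_name, type(impl).__name__, query, results)
```

A reviewer UI could then load such tuples, let a human mark and
reorder the results, and save the reviewed counterpart alongside.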


simon (in the middle of moving his apartment)

>
> 1. given a corpus, and a query, define an ordered list of top-N documents
> that are relevant to the query. This is our baseline. Getting this sort of
> information is very time-consuming and subjective.
>
> 2. given a corpus, a query and a list of top-N results obtained from a real
> search, define what results are relevant and how they should be ordered. The
> reviewed list of top-N results becomes then the initial approximation of our
> baseline. Calculate a distance metric between real and reviewed result, and
> adjust ranking to maximize this metric.
>
> The second scenario could be handled by a webapp, which could present the
> following areas of functionality:
>
> * corpus selection and browsing
>
> * searching using selected search impl and its ranking parameters, and
> storing tuples of <corpus, impl, query, results>
>
> * review of the results (marking relevant / non-relevant, reordering), and
> saving of tuples <corpus, impl, query, reviewed results>
>
> * calculation of distance metrics.
>
> * adjustment of ranking parameters for a given search implementation.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
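
To make the "distance metric" part of Andrzej's scenario 2 concrete,
here is a minimal sketch (hypothetical helper names, not any existing
ORP code) of two obvious candidates: precision at k against the
judged-relevant set, and a normalized pairwise-disagreement
(Kendall-tau-style) distance between the real and reviewed orderings:

```python
def precision_at_k(results, relevant, k):
    """Fraction of the top-k results that were judged relevant."""
    return sum(1 for doc in results[:k] if doc in relevant) / k


def kendall_tau_distance(ranking_a, ranking_b):
    """Normalized count of pairwise order disagreements between two
    rankings (0.0 = identical order, 1.0 = fully reversed), computed
    over the documents both rankings share."""
    pos = {doc: i for i, doc in enumerate(ranking_b)}
    docs = [d for d in ranking_a if d in pos]
    n = len(docs)
    discordant = sum(1
                     for i in range(n)
                     for j in range(i + 1, n)
                     if pos[docs[i]] > pos[docs[j]])
    pairs = n * (n - 1) // 2
    return discordant / pairs if pairs else 0.0


# Toy data: engine output vs. the reviewed baseline ordering.
real = ["d3", "d1", "d2", "d5"]
reviewed = ["d1", "d2", "d3"]

p = precision_at_k(real, set(reviewed), 3)
tau = kendall_tau_distance(real, reviewed)
```

Tuning would then adjust the ranking parameters to drive such a
distance down (or an agreement score up) against the reviewed baseline.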
