On Sat, Aug 1, 2009 at 1:23 AM, Andrzej Bialecki<[email protected]> wrote:
> Grant Ingersoll wrote:
>>
>> OK, so how do we get this started? Seems like there are a lot of
>> collections out there we could use. Also, we can crawl. Seems the tricky
>> part is getting judgments.
>
> I think we should establish first what kind of relevance judgments we
> want to collect:

This looks like two different things. One thing is deciding what we use to
get "a" collection of documents - a corpus. It seems like a very good idea
to me to create a heterogeneous collection of documents such as wikipedia
to kick off ORP. I guess we do not need a huge collection of documents to
get started, right?!

@Grant: I might have missed something, but do we have a list of available
collections on some wiki page?! It would be great to have something like
that. Once we get this project going we can start building various
collections from all kinds of areas. I find it interesting that all the
collections I have seen are built from large documents, but with the advent
of mobile devices collections could also be built from "data-records" like
SMS, address records, image metadata and audio metadata, where the
text/document is relatively small. I have found that searching on such
"small" documents puts different requirements on scoring parameters than
web search does - a minimal sketch of what I mean follows below.
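To illustrate (just a sketch against the Lucene 2.x Similarity API;
flattening the length norm is only one of the knobs that behaves
differently when every record is tiny):

import org.apache.lucene.search.DefaultSimilarity;

/**
 * Sketch: the default length norm of 1/sqrt(numTerms) makes a two-term
 * SMS score very differently from a five-term one, which is rarely what
 * you want when all records are uniformly small.
 */
public class ShortRecordSimilarity extends DefaultSimilarity {
  @Override
  public float lengthNorm(String fieldName, int numTerms) {
    return 1.0f; // treat all records as equally "long" - no length penalty
  }
}

You would have to set it via setSimilarity() on both the IndexWriter and
the Searcher to keep index-time norms and query-time scoring consistent.
Whether this is actually the right knob is exactly the kind of question a
small-document corpus with judgments would let us answer.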
Another thing is what we do with these collections. I kind of like the idea
of having something like a webapp that is able to perform corpus selection,
distance measurement etc. I want to extend Andrzej's list by throwing out
some random thoughts...

- It would be nice to have something like an intermediate representation of
a corpus that can be plugged into a relevance measurement app / webapp.

- Such a relevance measurement should be able to work on top of custom
search applications. There could be an API which gives applications access
to the corpus for indexing and lets them search on this corpus through the
API. I can imagine lots of use cases where users want to judge their custom
search engine against a corpus and compare the results. (Rough sketches of
both this API and the distance metric follow at the end of this mail.)

simon (in the middle of moving his apartment)

> 1. given a corpus, and a query, define an ordered list of top-N documents
> that are relevant to the query. This is our baseline. Getting this sort
> of information is very time-consuming and subjective.
>
> 2. given a corpus, a query and a list of top-N results obtained from a
> real search, define what results are relevant and how they should be
> ordered. The reviewed list of top-N results then becomes the initial
> approximation of our baseline. Calculate a distance metric between the
> real and reviewed results, and adjust ranking to minimize this distance.
>
> The second scenario could be handled by a webapp, which could present the
> following areas of functionality:
>
> * corpus selection and browsing
>
> * searching using the selected search impl and its ranking parameters,
> and storing tuples of <corpus, impl, query, results>
>
> * review of the results (marking relevant / non-relevant, reordering),
> and saving of tuples <corpus, impl, query, reviewed results>
>
> * calculation of distance metrics
>
> * adjustment of ranking parameters for a given search implementation
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
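PS: to make two of the points above a bit more concrete. The corpus-access
API could start out as small as this (all names invented for illustration):

import java.util.List;

public interface RelevanceCorpus {
  /** stream raw document text so a custom search impl can index it */
  Iterable<String> documents();
  /** reviewed baseline ordering of doc ids for a query, once judgments exist */
  List<String> baseline(String query);
}

And for the "calculation of distance metrics" step, Kendall's tau distance
between the real ordering and the reviewed ordering would be one cheap
starting point - the webapp would then tune ranking parameters to minimize
it. A rough first cut:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankDistance {

  /**
   * Kendall's tau distance: the fraction of document pairs that the two
   * rankings put in a different order. 0.0 = identical ordering,
   * 1.0 = one list is the exact reverse of the other.
   *
   * @param real     doc ids in the order the search impl returned them
   * @param reviewed the same doc ids in the order the reviewer prefers
   */
  public static double kendallTau(List<String> real, List<String> reviewed) {
    Map<String, Integer> reviewedPos = new HashMap<String, Integer>();
    for (int i = 0; i < reviewed.size(); i++) {
      reviewedPos.put(reviewed.get(i), i);
    }
    int discordant = 0;
    int comparable = 0;
    for (int i = 0; i < real.size(); i++) {
      for (int j = i + 1; j < real.size(); j++) {
        Integer pi = reviewedPos.get(real.get(i));
        Integer pj = reviewedPos.get(real.get(j));
        if (pi == null || pj == null) {
          continue; // reviewer dropped this doc as non-relevant
        }
        comparable++;
        if (pi > pj) {
          discordant++; // the review swaps this pair
        }
      }
    }
    return comparable == 0 ? 0.0 : (double) discordant / comparable;
  }
}

Weighted variants that penalize swaps near the top of the list more heavily
would probably be the next step, since the top is what users actually see.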
