FYI
Begin forwarded message:
From: Richi Nayak <[email protected]>
Date: June 25, 2009 9:47:42 PM EDT
To: "[email protected]" <[email protected]>
Cc: Richi Nayak <[email protected]>
Subject: [SIG-IRList] CFP: INEX 2009 - Clustering Task for
collection Selection
This is a call for participation in the XML Clustering Task at INEX
2009. The INEX 2009 clustering task is an evaluation forum that
provides a platform for measuring the performance of clustering
methods for collection selection on a large-scale test collection
(consisting of a set of documents, their labels, a set of information
needs (queries), and the answers to those information needs).
In the last decade, we have observed a proliferation of approaches
for clustering XML documents based on their structure and content.
There have been many approaches developed for diverse application
domains. Many applications require data objects to be grouped by
similarity of content, tags, paths, structure and semantics.
The clustering task in INEX 2009 evaluates unsupervised machine
learning in the context of XML information retrieval. This year we
are running a novel evaluation task using manual query assessments
from the INEX Ad Hoc track. The clustering track will explicitly
test the Jardine and van Rijsbergen cluster hypothesis (1971), which
states that documents that cluster together have a similar relevance
to a given query. The task is to split the English Wikipedia
collection, which is 60 gigabytes in size and contains around 2.7
million documents in XML format, into disjoint clusters for
collection selection. If
the cluster hypothesis holds true, and if suitable clustering can be
achieved, then a clustering solution will minimise the number of
clusters that need to be searched to satisfy any given query. There
are important practical reasons for performing collection selection
on a very large corpus. If only a small fraction of clusters (hence
documents) need to be searched, then the throughput of an
information retrieval system will be greatly improved.
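As a rough illustration of the collection selection goal described
above, the sketch below counts how many clusters would have to be
searched to find the relevant documents for one query. It is not the
track's official measure: the function name, the toy document ids and
the "oracle" ordering (clusters visited in decreasing order of
relevant-document count) are all assumptions made for this example.

```python
from collections import Counter


def clusters_to_search(assignment, relevant_docs, recall_target=1.0):
    """Count clusters needed to reach a recall target for one query.

    assignment maps a document id to its cluster id (disjoint clusters).
    Clusters are visited in decreasing order of how many relevant
    documents they hold, which is an idealised (oracle) ordering.
    """
    per_cluster = Counter(
        assignment[doc] for doc in relevant_docs if doc in assignment
    )
    total = sum(per_cluster.values())
    if total == 0:
        return 0
    needed = recall_target * total
    found = 0
    for searched, (_, count) in enumerate(per_cluster.most_common(), start=1):
        found += count
        if found >= needed:
            return searched
    return len(per_cluster)


# Hypothetical toy data: six documents, three of them relevant to a query.
relevant = {"d1", "d2", "d3"}
grouped = {"d1": 0, "d2": 0, "d3": 1, "d4": 2, "d5": 2, "d6": 2}
scattered = {"d1": 0, "d2": 1, "d3": 2, "d4": 0, "d5": 1, "d6": 2}

print(clusters_to_search(grouped, relevant))
print(clusters_to_search(scattered, relevant))
```

Under these assumptions, a clustering that keeps relevant documents
together needs fewer clusters searched than one that scatters them,
which is the behaviour the cluster hypothesis predicts.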
The INEX XML Wikipedia collection is a marked-up version of the
Wikipedia documents. The mark-up includes, for instance, explicit
tagging of named entities. To enable participation with minimal
data-preparation overhead, the collection has been pre-processed to
provide various representations of the documents, for instance a
bag-of-words representation of terms and frequent phrases in a
document, and frequencies of various XML structures in the form of
trees, links, named entities, etc. These collection representations
will be released by the end of this month. In addition, the entire
document collection is available in XML format and in text-only
format if you wish to try different representation approaches. For
teams that are unable to process such a large data collection, a
subset of the INEX 2009 corpus containing about 50,000 documents will
also be provided for clustering.
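For readers who want to build their own representation from the XML
files, here is a minimal sketch of a bag-of-words and a tag-frequency
count for a single document. The format of the released
representations is not described in this announcement, so the function
name, the tokenisation rule and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def represent(xml_text):
    """Return (term counts, tag counts) for one XML document."""
    root = ET.fromstring(xml_text)
    # Structure: how often each element name occurs in the document.
    tag_counts = Counter(element.tag for element in root.iter())
    # Content: lower-cased alphanumeric tokens from all text nodes.
    text = " ".join(root.itertext())
    term_counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return term_counts, tag_counts


# Hypothetical document, not taken from the INEX collection.
sample = (
    "<article><title>XML clustering</title>"
    "<p>Clustering XML documents</p></article>"
)
terms, tags = represent(sample)
print(terms)
print(tags)
```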
The clustering solutions will be evaluated in two ways. Firstly,
each clustering solution will be evaluated using standard criteria
such as purity, entropy and F-score to determine the quality of the
clusters. These evaluation results will be provided online on an
ongoing basis, along the same lines as Netflix, starting from
mid-September. Secondly, the clustering solutions will be evaluated
to determine the quality of the clusters relative to the optimal
collection selection goal, given a set of queries. Better
clustering solutions in this context will tend to (on average) group
together relevant results for (previously unseen) ad-hoc queries.
Real ad hoc retrieval queries and their manual assessment results
will be utilised in this evaluation. This novel approach evaluates
the clustering solutions relative to a very specific objective -
clustering a large document collection in an optimal manner in order
to satisfy queries while minimising the search space. Results of the
second evaluation will be released at the INEX workshop in December.
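The first kind of evaluation can be sketched with the common textbook
definitions of purity and size-weighted entropy. The announcement does
not give the exact formulas the track will use, so the definitions
below, the single-label-per-document assumption and the toy data are
assumptions for illustration only; F-score is left out for brevity.

```python
import math
from collections import Counter, defaultdict


def _label_counts(clusters, labels):
    """Map each cluster id to a Counter of the labels it contains."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    return counts


def purity(clusters, labels):
    """Fraction of documents that carry their cluster's majority label."""
    counts = _label_counts(clusters, labels)
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of per-cluster label entropy, in bits."""
    counts = _label_counts(clusters, labels)
    n = len(labels)
    total = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((k / size) * math.log2(k / size) for k in c.values())
        total += (size / n) * h
    return total


# Hypothetical toy data: six documents, two clusters, two labels.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, labels))
print(entropy(clusters, labels))
```

With these definitions, higher purity and lower entropy both indicate
clusters that agree more closely with the document labels.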
The clustering task in INEX 2009 brings together researchers from the
Information Retrieval, Data Mining, Machine Learning and XML fields.
It allows participants to evaluate clustering methods against a
real use case and with significant volumes of data. The task is
designed to facilitate participation with minimal effort by
providing not only raw data, but also pre-processed data which can
be easily used by existing clustering software.
Dr Richi Nayak, School of Information Technology,
Queensland University of Technology, Brisbane, QLD 4001
Office: GP S537 Phone: 3138 1976
Email: [email protected]
http://sky.scitech.qut.edu.au/~nayak/
************************************************
This SIGIR-IRList message and the SIG-IRList Digest (a moderated IR
newsletter), are brought to you by SIGIR, distributed from the
University of Sheffield and edited by Raman Chandrasekar ([email protected]).
o To submit an article, e-mail [email protected]
o To subscribe, send mail to [email protected], with the
subject: SUBSCRIBE irlist firstname lastname
o To unsubscribe, send mail to [email protected], with the
subject: UNSUBSCRIBE irlist email
[The email address is required only if you want to unsubscribe with
an address other than the address with which you send the message]
o For more info, visit: http://www.sigir.org/sigirlist/
o Subscribe to a feed of these messages at
http://searchtextmining.spaces.live.com/feed.rss
These files are not to be sold or used for commercial purposes.
THE OPINIONS EXPRESSED WITHIN THIS DOCUMENT DO NOT REPRESENT THOSE
OF THE EDITOR, MICROSOFT CORPORATION OR THE UNIVERSITY OF SHEFFIELD.
AUTHORS ASSUME FULL RESPONSIBILITY FOR THEIR MATERIAL.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search