Hi,
Maybe some people will find this posting interesting.
From my POV, web spam is one of the biggest issues for Nutch in
whole-web crawls.
Greetings,
Stefan
During AIRWeb'06 we announced the availability of the collection.
We are currently planning a Web Spam challenge based on the dataset we
have built. I assume most of you will be interested in this, so I have
moved the "webspam-volunteers" list to "webspam-announces". If you do
not want to be in this new "webspam-announces" list, please send me an
e-mail.
This was shown during AIRWeb in Seattle:
.............................................................
Web Spam Collection Available
August 10th, 2006
We are pleased to announce the availability of a public collection for
research on Web spam. This collection is the result of efforts by a
team of volunteers:
Thiago Alves      Antonio Gulli             Tamas Sarlos
Luca Becchetti    Zoltan Gyongyi            Mike Thelwall
Paolo Boldi       Thomas Lavergn            Belle Tseng
Paul Chirita      Alex Ntoulas              Tanguy Urvoy
Mirel Cosulschi   Josiane-Xavier Parreira   Wenzhong Zhao
Brian Davison     Xiaoguang Qi
Pascal Filoche    Massimo Santini
The corpus is a large set of Web pages in 11,000 .uk hosts
downloaded in May 2006 by the Laboratory of Web Algorithmics,
Università degli Studi di Milano. The labelling process was
coordinated by Carlos Castillo, working at the Algorithmic Engineering
group at Università di Roma "La Sapienza". The project was funded
by the DELIS project (Dynamically Evolving, Large Scale Information
Systems).
Volunteers were provided with a set of guidelines and were asked to
mark a set of hosts as either normal, spam, or borderline. The
collection includes about 6,700 judgments made by the volunteers and
can be used for testing link-based and content-based Web spam
detection and demotion techniques.
More information is available on our Web page, including the
guidelines given to the human judges, the instructions for obtaining
the links and contents of the pages in this collection, and the
contact information for questions and comments.
http://aeserver.dis.uniroma1.it/webspam/
If you use this data set, please subscribe to our mailing list by
sending an e-mail to [EMAIL PROTECTED]
--
Carlos Castillo
Universita di Roma "La Sapienza"
Rome, ITALY