DBpedia 3.8 released, including enlarged Ontology and additional localized Versions

2012-08-06 Thread Chris Bizer
• To save file system space, the framework can compress DBpedia triple files
while writing and decompress Wikipedia XML dump files while reading
• Using some bit twiddling, we can now load all ~200 million inter-language
links into a few GB of RAM and analyze them
• Users can download the ontology and mappings from the mappings wiki and
store them in files to avoid downloading them for each extraction, which
takes a lot of time and makes extraction results less reproducible
• We now use IRIs for all languages except English, which uses URIs for
backwards compatibility
• We now resolve redirects in all datasets where the object URIs are
DBpedia resources
• We check that extracted dates are valid (e.g. February never has 30 days)
and that their format matches their XML Schema type, e.g. xsd:gYearMonth
(see the sketch after this list)
• We improved the removal of HTML character references from the abstracts
• When extracting raw infobox properties, we make sure that the predicate
URI can be used in RDF/XML by appending an underscore if necessary
• The Page IDs and Revision IDs datasets now use the DBpedia resource as the
subject URI, not the Wikipedia page URL
• We use foaf:isPrimaryTopicOf instead of foaf:page for the link from the
DBpedia resource to the Wikipedia page
• New inter-language link datasets for all languages
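
The date check mentioned above can be sketched in a few lines of Python.
This is not the framework's actual code (the DBpedia extraction framework is
written in Scala), and the lexical patterns below are simplified assumptions:

import re
from datetime import date

# Simplified lexical patterns for a few XSD date types (assumption: no
# timezones or negative years, unlike the full XML Schema definitions).
XSD_PATTERNS = {
    "xsd:date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "xsd:gYearMonth": re.compile(r"^\d{4}-\d{2}$"),
    "xsd:gYear": re.compile(r"^\d{4}$"),
}

def is_valid_date_literal(lexical_form, datatype):
    pattern = XSD_PATTERNS.get(datatype)
    if pattern is None or not pattern.match(lexical_form):
        return False
    if datatype == "xsd:date":
        year, month, day = (int(p) for p in lexical_form.split("-"))
        try:
            date(year, month, day)  # raises ValueError for e.g. 2012-02-30
        except ValueError:
            return False
    return True

print(is_valid_date_literal("2012-02-30", "xsd:date"))     # False
print(is_valid_date_literal("2012-08", "xsd:gYearMonth"))  # True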



Accessing the DBpedia 3.8 Release

You can download the new DBpedia dataset from
http://dbpedia.org/Downloads38.

As usual, the dataset is also available as Linked Data and via the DBpedia
SPARQL endpoint at http://dbpedia.org/sparql
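
The endpoint can also be queried programmatically. A small sketch in Python,
assuming the SPARQLWrapper package is installed (any SPARQL client works
equally well); the query is only illustrative and its results depend on the
loaded release:

from SPARQLWrapper import SPARQLWrapper, JSON

# Ask the public endpoint for a few cities and their population counts,
# using a class and a property from the DBpedia ontology.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?city ?population WHERE {
        ?city a dbo:City ;
              dbo:populationTotal ?population .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["city"]["value"], binding["population"]["value"])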


Credits

Lots of thanks to

• Jona Christopher Sahnwaldt (Freie Universität Berlin, Germany) for
improving the DBpedia extraction framework and for extracting the DBpedia
3.8 data sets.
• Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece) for
implementing the language generalizations in the extraction framework.
• Uli Zellbeck and Anja Jentzsch (Freie Universität Berlin, Germany) for
generating the new and updated RDF links to external datasets using the Silk
interlinking framework.
• Jonas Brekle (Universität Leipzig, Germany) and Sebastian Hellmann
(Universität Leipzig, Germany) for their work on the new Wiktionary2RDF
extractor.
• All editors that contributed to the DBpedia ontology mappings via the
Mappings Wiki.
• The whole Internationalization Committee for pushing the DBpedia
internationalization forward.
• Kingsley Idehen and Patrick van Kleef (both OpenLink Software) for loading
the dataset into the Virtuoso instance that serves the Linked Data view and
SPARQL endpoint, and OpenLink Software (http://www.openlinksw.com/) as a
whole for providing the server infrastructure for DBpedia.
The work on the DBpedia 3.8 release was financially supported by the
European Commission through the projects LOD2 - Creating Knowledge out of
Interlinked Data (http://lod2.eu/, improvements to the extraction framework) and
LATC - LOD Around the Clock (http://latc-project.eu/, creation of external
RDF links).


More information about DBpedia is found at http://dbpedia.org/About


Have fun with the new DBpedia release!

Cheers,

Chris Bizer





Open Positions: 1 Postdoc Researcher and 2 PhD Students at Web-based Systems Group / University of Mannheim

2012-06-27 Thread Chris Bizer
Hi all,

I'm looking for a postdoc researcher and two PhD students to join the
Web-based Systems Group at the University of Mannheim. Please find the job
postings below as well as at

http://wifo5.informatik.uni-mannheim.de/de/lehrstuhl/offene-stellen

The Web-based Systems Group is part of the new research focus area on Data-
and Web-Science that is currently being established at the School of
Business Informatics and Mathematics of the University of Mannheim. For more
information about the focus area see

http://midas.informatik.uni-mannheim.de/


1. PostDoc Researcher (Akademische Rätin/Akademischer Rat)

The future holder of this position should have a proven academic record in
one or several of the following areas:

+ Web Data Integration
+ Web Mining
+ Linked Data and Semantic Web Technologies
+ Data Quality Assessment

The researcher is expected to

+ carry out research in the thematic context of the group 
+ support the development and organization of the research group 
+ contribute to the open data publishing and open source software projects
of the group
+ contribute to the teaching program of the group on master- and
bachelor-level 

We expect candidates to have a PhD degree in computer science or a related
discipline, to have good programming skills, and to be fluent in English. The
position should ideally be filled by 1 September 2012. The position is
temporary, with an initial duration of 3 years and the option of a 3-year
extension after the successful completion of the first period. The position
is paid according to German civil servants' standards (TV-L E14).


2. Graduate Research Associate (Wissenschaftliche Mitarbeiterin /
Wissenschaftlicher Mitarbeiter)

The graduate research associate is expected to

+ carry out research in the thematic context of the group and present the
result at international conferences
+ write a PhD on a topic connected to the focus of the research group
+ contribute to the open data publishing and open source software projects
of the group
+ support the teaching activities of the group on master- and bachelor-level

Experience in scientific work as well as proven knowledge in one or several
of the following areas is an advantage:

+ Web Data Integration
+ Web Mining
+ Linked Data and Semantic Web Technologies
+ Data Quality Assessment

We expect candidates to have a university degree in computer science or a
related discipline, good programming skills, as well as good knowledge of
English in reading and writing. The position should ideally be filled by
1 September 2012. The position is temporary, with an initial duration of
2 years and the option of a 3-year extension after the successful completion
of the first period. The position is paid according to German civil
servants' standards (TV-L E13).

3. Graduate Research Associate (Wissenschaftliche Mitarbeiterin /
Wissenschaftlicher Mitarbeiter) – Project DM2E

The graduate research associate will be one of two researchers working on
the DM2E - Digitized Manuscripts to Europeana project. The project extends
the digital library Europeana with a Linked Data space for the flexible and
open collaboration of Digital Humanities researchers.

The graduate research associate is expected to

+ take a leading role in the development of the DM2E interoperability
infrastructure, including RDFization, schema mapping, identity resolution and
interlinking
+ have proven skills in XML- and Linked Data-technologies as well as Java or
Scala programming
+ write a PhD on a topic connected with the project or other topics of the
research group

We expect candidates to have a university degree in computer science or a
related discipline as well as good knowledge of English in reading and
writing. The position should be filled by 1 September 2012 and is available
until the end of the project in February 2015. The position is paid
according to German civil servants' standards (TV-L E13).

Work Environment

The Web-based Systems Group explores technical and economic questions
concerning the development of global, decentralized information
environments.  Our current research focuses are:

+ The evolution of the World Wide Web from a medium for the publication of
textual documents into a medium for sharing structured data, as well as the
role of Linked Data technologies within this transition.
+ The shift from classic data integration architectures to Enterprise Data
Spaces which provide for the flexible, pay-as-you-go integration of large
numbers of internal and external data sources.

The group has a strong focus on making research results accessible as open
data on the Web as well as in the form of open source software. Projects
that have been initiated by the group include DBpedia, Web Data Commons,
the W3C Linking Open Data effort, the LDIF – Linked Data Integration
Framework, and the Silk – Link Discovery Framework.

The Web-based Systems Group is part of the Focus Area on Data- and
Web-Science that is currently being established at the School of Business
Informatics and Mathematics of the University of Mannheim.

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites

2012-04-18 Thread Chris Bizer
Hi Martin and Peter,

cc'ing Ahad and Lisa from CommonCrawl 

> Hi Chris,
>
> Thanks for your e-mail. 
>
>> we clearly say on the WebDataCommons website as well as in the 
>> announcement that we are extracting data from 1.4 billion web pages only.
>> 
>> The Web is obviously much larger. Thus it is also obvious that we 
>> don't have all data in our dataset.
>
> It's not about the fact that you are using a subset of the Web, but that
> that subset is likely an unsuited sample from the population for many of
> the conclusions you derive, in particular speaking about the data Web.

Drawing conclusions from a sample is of course always questionable, and
obviously it would be better if there were a public 10 billion or 50 billion
page crawl available that we could analyze. But up till now such a crawl
does not exist. Thus, analyzing what we have is as good as we can currently
get based on publicly accessible corpora.

In order to have a second source of evidence, I asked Peter to derive
statistics from (a subset of?) the Yahoo!/Bing crawl and he was so nice to
also provide these statistics for LDOW:

http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-1.pdf

His sample is bigger (3.4 billion pages gathered using a different crawling
strategy) and you can clearly see from the results that the crawling
strategy highly influences the results.

Up till now, Peter's statistics don't contain counts for specific classes.
Having them and comparing them to the WebDataCommons statistics would of
course be very interesting.

Peter: Do you see any chance that you could still generate instance-per-class
counts once you are back from WWW2012?

>> I agree with you that a crawler that would especially look for data 
>> would use a different crawling strategy.
>
> I (and likely many others) understood from your marketing and your slides
> that you were actually looking for data, and the core of my comments
> regarding webdatacommons.org was that the approach taken has a fundamental
> problem of reaching the data due to the inappropriate filter by pagerank in
> the underlying CommonCrawl corpus.
>
> As for providing seed URLs: The problem is that many sites will have data
> markup ONLY in the deep pages, so if they are not included in your data,
> you will not even know whether it pays off to try a particular site.

As far as I understood from an earlier email from Ahad, page rank is not the
only factor that the CC crawler uses for deciding on how deep to dig into a
specific website.

Ahad and Lisa: There is currently a discussion on some Semantic Web mailing
lists about what pages are likely to be included into the CommonCrawl. See:
http://lists.w3.org/Archives/Public/public-lod/2012Apr/thread.html

In order to clear things up, would it be possible for you to give us some
more information about the CC crawling strategy and the factors that
determine how many pages are crawled per website?

>> Thus if you don't like the CommonCrawl crawling strategy, you are
>> highly invited to change the ranking algorithm in any way you like,
>> dig deeper into the websites that we identified and publish the resulting
>> data.
>
> I have clearly articulated that I think both CommonCrawl and
> WebDataCommons are in principle nice pieces of work.
>
> The only thing I did not like is that you do not discuss the limitations
> of your analysis, neither in the paper nor on the slides, which leads to
> many people drawing the wrong conclusions from your findings or even
> investing time and money into doing something directly on that data, which
> cannot work.

We will mention these limitations in further publications and presentations
of the WDC statistics.

>> This would be a really useful service to the community in addition to 
>> criticizing other people's work.
>
> Criticizing other people's work is the daily business of scientific
> advancement and, while maybe unpleasant to the recipient, indeed a useful
> service to the community.
> But I think you know that.

Sure, that's why I wrote "in addition to".

Cheers,

Chris


> Martin


On Apr 18, 2012, at 12:11 AM, Chris Bizer wrote:

> Hi Martin,
> 
> we clearly say on the WebDataCommons website as well as in the 
> announcement that we are extracting data from 1.4 billion web pages only.
> 
> The Web is obviously much larger. Thus it is also obvious that we 
> don't have all data in our dataset.
> 
> See 
> http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0093.html for
> the original announcement.
> 
> Quote from the announcement:
> 
> "We hope that Web Data Commons will be useful to the community by:
> 
> + easing the access to Mircodata, Mircoformat and RDFa data, as you do 
> + not
> need to crawl th

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites

2012-04-17 Thread Chris Bizer
Hi Martin,

we clearly say on the WebDataCommons website as well as in the announcement
that we are extracting data from 1.4 billion web pages only. 

The Web is obviously much larger. Thus it is also obvious that we don't have
all data in our dataset.

See http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0093.html for
the original announcement.

Quote from the announcement:

"We hope that Web Data Commons will be useful to the community by:

+ easing the access to Microdata, Microformat and RDFa data, as you do not
need to crawl the Web yourself anymore in order to get access to a fair
portion of the structured data that is currently available on the Web.

+ laying the foundation for the more detailed analysis of the deployment of
the different technologies.

+ providing seed URLs for focused Web crawls that dig deeper into the
websites that offer a specific type of data."

Please notice the words "fair portion", "more detailed analysis" and "seed
URLs for focused Web crawls".

I agree with you that a crawler that would especially look for data would
use a different crawling strategy.

The source code of the CommonCrawl crawler as well as the WebDataCommons
extraction code is available online under open licenses.

Thus if you don't like the CommonCrawl crawling strategy, you are highly
invited to change the ranking algorithm in any way you like, dig deeper into
the websites that we identified and publish the resulting data. 

This would be a really useful service to the community in addition to
criticizing other people's work.

Cheers,

Chris


-----Original Message-----
From: Martin Hepp [mailto:martin.h...@unibw.de] 
Sent: Tuesday, 17 April 2012 15:26
To: public-voc...@w3.org Vocabularies; public-lod@w3.org; Chris Bizer
Subject: Re: ANN: WebDataCommons.org - Offering 3.2 billion quads of current
RDFa, Microdata and Microformat data extracted from 65.4 million websites

Dear Chris, all,

while reading the paper [1], I think I found a possible explanation for why
WebDataCommons.org does not fulfill the high expectations regarding
completeness and coverage.

It seems that CommonCrawl filters pages by Pagerank in order to determine
the feasible subset of URIs for the crawl. While this may be okay for a
generic Web crawl, for linguistic purposes, or for training
machine-learning components, it is a dead end if you want to extract
structured data, since the interesting markup typically resides in the *deep
links* of dynamic Web applications, e.g. the product item pages in shops,
the individual event pages in ticket systems, etc.

Those pages often have a very low Pagerank, even when they are part of very
prestigious Web sites with a high Pagerank for the main landing page.

Example:

1. Main page:   http://www.wayfair.com/ 
--> Pagerank 5 of 10

2. Category page:   http://www.wayfair.com/Lighting-C77859.html
--> Pagerank 3 of 10

3. Item page:
http://www.wayfair.com/Golden-Lighting-Cerchi-Flush-Mount-in-Chrome-1030-FM-
CH-GNL1849.html
--> Pagerank of 0 / 10

Now, the RDFa on this site is in the 2 million item pages only. Filtering
out the deep links in the original crawl means you are removing the HTML
that contains the actual data.

In your paper [1], you kind of downplay that limitation by saying that this
approach yielded "snapshots of the popular part of the web". I think
"popular" is very misleading here, because Pagerank does not work very well
for the "deep" Web: those pages typically lack external links almost
completely, and due to their huge number per site, they earn only a minimal
Pagerank from their main site, which provides the link or links.

So, once again, I think your approach is NOT suitable for yielding a corpus
of usable data at Web scale, and the statistics you derive are likely very
much skewed, because you look only at landing pages and popular overview
pages of sites, while the real data is in HTML pages not contained in the
basic crawl.

Please interpret your findings in the light of these limitations. I am
saying this so strongly because I have already seen many tweets celebrating
the paper as "now we have the definitive statistics on structured data on
the Web".


Best wishes

Martin

Note: For estimating the Pagerank in this example, I used the online-service
[2], which may provide only an approximation.


[1] http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-2.pdf

[2] http://www.prchecker.info/check_page_rank.php


martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  h...@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax: +49-(0)89-6004-4620
www: http://www.unibw.de/ebusiness/ (group)
 http://www.heppnetz.de/ (personal)
skype:   mfhepp 
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web

ANN: WebDataCommons.org - Offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites

2012-03-22 Thread Chris Bizer
Hi all,

 

we are happy to announce WebDataCommons.org, a joint project of Freie
Universität Berlin and the Karlsruhe Institute of Technology to extract all
Microformat, Microdata and RDFa data from the Common Crawl web corpus, the
largest and most up-to-date web corpus that is currently available to the
public.

 

WebDataCommons.org provides the extracted data for download in the form of
RDF-quads. In addition, we produce basic statistics about the extracted
data.
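
A downloaded file can be loaded with any N-Quads-aware RDF library. A
minimal Python sketch, assuming the rdflib package is installed (the file
name below is a placeholder, not an actual download):

from rdflib import ConjunctiveGraph

# Load one N-Quads file and count the quads and the source documents
# (the graph names) it contains.
g = ConjunctiveGraph()
g.parse("wdc-sample.nq", format="nquads")
print(len(g), "quads from", len(list(g.contexts())), "source documents")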

 

Up till now, we have extracted data from two Common Crawl web corpora: One
corpus consisting of 2.5 billion HTML pages dating from 2009/2010 and a
second corpus consisting of 1.4 billion HTML pages dating from February
2012.

 

The 2009/2010 extraction resulted in 5.1 billion RDF quads which describe
1.5 billion entities and originate from 19.1 million websites.

The February 2012 extraction resulted in 3.2 billion RDF quads which
describe 1.2 billion entities and originate from 65.4 million websites.

 

More detailed statistics about the distribution of formats, entities and
websites serving structured data, as well as the growth between 2009/2010
and 2012, are provided on the project website:

 

http://webdatacommons.org/

 

It is interesting to see from the statistics that RDFa and Microdata
deployment has grown a lot over the last few years, but that Microformat
data still makes up the majority of the structured data that is embedded
into HTML pages (when looking at the number of quads as well as the number
of websites).

 

We hope that Web Data Commons will be useful to the community by:

+ easing the access to Microdata, Microformat and RDFa data, as you do not
need to crawl the Web yourself anymore in order to get access to a fair
portion of the structured data that is currently available on the Web.

+ laying the foundation for the more detailed analysis of the deployment of
the different technologies.

+ providing seed URLs for focused Web crawls that dig deeper into the
websites that offer a specific type of data.

 

Web Data Commons is a joint effort of Christian Bizer and Hannes Mühleisen
(Web-based Systems Group at Freie Universität Berlin) and Andreas Harth and
Steffen Stadtmüller (Institute AIFB at the Karlsruhe Institute of
Technology).

 

Lots of thanks to:

+ the Common Crawl project for providing their great web crawl and thus
enabling the Web Data Commons project.

+ the Any23 project for providing their great library of structured data
parsers.

+ the PlanetData and the LOD2 EU research projects which supported the
extraction.

 

For the future, we plan to update the extracted datasets on a regular basis
as new Common Crawl corpora become available. We also plan to provide the
extracted data in the form of CSV tables for common entity types
(e.g. product, organization, location, ...) in order to make it easier to
mine the data.

 

Cheers,

 

Christian Bizer, Hannes Mühleisen, Andreas Harth and Steffen Stadtmüller

 

 

--

Prof. Dr. Christian Bizer

Web-based Systems Group

Freie Universität Berlin

+49 30 838 55509

  http://www.bizer.de

  ch...@bizer.de

 



ANN: LDIF - Linked Data Integration Framework Version 0.4 Scale-Out released.

2012-01-16 Thread Chris Bizer
Hi all,

 

the Web-based Systems Group and our industry partner mes|semantics are
happy to announce the release of the LDIF – Linked Data Integration
Framework Version 0.4 Scale-Out.

 

LDIF can be used within Linked Data applications to translate heterogeneous
data from the Web of Linked Data into a clean local target representation
while keeping track of data provenance. LDIF translates data from the Web
into a consistent target vocabulary. LDIF includes an identity resolution
component which translates URI aliases into a single target URI.
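
To illustrate what translating URI aliases into a single target URI means in
practice, here is a minimal Python sketch. It is not LDIF's actual
implementation (LDIF builds on Silk for identity resolution), and all URIs
below are invented examples:

# Map every known alias to its canonical target URI (toy data).
ALIASES = {
    "http://example.org/source-a/Berlin": "http://example.org/target/Berlin",
    "http://example.org/source-b/city/berlin": "http://example.org/target/Berlin",
}

def normalize(triple):
    """Rewrite subject and object URIs to their canonical target URI."""
    s, p, o = triple
    return (ALIASES.get(s, s), p, ALIASES.get(o, o))

triples = [
    ("http://example.org/source-b/city/berlin",
     "http://example.org/ontology/populationTotal",
     "3500000"),
]
print([normalize(t) for t in triples])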

 

Up till now, LDIF stored data purely in-memory. This restricted the amount
of data that could be processed. 

 

LDIF Version 0.4 introduces two new implementations of the LDIF runtime
environment which allow LDIF to scale to large data sets: 

 

1. The new triple store backed implementation scales to larger data sets on
a single machine. 

2. The new Hadoop-based implementation provides for processing very large
data sets on a Hadoop cluster, for instance on Amazon EC2. 

 

We have tested LDIF for integrating RDF data sets ranging from 25 million to
3.6 billion triples. 
A comparison of the performance of all three implementations is found on the
LDIF benchmark page:

 

http://www.assembla.com/spaces/ldif/wiki/Benchmark

 

LDIF is provided under the terms of the Apache Software License. LDIF can be
downloaded from the project webpage which also provides detailed information
about the features and the configuration of the framework:

 

http://www4.wiwiss.fu-berlin.de/bizer/ldif/

 

The development of LDIF is supported in part by Vulcan Inc. as part of its
Project Halo and by the EU FP7 project LOD2 (Grant No. 257943).

 

Lots of thanks to 

   + Andreas Schultz and Andrea Matteini who did most of the implementation
and benchmarking work as well as

   + Christian Becker and Robert Isele who also contributed to the release.

 

Cheers,

 

Chris

 

 

--

Prof. Dr. Christian Bizer

Web-based Systems Group

Freie Universität Berlin

+49 30 838 55509

  http://www.bizer.de

  ch...@bizer.de

 



2nd CFP: Linked Data on the Web (LDOW2012) Workshop at WWW2012

2012-01-09 Thread Chris Bizer
=

 

Call for Papers:

 

Linked Data on the Web (LDOW2012)

 

Workshop at WWW2012

 

http://events.linkeddata.org/ldow2012/

 

=

 

16 April, 2012 Lyon, France

 

=

 

Objectives

 

The Web is continuing to develop from a medium for publishing textual

documents into a medium for sharing structured data. In 2011, the Web

of Linked Data grew to a size of about 32 billion RDF triples, with

contributions coming increasingly from companies, governments and

other public sector bodies such as libraries, statistical bodies or

environmental agencies. In parallel, Google, Yahoo and Bing have

established the schema.org initiative, a shared set of schemata for

publishing structured data on the Web that focuses on vocabulary

agreement and low barriers of entry for data publishers. These

developments create a positive feedback loop for data publishers and

highlight new opportunities for commercial exploitation of Web data.

 

In this context, the LDOW2012 workshop provides a forum for presenting

the latest research on Linked Data and driving forward the research

agenda in this area. We expect submissions that discuss the deployment

of Linked Data in different application domains and explore the

motivation, value proposition and business models behind these

deployments, especially in relation to complementary and alternative

techniques for data provision (e.g. Web APIs, Microdata, Microformats)

and proprietary data sharing platforms (e.g. Facebook, Twitter,

Flickr, LastFM).

 

=

 

Topics of Interest

 

Topics of interest for the LDOW2012 workshop include, but are not limited
to:

 

* Linked Data Deployment

 

* case studies of Linked Data deployment and value propositions in
different application domains

* application showcases including aggregators, search engines and
marketplaces for Linked Data

* business models for Linked Data publishing and consumption

* analyzing and profiling the Web of Data

 

* Linked Data and alternative Data Provisioning and Sharing Techniques

 

* comparison of Linked Data to alternative data provisioning and sharing
techniques

* implications and limitations of a public data commons on the Web
versus company-owned sharing platforms

* increasing the value of Schema.org and OpenGraphProtocol data through
data linking

 

* Linked Data Infrastructure

 

* crawling, caching and querying Linked Data on the Web

* linking algorithms and identity resolution

* Web data integration and data fusion

* Linked Data mining and data space profiling

* tracking provenance and usage of Linked Data

* evaluating quality and trustworthiness of Linked Data

* licensing issues in Linked Data publishing

* interface and interaction paradigms for Linked Data applications

* benchmarking Linked Data tools

 

=

 

Submissions

 

We seek the following kinds of submissions:

 

   1. Full scientific papers: up to 10 pages in ACM format

   2. Short scientific and position papers: up to 5 pages in ACM format

 

Submissions must be formatted using the ACM SIG template (as per the

WWW2012 Research Track) available at

. Please

note that the author list does not need to be anonymized, as we do not

operate a double-blind review process. Submissions will be peer

reviewed by at least three independent reviewers. Accepted papers will

be presented at the workshop and included in the workshop proceedings.

At least one author of each paper is expected to register for the

workshop and attend to present the paper.

 

Please submit papers via EasyChair at



 

=

 

Important Dates

 

* Submission deadline: 13 February, 2012, 23:59 CET

* Notification of acceptance: 7 March, 2012

* Camera-ready versions of accepted papers: 23 March, 2012

* Workshop date: 16 April, 2012

 

=

 

Organising Committee

 

* Christian Bizer, Freie Universität Berlin, Germany

* Tom Heath, Talis Systems Ltd, UK

* Tim Berners-Lee, MIT CSAIL, USA

* Michael Hausenblas, DERI, NUI Galway, Ireland

 

=

 

Programme Committee 

 

* Alexandre Passant, DERI, NUI Galway, Ireland 

* Andreas Harth, Karlsruhe Institute of Technology, Germany 

* Andreas Langegger, University of Linz, Austria 

* Andy Seaborne, Epimorphics, UK 

* Anja Jentzsch, Freie Universität Berlin, Germany 

* Axel-Cyrille Ngonga Ngomo, University of Leipzig, Germany 

* Bernhard Schandl, University of Vienna, Austria 

* Christopher Brewster, Aston Business School, UK 

* D

2nd CfP: Semantic Web Challenge 2011 (Open Track and Billion Triples Track)

2011-08-01 Thread Chris Bizer
Hi all,

 

there are less than 2 months left until the submission deadline of the
Semantic Web Challenge 2011 and we thus recirculate the Call for
Participation.

 

The Semantic Web Challenge is the premier event for demonstrating practical
progress towards achieving the vision of the Semantic Web. 
The Semantic Web Challenge 2011 will take place at the 10th International
Semantic Web Conference in Bonn, Germany on October 23-27. 

 

The submission deadline for the Semantic Web Challenge 2011  is 

 

September 30, 2011, 23:59 CET.

 

More information about the Semantic Web Challenge 2011 as well as about the
winners of the former challenges is found at

 

  http://challenge.semanticweb.org/

 

The Billion Triples Dataset can be downloaded from

 

http://km.aifb.kit.edu/projects/btc-2011/

 

The Call for Participation for the Semantic Web Challenge 2011 is found
below.

 

We are looking forward to your submissions and to having another exciting
challenge this year!

 

Best,

 

Diane and Chris

 

 

Call for Participation for the 9th Semantic Web Challenge 

Open Track and Billion Triples Track 

at the 10th International Semantic Web Conference ISWC 2011 
Bonn, Germany 
October 23-27, 2011 
  http://challenge.semanticweb.org/ 

Introduction

Submissions are now invited for the 9th annual Semantic Web Challenge, the
premier event for demonstrating practical progress towards achieving the
vision of the Semantic Web. The central idea of the Semantic Web is to
extend the current human-readable Web by encoding some of the semantics of
resources in a machine-processable form. Moving beyond syntax opens the door
to more advanced applications and functionality on the Web. Computers will
be better able to search, process, integrate and present the content of
these resources in a meaningful, intelligent manner. 

As the core technological building blocks are now in place, the next
challenge is to demonstrate the benefits of semantic technologies by
developing integrated, easy to use applications that can provide new levels
of Web functionality for end users on the Web or within enterprise settings.
Applications submitted should give evidence of clear practical value that
goes above and beyond what is possible with conventional web technologies
alone. 

As in previous years, the Semantic Web Challenge 2011 will consist of two
tracks: the Open Track and the Billion Triples Track. The key difference
between the two tracks is that the Billion Triples Track requires the
participants to make use of the data set that has been crawled from the Web
and is provided by the organizers. The Open Track has no such restrictions.
As before, the Challenge is open to everyone from industry and academia. The
authors of the best applications will be awarded prizes and featured
prominently at special sessions during the conference. 

The overall goal of this event is to advance our understanding of how
Semantic Web technologies can be exploited to produce useful applications
for the Web. Semantic Web applications should integrate, combine, and deduce
information from various sources to assist users in performing specific
tasks. 

Challenge Criteria

The Challenge is defined in terms of minimum requirements and additional
desirable features that submissions should exhibit. The minimum requirements
and the additional desirable features are listed below per track. 

Open Track

Minimal requirements

1.  The application has to be an end-user application, i.e. an
application that provides a practical value to general Web users or, if this
is not the case, at least to domain experts. 
2.  The information sources used 

*   should be under diverse ownership or control 
*   should be heterogeneous (syntactically, structurally, and
semantically), and 
*   should contain substantial quantities of real world data (i.e. not
toy examples). 

3.  The meaning of data has to play a central role. 

*   Meaning must be represented using Semantic Web technologies. 
*   Data must be manipulated/processed in interesting ways to derive
useful information and 
*   this semantic information processing has to play a central role in
achieving things that alternative technologies cannot do as well, or at all;


Additional Desirable Features 

In addition to the above minimum requirements, we note other desirable
features that will be used as criteria to evaluate submissions. 

*   The application provides an attractive and functional Web interface
(for human users) 
*   The application should be scalable (in terms of the amount of data
used and in terms of distributed components working together). Ideally, the
application should use all data that is currently published on the Semantic
Web. 
*   Rigorous evaluations have taken place that demonstrate the benefits
of semantic technologies, or validate the results obtained. 
*   Novelt

Re: ANN: LDIF - Linked Data Integration Framework V0.1 released.

2011-06-30 Thread Chris Bizer
Hi Ruben,

> The important thing here is that the R2R patterns can be generated from
> regular RDFS and OWL constructs (because these have a well-defined
> meaning!), while the other way round is difficult and impossible in
> general.
> If your (or anyone else's) software needs a different representation,
> why not create it from RDF documents that use those Semantic Web
> foundations instead of forcing people to write those instructions?

For simple mappings that can be expressed using standard terms like
owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf or
rdfs:subPropertyOf, we don't force people to write R2R syntax.

The R2R framework understands these constructs and, when loading mappings
from a file or the Web, simply rewrites these standard terms into the
equivalent internal R2R representation.

So we build on the existing standards, but just decided that for complex
mappings that require structural transformations and value transformation
functions we prefer a graph pattern based syntax.
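
To make this rewriting step concrete, here is a hedged Python sketch using
rdflib. It is not the R2R framework's own code; the file name is a
placeholder and the rewrite is shown in one direction only:

from rdflib import Graph
from rdflib.namespace import OWL, RDFS

# Load a mappings file and turn standard OWL/RDFS mapping terms into simple
# source/target graph-pattern pairs.
g = Graph()
g.parse("mappings.ttl", format="turtle")

internal_mappings = []
for source_cls, target_cls in g.subject_objects(OWL.equivalentClass):
    internal_mappings.append(
        {"source": f"?SUBJ a <{source_cls}>", "target": f"?SUBJ a <{target_cls}>"})
for source_prop, target_prop in g.subject_objects(RDFS.subPropertyOf):
    internal_mappings.append(
        {"source": f"?SUBJ <{source_prop}> ?o", "target": f"?SUBJ <{target_prop}> ?o"})

for m in internal_mappings:
    print(m["source"], "->", m["target"])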

Best,

Chris


-----Original Message-----
From: Ruben Verborgh [mailto:ruben.verbo...@ugent.be] 
Sent: Friday, 1 July 2011 07:49
To: Chris Bizer
Cc: 'public-lod'; 'Semantic Web'; semantic...@yahoogroups.com
Subject: Re: ANN: LDIF - Linked Data Integration Framework V0.1 released.

Hi Chris,

Sounds like a challenge indeed :)
Thanks for bringing this to my attention.

While we have a lot of experience with reasoning, we never tried to go to
the billions. I contacted Jos De Roo, the author of the EYE reasoner, to see
what would be possible. I think we might at least be able to perform some
interesting stuff.

Note however that performance is a separate issue from what I was saying
before. No matter how well the LDIF Hadoop implementation will perform (and
I am curious to find out!), for me, it doesn't justify creating a whole new
semantics.
The important thing here is that the R2R patterns can be generated from
regular RDFS and OWL constructs (because these have a well-defined
meaning!), while the other way round is difficult and impossible in general.
If your (or anyone else's) software needs a different representation, why
not create it from RDF documents that use those Semantic Web foundations
instead of forcing people to write those instructions?
Reuse is so important in our community, and while software will someday be
able to bring a lot of data together, humans will always be responsible for
getting things right at the very base.

Cheers,

Ruben

On 30 Jun 2011, at 22:34, Chris Bizer wrote:

> Hi Ruben,
> 
>> Thanks for the fast and detailed reply, it's a very interesting
>> discussion.
>> 
>> Indeed, there are several ways for mapping and identity resolution.
>> But what strikes me is that people in the community seem to be
>> insufficiently aware of the possibilities and performance of current
>> reasoners.
> 
> Possibly. But luckily we are today in the position to just give it a try.
> 
> So an idea with my Semantic Web Challenge hat on:
> 
> Why not take the Billion Triples 2011 data set
> (http://challenge.semanticweb.org/), which consists of 2 billion triples
> that have been recently crawled from the Web, and try to find all data in
> the dataset about authors and their publications, map this data to a
> single target schema and merge all duplicates?
> 
> Our current LDIF in-memory implementation is not capable of doing this, as
> 2 billion triples are too much data for it. But with the planned
> Hadoop-based implementation we are hoping to get into this range.
> 
> It would be very interesting if somebody else would try to solve the task
> above using a reasoner-based approach and we could then compare the number
> of authors and publications identified as well as the duration of the data
> integration process.
> 
> Anybody interested?
> 
> Cheers,
> 
> Chris




Re: ANN: LDIF - Linked Data Integration Framework V0.1 released.

2011-06-30 Thread Chris Bizer
Hi Ruben,

> Thanks for the fast and detailed reply, it's a very interesting
> discussion.
>
> Indeed, there are several ways for mapping and identity resolution.
> But what strikes me is that people in the community seem to be
> insufficiently aware of the possibilities and performance of current
> reasoners.

Possibly. But luckily we are today in the position to just give it a try.

So an idea with my Semantic Web Challenge hat on:

Why not take the Billion Triples 2011 data set
(http://challenge.semanticweb.org/), which consists of 2 billion triples that
have been recently crawled from the Web, and try to find all data in the
dataset about authors and their publications, map this data to a single
target schema and merge all duplicates?

Our current LDIF in-memory implementation is not capable of doing this, as 2
billion triples are too much data for it. But with the planned Hadoop-based
implementation we are hoping to get into this range.

It would be very interesting if somebody else would try to solve the task
above using a reasoner-based approach and we could then compare the number
of authors and publications identified as well as the duration of the data
integration process.

Anybody interested?

Cheers,

Chris


> As you can see, the data translation requires lots of structural
> transformations as well as complex property value transformations using
> various functions. All things that current logical formalisms are not
> very good at.

Oh yes, they are. All needed transformations in your paper can be performed
by at least two reasoners: cwm [1] and EYE [2], by using built-ins [3].
Included are regular expressions and datatype transforms.
Frankly, every transform in the R2R example can be expressed as an N3 rule.

> If I as an application developer
> want to get a job done, what does it help me if I can exchange mappings
> between different tools that all don't get the job done?

Because different tools can contribute different results, and if you use a
common language and idiom, they all can work with the same data and
metadata.

> more and more developers know SPARQL, which makes it easier for them to
> learn R2R.

The developers that know SPARQL are a proper subset of those that know plain
RDF, which is what I suggest using. And even if rules are necessary, N3 is
only a small extension of RDF.

> Benchmark we have the feeling that SPARQL engines are more suitable for
> this task than current reasoning engines due to their performance problems
> as well as problems in dealing with inconsistent data.

The extremely solid performance [4] of EYE is too little known. It can
achieve things in linear time that other reasoners can never solve.

But my main point is semantics. Why make a new system with its own meanings
and interpretations, when there is so much to do with plain RDF and its
widely known vocabularies (RDFS, OWL)?
Ironically, a tool which contributes to the reconciliation of different RDF
sources does not use common vocabularies to express well-known
relationships.

Cheers,

Ruben

[1] http://www.w3.org/2000/10/swap/doc/cwm.html
[2] http://eulersharp.sourceforge.net/
[3] http://www.w3.org/2000/10/swap/doc/CwmBuiltins
[4] http://eulersharp.sourceforge.net/2003/03swap/dtb-2010.txt

On 30 Jun 2011, at 10:51, Chris Bizer wrote:

> Hi Ruben,
> 
> thank you for your detailed feedback.
> 
> Of course it is always a question of taste how you prefer to express data
> translation rules and I agree that simple mappings can also be expressed
> using standard OWL constructs.
> 
> When designing the R2R mapping language, we first analyzed the real-world
> requirements that arise if you try to properly integrate data from
> existing Linked Data on the Web. We summarize our findings in Section 5 of
> the following paper:
> http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/
> BizerSchultz-COLD-R2R-Paper.pdf
> As you can see, the data translation requires lots of structural
> transformations as well as complex property value transformations using
> various functions. All things that current logical formalisms are not
> very good at.
> 
> Other reasons why we chose to base the mapping language on SPARQL were
> that:
> 
> 1. more and more developers know SPARQL, which makes it easier for them to
> learn R2R.
> 2. we want to be able to translate large amounts (billions of triples in
> the mid-term) of messy, inconsistent Web data, and from our experience
> with the BSBM Benchmark we have the feeling that SPARQL engines are more
> suitable for this task than current reasoning engines due to their
> performance problems as well as problems in dealing with inconsistent
> data.
> 
> I disagree with you that R2R mappings are not suitable for being exchanged
> on the Web. In contrast they were especially designed for being published
> and discovered on the

Re: ANN: LDIF - Linked Data Integration Framework V0.1 released.

2011-06-30 Thread Chris Bizer
Hi Ruben,

thank you for your detailed feedback.

Of course it is always a question of taste how you prefer to express data
translation rules and I agree that simple mappings can also be expressed
using standard OWL constructs.

When designing the R2R mapping language, we first analyzed the real-world
requirements that arise if you try to properly integrate data from existing
Linked Data on the Web. We summarize our findings in Section 5 of the
following paper:
http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/
BizerSchultz-COLD-R2R-Paper.pdf
As you can see, the data translation requires lots of structural
transformations as well as complex property value transformations using
various functions. All things that current logical formalisms are not very
good at.

Other reasons why we chose to base the mapping language on SPARQL were
that:

1. more and more developers know SPARQL, which makes it easier for them to
learn R2R.
2. we want to be able to translate large amounts (billions of triples in the
mid-term) of messy, inconsistent Web data, and from our experience with the
BSBM Benchmark we have the feeling that SPARQL engines are more suitable for
this task than current reasoning engines due to their performance problems
as well as problems in dealing with inconsistent data.

I disagree with you that R2R mappings are not suitable for being exchanged
on the Web. In contrast they were especially designed for being published
and discovered on the Web and allow partial mappings from different sources
to be easily combined (see paper above for details about this).

I think your argument about the portability of mappings between different
tools is currently only partially valid. If I as an application developer
want to get a job done, what does it help me if I can exchange mappings
between different tools that all don't get the job done?

Also note that with LDIF we aim to provide identity resolution in addition
to schema mapping. It is well known that identity resolution in practical
settings requires rather complex matching heuristics (see the Silk papers
for details about the different matchers that are usually employed), and
identity resolution is again a topic where reasoning engines don't have too
much to offer.

But again, there are different ways and tastes when it comes to expressing
mapping rules and identity resolution heuristics. R2R and Silk LSL are our
approaches to getting the job done. We are of course happy if other people
provide working solutions for the task of integrating and cleansing messy
data from the Web of Linked Data, and we are happy to compare our approach
with theirs.

Cheers,

Chris


-----Original Message-----
From: Ruben Verborgh [mailto:ruben.verbo...@ugent.be] 
Sent: Thursday, 30 June 2011 10:04
To: Chris Bizer
Cc: 'public-lod'; 'Semantic Web'; semantic...@yahoogroups.com
Subject: Re: ANN: LDIF - Linked Data Integration Framework V0.1 released.

Hi Chris,

I've taken a look at your work and it is certainly interesting.

However, I have a couple of questions regarding the approach you have
taken.
The example [1] shows that we need to create a specific mapping. But can we
call this "semantic"?
It is a configuration file which can only be understood by a specific tool.
It could as well have been XML or another format.
Why not choose to express the same things using existing, semantic
predicates, which can be understood by different tools and express actual
knowledge?
And why not rely on existing ontologies that express relations semantically,
and reuse portable knowledge?
Example:

mp:Gene
r2r:sourcePattern "?SUBJ a genes:gene";
r2r:targetPattern "?SUBJ a smwcat:Gene".

could be

genes:gene owl:sameAs smwcat:Gene.

Not only does this have universally accepted semantics, it is also portable
to different situations. For example:
_:specializedGene rdfs:subClassOf genes:gene.


Another thing is that I do not agree with the pattern literals.
If we take a look at such a pattern:

"?SUBJ a genes:gene",

we see there are a lot of implicit things here.
First, the prefix needs to be looked up using the r2r:prefixDefinitions
predicate. So a specific syntax (Turtle prefixes) is tied to a conceptual
model. I can imagine a lot of problems here. Clearly, r2r:prefixDefinitions
is some kind of functional property. But when are two prefixDefinitions the
same? Exact string comparison is not the answer.
But the bigger problem I'm having is with the variables. With the ?SUBJ
notation, you seem to add implicit support for universal quantification.
This last sentence clarifies the big issue: "implicit". Variables are
placeholders identified by a certain name in a certain scope, but the name
itself is unimportant.

Concretely, "?SUBJ a genes:gene" should mean the same as "?s a genes:gene".
Except that it doesn't.
Because now, "?SUBJ a smwcat:Gene" is n

ANN: LDIF - Linked Data Integration Framework V0.1 released.

2011-06-29 Thread Chris Bizer
Hi all,

 

we are happy to announce the initial release of the LDIF – Linked Data
Integration Framework today.

 

LDIF is a software component for building Linked Data applications which
translates heterogeneous Linked Data from the Web into 

a clean, local target representation while keeping track of data provenance.

 

Applications that consume Linked Data from the Web are confronted with the
following two challenges:

 

1. data sources use a wide range of different RDF vocabularies to represent
data about the same type of entity.

2. the same real-world entity, for instance a person or a place, is
identified with different URIs within different data sources. 

 

The usage of various vocabularies as well as the usage of URI aliases makes
it very cumbersome for an application developer to write for instance SPARQL
queries against Web data that originates from multiple sources. 

 

A successful approach to ease using Web data in the application context is
to translate heterogeneous data into a single local target vocabulary and to
replace URI aliases with a single target URI on the client side before
starting to ask SPARQL queries against the data. 

 

Up-till-now, there have not been any integrated tools available that help
application developers with these tasks. 

 

With LDIF, we try to fill this gap and provide an initial alpha version of
an open-source Linked Data Integration Framework that can be used by Linked
Data applications to translate Web data and normalize URI aliases.

 

For Identity resolution, LDIF builds on the Silk Link Discovery Framework.

For data translation, LDIF employs the R2R Mapping Framework. 



More information about LDIF and a concrete usage example is provided on the
LDIF website at

 

http://www4.wiwiss.fu-berlin.de/bizer/ldif/

 

Lots of thanks to 

 

Andreas Schultz (FUB) 

Andrea Matteini (MES) 

Robert Isele (FUB) 

Christian Becker (MES) 

 

for their great work on the framework.

 

Best,

 

Chris

 

 

Acknowledgments

 

The development of LDIF is supported in part by Vulcan Inc. as part of its
Project Halo and by the EU FP7 project LOD2 - Creating Knowledge out of
Interlinked Data (Grant No. 257943). 

 

--

Prof. Dr. Christian Bizer

Web-based Systems Group

Freie Universität Berlin

+49 30 838 55509

  http://www.bizer.de

  ch...@bizer.de

 



Re: Semantic Web Challenge 2011 CfP and Billion Triple Challenge 2011 Data Set published.

2011-06-17 Thread Chris Bizer
Hi Giovanni,

 

yes, it’s great that you and your team have provided the Sindice crawl as a
dump for TREC2011.

 

Now, the community has two large-scale datasets for experimentation. 

 

Your dataset, which covers various types of structured data on the Web
(RDFa, Web APIs, microformats, …), as well as the new Billion Triple
Dataset, which focuses on data that is published according to the Linked
Data principles.

 

Our dataset is relatively current (May/June 2011) and we also still provide
the 2010 and 2009 versions of the dataset for download so that people can
analyze the evolution of the Web of Linked Data.

 

Your dataset covers the whole time span (2009-2011). Does the dataset
contain any meta-information about how old specific parts of the dataset are
so that people can also analyze the evolution? 

 

Let's hope that Google, Yahoo or Microsoft will soon start providing an API
over the Schema.org data that they extract from webpages (or even provide
this data as a dump).

 

Then, the community would have three real-world datasets as a basis for
future research :-)

 

Cheers,

 

Chris

 

 

 

From: g.tummare...@gmail.com [mailto:g.tummare...@gmail.com] On Behalf Of
Giovanni Tummarello
Sent: Friday, 17 June 2011 13:35
To: Chris Bizer
Cc: Semantic Web; public-lod; semantic...@yahoogroups.com
Subject: Re: Semantic Web Challenge 2011 CfP and Billion Triple Challenge
2011 Data Set published.

 

 

This year, the Billion Triple Challenge data set consists of 2 billion
triples. The dataset was crawled during May/June 2011 using a random sample
of URIs from the BTC 2010 dataset as seed URIs. Lots of thanks to Andreas
Harth for all his effort put into crawling the web to compile this dataset,
and to the Karlsruher Institut für Technologie which provided the necessary
hardware for this labour-intensive task.

 

 

 

On a related note, 

 

 while nothing can beat a custom job obviously,

 

I feel like reminding those that don't have said mighty
time/money/resources that any amount of data one wants can be obtained from
the repositories in Sindice, which we do make freely available for things
like this (0 to 20++ billion triples, LOD or non-LOD, microformats, RDFa,
custom filtered, etc.).

 

See the TREC 2011 competition
http://data.sindice.com/trec2011/download.html (1TB+ of data!) or the
recent W3C data analysis which is leading to a new recommendation
(http://www.w3.org/2010/02/rdfa/profile/data/), etc.

 

trying to help. 

Congrats of course on the great job, guys, for the Semantic Web Challenge,
which is a long-standing great initiative!

Gio



Semantic Web Challenge 2011 CfP and Billion Triple Challenge 2011 Data Set published.

2011-06-17 Thread Chris Bizer
Hi all,

 

we are happy to announce that the Billion Triples Challenge 2011 Data Set
has been published yesterday. 

 

We thus circulate the Call for Participation for the 9th Semantic Web
Challenge 2011 again. 

 

This year, the Billion Triple Challenge data set consists of 2 billion
triples. The dataset was crawled during May/June 2011 using a random sample
of URIs from the BTC 2010 dataset as seed URIs. Lots of thanks to Andreas
Harth for all his effort put into crawling the web to compile this dataset,
and to the Karlsruher Institut für Technologie which provided the necessary
hardware for this labour-intensive task.

 

The Semantic Web Challenge 2011 will take place at the 10th International
Semantic Web Conference in Bonn, Germany, on October 23-27. We are looking
forward to receiving your submissions by September 30, 2011, 23:59 CET.

 

More information about the Semantic Web Challenge 2011 as well as about the
former challenges is found at

 

  http://challenge.semanticweb.org/

 

Best,

 

Diane and Chris

 

 

Call for Participation for the 9th Semantic Web Challenge 

Open Track and Billion Triples Track 

at the 10th International Semantic Web Conference ISWC 2011 
Bonn, Germany 
October 23-27, 2011 
  http://challenge.semanticweb.org/ 

Introduction

Submissions are now invited for the 9th annual Semantic Web Challenge, the
premier event for demonstrating practical progress towards achieving the
vision of the Semantic Web. The central idea of the Semantic Web is to
extend the current human-readable Web by encoding some of the semantics of
resources in a machine-processable form. Moving beyond syntax opens the door
to more advanced applications and functionality on the Web. Computers will
be better able to search, process, integrate and present the content of
these resources in a meaningful, intelligent manner. 

As the core technological building blocks are now in place, the next
challenge is to demonstrate the benefits of semantic technologies by
developing integrated, easy to use applications that can provide new levels
of Web functionality for end users on the Web or within enterprise settings.
Applications submitted should give evidence of clear practical value that
goes above and beyond what is possible with conventional web technologies
alone. 

As in previous years, the Semantic Web Challenge 2011 will consist of two
tracks: the Open Track and the Billion Triples Track. The key difference
between the two tracks is that the Billion Triples Track requires the
participants to make use of the data set that has been crawled from the Web
and is provided by the organizers. The Open Track has no such restrictions.
As before, the Challenge is open to everyone from industry and academia. The
authors of the best applications will be awarded prizes and featured
prominently at special sessions during the conference. 

The overall goal of this event is to advance our understanding of how
Semantic Web technologies can be exploited to produce useful applications
for the Web. Semantic Web applications should integrate, combine, and deduce
information from various sources to assist users in performing specific
tasks. 

Challenge Criteria

The Challenge is defined in terms of minimum requirements and additional
desirable features that submissions should exhibit. The minimum requirements
and the additional desirable features are listed below per track. 

Open Track

Minimal requirements

1.  The application has to be an end-user application, i.e. an
application that provides a practical value to general Web users or, if this
is not the case, at least to domain experts. 
2.  The information sources used 

*   should be under diverse ownership or control 
*   should be heterogeneous (syntactically, structurally, and
semantically), and 
*   should contain substantial quantities of real world data (i.e. not
toy examples). 

3.  The meaning of data has to play a central role. 

*   Meaning must be represented using Semantic Web technologies. 
*   Data must be manipulated/processed in interesting ways to derive
useful information and 
*   this semantic information processing has to play a central role in
achieving things that alternative technologies cannot do as well, or at all;


Additional Desirable Features 

In addition to the above minimum requirements, we note other desirable
features that will be used as criteria to evaluate submissions. 

*   The application provides an attractive and functional Web interface
(for human users) 
*   The application should be scalable (in terms of the amount of data
used and in terms of distributed components working together). Ideally, the
application should use all data that is currently published on the Semantic
Web. 
*   Rigorous evaluations have taken place that demonstrate the benefits
of semantic technologies, or validate the results obtained.

ANN: Berlin SPARQL Benchmark Version 3 and Benchmarking Results

2011-02-22 Thread Chris Bizer
Hi all,

 

we are happy to announce Version 3 of the Berlin SPARQL Benchmark as well as
the results of a benchmark experiment in which we compared the query, load
and update performance of Virtuoso, Jena TDB, 4store, BigData, and BigOWLIM
using the new benchmark.

 

The Berlin SPARQL Benchmark Version 3 (BSBM V3) defines three different
query mixes that test different capabilities of RDF stores:

 

1.   The Explore query mix tests the query performance with simple SPARQL
1.0 queries.

2.   The Explore-and-Update query mix tests the read and write
performance using SPARQL 1.0 SELECT queries as well as SPARQL 1.1 Update
queries.

3.   The Business Intelligence query mix consists of complex SPARQL 1.1
queries that rely on aggregation as well as subqueries, each of which touches
large parts of the test dataset.
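
To make the difference between the read-only and the read/write mixes more
concrete, here is a minimal sketch that sends one SELECT query and one SPARQL
1.1 Update request over the standard SPARQL protocol using Python's standard
library. The queries are illustrative only (they are not taken from the BSBM
specification), and the endpoint URLs, prefixes and triples are assumptions
that need to be adapted to the store under test.

    # Minimal sketch: one Explore-style read and one Update-style write.
    # Endpoint URLs, prefixes and triples are illustrative assumptions.
    import urllib.parse
    import urllib.request

    QUERY_ENDPOINT = "http://localhost:8890/sparql"          # assumed query endpoint
    UPDATE_ENDPOINT = "http://localhost:8890/sparql-update"  # assumed update endpoint

    select_query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?product ?label
    WHERE { ?product rdfs:label ?label . }
    LIMIT 10
    """

    update_request = """
    PREFIX ex: <http://example.org/>
    INSERT DATA { ex:offer42 ex:price "19.99" . }
    """

    # Read: the query is sent as a 'query' parameter, results requested as JSON.
    url = QUERY_ENDPOINT + "?" + urllib.parse.urlencode({"query": select_query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    print(urllib.request.urlopen(request).read()[:200])

    # Write: the update is sent as an 'update' parameter via POST.
    data = urllib.parse.urlencode({"update": update_request}).encode("utf-8")
    urllib.request.urlopen(urllib.request.Request(UPDATE_ENDPOINT, data=data))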

 

The BSBM V3 specification is found at

 

http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/20101129/

 

We also conducted a benchmark experiment in which we compared query, load
and update performance of Virtuoso, Jena TDB, 4store, BigData, and BigOWLIM
using the new benchmark.

 

We tested the stores with 100 million triple and 200 million triple data
sets and ran the Explore as well as the Explore-And-Update query mixes. The
results of this experiment are found at

 

http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html

 

It is interesting to see that:

 

1.   Virtuoso dominates the Explore use case for multiple clients.

2.   BigOwlim also shows good multi-client scaling behavior for the 100M
dataset.

3.   4store is the fastest store for the Explore-And-Update query mix.

4.   BigOwlim is able to load the 200M dataset in under 40 minutes,
which comes close to the bulk load times of relational databases like MySQL.

5.   All stores that we have previously tested with BSBM V2 improved
their query performance and load times.

 

We also tried to run the Business Intelligence query mix against the stores.
BigData and 4store currently do not provide all SPARQL features that are
required to run the BI query mix. We thus ran the Business Intelligence query
mix only against Virtuoso, TDB and BigOwlim. In doing so, we ran into several
"technical problems" that prevented us from finishing the tests and from
reporting meaningful results. We thus decided to give the store vendors more
time to fix and optimize their stores and will run the BI query mix experiment
again in about four months (July 2011).

 

Thanks a lot to Orri Erling for proposing the Business Intelligence use case
and for providing initial queries for the query mix. Lots of thanks also go to
Ivan Mikhailov for his in-depth review of the Business Intelligence query mix
and for finding several bugs in the queries. We also want to thank Peter Boncz
and Hugh Williams for their feedback on the new version of the BSBM benchmark.

 

We want to thank the store vendors and implementers for helping us to set up
and configure their stores for the experiment. Lots of thanks to Andy
Seaborne, Ivan Mikhailov, Hugh Williams, Zdravko Tashev, Atanas Kiryakov,
Barry Bishop, Bryan Thompson, Mike Personick and Steve Harris.

 

The work on the BSBM Benchmark Version 3 is funded by the LOD2 - Creating
Knowledge out of Linked Data project (http://lod2.eu/). 

 

More information about the Berlin SPARQL Benchmark is found at 

 

http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/

 

 

Cheers,

 

Andreas Schultz and Chris Bizer

 

 

--

Prof. Dr. Christian Bizer

Web-based Systems Group

Freie Universität Berlin

+49 30 838 55509

http://www.bizer.de

ch...@bizer.de

 



ANN: New book about Linked Data published

2011-02-17 Thread Chris Bizer
Hi all,

 

Tom Heath and I have been working on a book about Linked Data over the last
months. We are very happy to announce today that the PDF version of the book
is available from Morgan & Claypool Publishers.

 

The book gives an overview of the principles of Linked Data as well as the
Web of Data that has emerged through the application of these principles. It
discusses patterns for publishing Linked Data, describes deployed Linked
Data applications and examines their architecture.

 

The book is published by Morgan & Claypool in the series Synthesis Lectures
on the Semantic Web: Theory and Technology edited by James Hendler and Frank
van Harmelen. See:

 

http://www.morganclaypool.com/doi/abs/10.2200/S00334ED1V01Y201102WBE001

 

The PDF version of the book is currently accessible to members of
organizations that have licensed the Morgan & Claypool Synthesis Lectures
collection.  In addition, the PDF version of the book can be purchased for
30 US$ via the Morgan & Claypool website. 

 

Within the next two weeks, the print version of the book will be available
at Amazon. A bit later, the print version will be available via other
channels and can also be ordered directly from Morgan & Claypool. 

 

On March 1st, 2011, we will publish a free HTML version of the book at
http://linkeddatabook.com/

We are currently still busy with producing the HTML version, so please
excuse the delay. 

 

Please find the abstract and table of contents of the book below:

 

Abstract of the Book 

 

The World Wide Web has enabled the creation of a global information space
comprising linked documents. As the Web becomes ever more enmeshed with our
daily lives, there is a growing desire for direct access to raw data not
currently available on the Web or bound up in hypertext documents. Linked
Data provides a publishing paradigm in which not only documents, but also
data, can be a first class citizen of the Web, thereby enabling the
extension of the Web with a global data space based on open standards - the
Web of Data. In this Synthesis lecture we provide readers with a detailed
technical introduction to Linked Data. We begin by outlining the basic
principles of Linked Data, including coverage of relevant aspects of Web
architecture. The remainder of the text is based around two main themes -
the publication and consumption of Linked Data. Drawing on a practical
Linked Data scenario, we provide guidance and best practices on:
architectural approaches to publishing Linked Data; choosing URIs and
vocabularies to identify and describe resources; deciding what data to
return in a description of a resource on the Web; methods and frameworks for
automated linking of data sets; and testing and debugging approaches for
Linked Data deployments. We give an overview of existing Linked Data
applications and then examine the architectures that are used to consume
Linked Data from the Web, alongside existing tools and frameworks that
enable these. Readers can expect to gain a rich technical understanding of
Linked Data fundamentals, as the basis for application development, research
or further study. 
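
As a tiny taste of the consumption side that the book covers, the following
sketch dereferences a Linked Data URI with content negotiation from Python.
The example resource URI and the requested RDF format are illustrative
assumptions, and the snippet is not taken from the book.

    # Minimal sketch: dereferencing a Linked Data URI with content negotiation.
    # The resource URI and the requested format are illustrative assumptions.
    import urllib.request

    resource = "http://dbpedia.org/resource/Berlin"   # example Linked Data URI
    request = urllib.request.Request(
        resource, headers={"Accept": "application/rdf+xml"})

    # urllib follows redirects, e.g. a 303 redirect to the RDF document.
    with urllib.request.urlopen(request) as response:
        print("Retrieved:", response.geturl())
        print(response.read(500).decode("utf-8", errors="replace"))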

 

Table of Contents

 

1. Introduction 

 

1.1 The Data Deluge 

1.2 The Rationale for Linked Data

1.2.1 Structure Enables Sophisticated Processing 

1.2.2 Hyperlinks Connect Distributed Data 

1.3 From Data Islands to a Global Data Space 

 

2 Principles of Linked Data 

 

2.1 The Principles in a Nutshell 

2.2 Naming Things with URIs 

2.3 Making URIs Dereferenceable

2.3.1 303 URIs 

2.3.2 Hash URIs

2.3.3 Hash versus 303 

2.4 Providing Useful RDF Information

2.4.1 The RDF Data Model 

2.4.2 RDF Serialization Formats 

2.5 Including Links to other Things 

2.5.1 Relationship Links 

2.5.2 Identity Links 

2.5.3 Vocabulary Links 

2.6 Conclusions 

 

3 The Web of Data

 

3.1 Bootstrapping the Web of Data

3.2 Topology of the Web of Data

3.2.1 Cross-Domain Data 

3.2.2 Geographic Data

3.2.3 Media Data 

3.2.4 Government Data 

3.2.5 Libraries and Education 

3.2.6 Life Sciences Data 

3.2.7 Retail and Commerce 

3.2.8 User Generated Content and Social Media 

3.3 Conclusions

 

4 Linked Data Design Considerations

 

4.1 Using URIs as Names for Things 

4.1.1 Minting HTTP URIs 

4.1.2 Guidelines for Creating Cool URIs 

4.1.3 Example URIs 

4.2 Describing Things with RDF

4.2.1 Literal Triples and Outgoing Links 

4.2.2 Incoming Links 

4.2.3 Triples that Describe Related Resources 

4.2.4 Triples that Describe the Description 

4.3 Publishing Data about Data 

4.3.1 Describing a Data Set 

4.3.2 Provenance Metadata 

4.3.3 Licenses, Waivers and Norms for Data

4.4 Choosing and Using Vocabularies to Describe Data 

4.4.1 SKOS, RDFS and OWL

4.4.2 RDFS Basics 

4.4.3 A Little OWL 

4.4.4 Reusing Existing Terms

4.4.5 Selecting Vocabularies 

4.4.6 Defining Terms 

4.5 Making Links with RDF

4.5.1 Making Links within a Data Set 

4.5.2 Making Links with External Data Sources 

4.5.3 Setting RDF Links Manually 

4.5.4 Au

Re: ANN: DBpedia 3.6 released

2011-01-19 Thread Chris Bizer
Hi Antoine,

> I was wondering: do you keep older versions of the DBpedia datasets?

we do and they are all reachable from the DBpedia download page at

http://wiki.dbpedia.org/Downloads36

Just click on the "Older Versions" links.

Cheers,

Chris


> -----Original Message-----
> From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org]
> On Behalf Of Antoine Zimmermann
> Sent: Wednesday, 19 January 2011 16:49
> To: Chris Bizer
> Cc: dbpedia-announceme...@lists.sourceforge.net; dbpedia-
> discuss...@lists.sourceforge.net; 'Semantic Web'; 'public-lod'
> Subject: Re: ANN: DBpedia 3.6 released
> 
> Dear Chris and the DBpedia crew,
> 
> 
> As always, a new version of DBpedia is very good news for the Semantic
> Web and Linked Data.
> 
> I was wondering: do you keep older versions of the DBpedia datasets?
> If yes, would you allow people to download older versions for research
> purposes?
> 
> This would be very useful in order to study the dynamics of RDF data, or
> the dynamics of DBpedia itself. There are already papers on the dynamics
> of Wikipedia but I am not aware of corresponding work for DBPedia.
> 
> 
> Regards,
> AZ.
> 
> On 17/01/2011 14:10, Chris Bizer wrote:
>  > Hi all,
>  >
>  > we are happy to announce the release of DBpedia 3.6. The new release is
>  > based on Wikipedia dumps dating from October/November 2010.
>  >
>  > The new DBpedia dataset describes more than 3.5 million things, of which
>  > 1.67 million are classified in a consistent ontology, including 364,000
>  > persons, 462,000 places, 99,000 music albums, 54,000 films, 16,500 video
>  > games, 148,000 organizations, 148,000 species and 5,200 diseases.
>  >
>  > The DBpedia dataset features labels and abstracts for 3.5 million
> things in
>  > up to 97 different languages; 1,850,000 links to images and 5,900,000
> links
>  > to external web pages; 6,500,000 external links into other RDF
> datasets, and
>  > 632,000 Wikipedia categories.
>  >
>  > The dataset consists of 672 million pieces of information (RDF
> triples) out
>  > of which 286 million were extracted from the English edition of Wikipedia
>  > and 386 million were extracted from other language editions and links to
>  > external datasets.
>  >
>  > Along with the release of the new datasets, we are happy to announce the
>  > initial release of the DBpedia MappingTool
>  > (http://mappings.dbpedia.org/index.php/MappingTool): a graphical user
>  > interface to support the community in creating and editing mappings
> as well
>  > as the ontology.
>  >
>  > The new release provides the following improvements and changes
> compared to
>  > the DBpedia 3.5.1 release:
>  >
>  > 1. Improved DBpedia Ontology as well as improved Infobox mappings
> using
>  > http://mappings.dbpedia.org/.
>  >
>  > Furthermore, there are now also mappings in languages other than
> English.
>  > These improvements are largely due to collective work by the community.
>  > There are 13.8 million RDF statements based on mappings (11.1 million in
>  > version 3.5.1). All this data is in the /ontology/ namespace. Note
> that this
>  > data is of much higher quality than the Raw Infobox data in the
> /property/
>  > namespace.
>  >
>  > Statistics of the mappings wiki on the date of release 3.6:
>  >
>  > + Mappings:
>  >   + English: 315 Infobox mappings (covers 1124 templates including
>  > redirects)
>  >   + Greek: 137 Infobox mappings (covers 192 templates including
>  > redirects)
>  >   + Hungarian: 111 Infobox mappings (covers 151 templates including
>  > redirects)
>  >   + Croatian: 36 Infobox mappings (covers 67 templates including
>  > redirects)
>  >   + German: 9 Infobox mappings
>  >   + Slovenian: 4 Infobox mappings
>  > + Ontology:
>  >   +  272 classes
>  > +  Properties:
>  >   + 629 object properties
>  >   + 706 datatype properties (they are all in the /datatype/
> namespace)
>  >
>  > 2.  Some commonly used property names changed.
>  >
>  > + Please see http://dbpedia.org/ChangeLog and
>  > http://dbpedia.org/Datasets/Properties to know which relations
> changed and
>  > update your applications accordingly!
>  >
>  > 3. New Datatypes for increased quality in mapping-based properties
>  >
>  > + xsd:positiveInteger, xsd:nonNegativeInteger, xsd:nonPositiveInteger,
>  > xsd:negativeInteger
>  >
>  > 4. Improved parsing coverage.
>  >
>  > + Parsing o

2nd CfP: Linked Data on the Web (LDOW2011) Workshop at WWW2011

2011-01-17 Thread Chris Bizer
Hi all,

 

this is just a quick reminder that the submission deadline of the 4th
International Workshop on Linked Data on the Web (LDOW2011) at WWW2011,
Hyderabad, India is approaching. The deadline is in about 3 weeks:

 

Submission deadline: 8th February, 2011

 

The LDOW2011 website is now available at:

http://events.linkeddata.org/ldow2011/

 

Please find the CfP below; we are looking forward to seeing you at LDOW2011!

 

Cheers,

 

Chris Bizer

Tom Heath

Tim Berners-Lee

Michael Hausenblas

 

 

=

 

Linked Data on the Web (LDOW2011) Workshop at WWW2011

 

http://events.linkeddata.org/ldow2011/

 

=

 

Workshop date: March 29th, 2011, Hyderabad, India

 

Submission deadline: 8th February, 2011

 

=

 

Objectives

 

The Web has developed into a global information space consisting not just of
linked documents, but also of linked data. In 2010, we have seen significant
growth in the size of the Web of Data as well as in the number of
communities contributing to its creation. In addition, there is intensive
work on applications that consume Linked Data from the Web.

 

The LDOW2011 workshop in Hyderabad follows the LDOW2008 workshop in Beijing,
the LDOW2009 workshop in Madrid, and the LDOW2010 workshop in Raleigh. As
the previous workshops, LDOW2011 is open to cover all topics related to
Linked Data publication as well as consumption, including principled
research in the areas of user interfaces for the Web of Data as well as on
issues of quality, trust and provenance in Linked Data. We also expect to
see a number of submissions related to current areas of high Linked Data
activity, such as government transparency, life sciences and the media
industry. The goal of this workshop is to provide a forum for exposing high
quality, novel research and applications in these (and related) areas. In
addition, by bringing together researchers in this field, we expect the
event to further shape the ongoing Linked Data research agenda.

 

=

 

Topics of Interest

 

Topics of interest for the LDOW2011 workshop include, but are not limited
to:

 

1. Foundations of Linked Data 

* Web architecture and dataspace theory

* dataset dynamics and synchronisation

* analysing and profiling the Web of Data

 

2. Data Linking and Fusion

* entity consolidation and linking algorithms

* Web-based data integration and data fusion

* performance and scalability of integration architectures

 

3. Write-enabled Linked Data Web

* access authentication mechanisms for Linked Datasets (WebID, etc.)

* authorisation mechanisms for Linked Datasets (WebACL, etc.)

* enabling write-access to legacy data sources (Google APIs, Flickr API,
etc.)

 

4. Data Publishing

* publishing legacy data sources as Linked Data on the Web 

* cost-benefits of the 5 star LOD plan

 

5. Data Usage 

* tracking provenance of Linked Data

* evaluating quality and trustworthiness of Linked Data

* licensing issues in Linked Data publishing

* distributed query of Linked Data

* RDF-to-X, turning RDF to legacy data 

 

6. Interaction with the Web of Data

* approaches to visualise Linked Data 

* interacting with distributed Web data

* Linked Data browsers, indexers and search engines

 

 

 

=

 

Submissions

 

We seek three kinds of submissions:

 

1. Full technical papers: up to 10 pages in ACM format

2. Short technical and position papers: up to 5 pages in ACM format

3. Demo description: up to 2 pages in ACM format

 

Submissions must be formatted using the WWW2011 templates available via the
LDOW2011 homepage. We note that the author list does not need to be
anonymised, as we do not have a double-blind review process in place.

 

Submissions will be peer reviewed by at least three independent reviewers.
Accepted papers will be presented at the workshop and included in the
workshop proceedings.

 

Please submit your paper via EasyChair at

http://www.easychair.org/conferences/?conf=ldow2011

 

=

 

Important Dates

 

Submission deadline: 8th February, 2011

Notification of acceptance: 25th February, 2011 

Camera-ready versions of accepted papers: 10th March, 2011

Workshop date: Tuesday, 28th or 29th March, 2011  

 

=

 

Organising Committee

 

Christian Bizer, Freie Universität Berlin, Germany

Tom Heath, Talis Systems Ltd, UK

Tim Berners-Lee, MIT CSAIL, USA

Michael Hausenblas, DERI, NUI Galway, Ireland

 

=

 

Contact

 

For further information about the workshop, please contact the
workshop chairs at ldow2011 [at] events [dot] linkeddata [dot] org

 

=

 

 

 

--

Prof. Dr. Christian Bizer

ANN: DBpedia 3.6 released

2011-01-17 Thread Chris Bizer
ta of the previous release and suggesting ways to increase quality and
quantity. Some results of his work were implemented in this release. 
+ Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece), Jimmy
O'Regan (Eolaistriu Technologies, Ireland), José Paulo Leal (University of
Porto, Portugal) for providing patches to improve the extraction framework. 
+ Jens Lehmann and Sören Auer (both Universität Leipzig, Germany) for
providing the new dataset via the DBpedia download server at Universität
Leipzig. 
+ Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the
dataset into the Virtuoso instance that serves the Linked Data view and
SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether
for providing the server infrastructure for DBpedia. 

The work on the new release was financially supported by: 

+ Neofonie GmbH, a Berlin-based company offering leading technologies in the
area of Web search, social media and mobile applications
(http://www.neofonie.de/). 
+ The European Commission through the project LOD2 - Creating Knowledge out
of Linked Data (http://lod2.eu/). 
+ Vulcan Inc. as part of its Project Halo (http://www.projecthalo.com/).
Vulcan Inc. creates and advances a variety of world-class endeavors and high
impact initiatives that change and improve the way we live, learn, do
business (http://www.vulcan.com/). 

More information about DBpedia is found at http://dbpedia.org/About 

Have fun with the new dataset! 

The whole DBpedia team also congratulates Wikipedia on its 10th birthday,
which was this weekend!

Cheers, 

Chris Bizer 


--
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
ch...@bizer.de





Re: Quality Criteria for Linked Data sources

2010-12-16 Thread Chris Bizer
Dear Annika,

great work, and a really nice fusion of the classic data quality criteria that
one finds in the database literature with Linked Data-specific aspects.

Three comments:

1. Your criteria seem to focus mainly on the publication of instance data and
do not say too much about the schema level. The overall goal of Linked Data is
to publish data in a self-descriptive way [1], which means that you should not
only set links at the instance level, but also at the schema level, relating
terms from different vocabularies to each other. This especially applies when
you use proprietary terms, which cannot always be avoided. Thus, maybe you
still want to add some criteria to your list about providing definitions for
proprietary vocabulary terms and setting links between different vocabularies
(a small sketch of such schema-level links follows after these comments).

2. Your criteria in the content category are only a subset of the usual
content-oriented criteria in the literature (for summaries see for instance
[2][3]). I guess you had reasons not to include all of them, but maybe you
want to check against these lists again.

3. If you want to talk in your thesis about the compliance of existing data
sources on the Web with the quality criteria, the statistics about the
compliance with different publishing best practices in the State of the LOD
Cloud document [4] could be a good starting point.
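
To make comment 1 a bit more concrete, here is a minimal sketch of schema-level
links published alongside a proprietary vocabulary, written with the rdflib
Python library. The example namespace and the chosen target terms are
assumptions for illustration, not a recommendation for any particular mapping.

    # Minimal sketch: definitions and schema-level links for proprietary terms.
    # The example namespace and target terms are illustrative assumptions.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDFS

    EX = Namespace("http://example.org/vocab#")      # assumed proprietary vocabulary
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    g = Graph()
    # Human-readable definitions for the proprietary term.
    g.add((EX.fullName, RDFS.label, Literal("full name", lang="en")))
    g.add((EX.fullName, RDFS.comment, Literal("The complete name of a person.", lang="en")))
    # Schema-level links relating proprietary terms to widely used vocabularies.
    g.add((EX.fullName, RDFS.subPropertyOf, FOAF.name))
    g.add((EX.Person, OWL.equivalentClass, FOAF.Person))

    print(g.serialize(format="turtle"))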

Please also circulate a link to your thesis on this list once you have finished 
it. It appears like this is going to be an interesting read :-)   

Cheers,

Chris

[1] http://www.w3.org/2001/tag/doc/selfDescribingDocuments.html
[2] http://portal.acm.org/citation.cfm?id=1791545
[3] http://www.diss.fu-berlin.de/diss/receive/FUDISS_thesis_2736
[4] http://www4.wiwiss.fu-berlin.de/lodcloud/state/



-Ursprüngliche Nachricht-
Von: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] Im Auftrag 
von Annika Flemming
Gesendet: Mittwoch, 15. Dezember 2010 20:50
An: public-lod@w3.org
Betreff: Quality Criteria for Linked Data sources

Hi,
I'm a student at the Humboldt University of Berlin and I'm currently writing my 
diploma thesis under the supervision of Olaf Hartig. The aim of my thesis is to 
draw up a set of criteria to assess the quality of Linked Data sources. My 
findings include eleven criteria grouped into four categories. Each criterion 
includes a set of so-called indicators. These indicators constitute a 
measurable aspect of a criterion and, thus, allow for the assessment of the 
quality of a data source w.r.t the criteria.
I've written a summary of my findings, which can be accessed here:

http://sourceforge.net/apps/mediawiki/trdf/index.php?title=Quality_Criteria_for_Linked_Data_sources

To evaluate my findings, I decided to post this summary hoping to receive some 
feedback about the criteria and indicators I suggested. Moreover, I'd like to 
initiate a discussion about my findings, and about their applicability to a 
quality assessment of data sources.

Your comments might be included in my thesis, but I won't add any names.

A further summary will follow shortly, describing a formalism based on these 
criteria and its application to several data sources.

Thanks to everyone participating,
Annika





Re: Schema Mappings (was Re: ANN: LOD Cloud - Statistics and compliance with best practices)

2010-10-23 Thread Chris Bizer
Hi Leigh and Enrico,

> Hi,
>
> On 22 October 2010 09:35, Chris Bizer  wrote:
>>> Anja has pointed to a wealth of openly
>>> available numbers (no pun intended), that have not been discussed at
all.
>> For
>>> example, only 7.5% of the data source provide a mapping of "proprietary
>>> vocabulary terms" to "other vocabulary terms". For anyone building
>>> applications to work with LOD, this is a real problem.
>>
>> Yes, this is also the figure that scared me most.
>
> This might be low for a good reason: people may be creating
> proprietary terms because they don't feel well served by existing
> vocabularies and hence defining mappings (or even just reusing terms)
> may be difficult or even impossible.

Yes, this is true in many cases and for a given point in time.

But altogether I think it is important to see web-scale data integration
more in an evolutionary fashion in which different factors play together
over time. 

In my opinion these factors are:

1. An increasing number of people start to use existing vocabularies, which
already solves the integration problem in some areas simply by agreement on
these vocabularies.
2. More and more instance data is becoming available on the Web, which makes
it easier to mine schema mappings using statistical methods.
3. Different groups in various areas want to contribute to solving the
integration problem and thus invest effort in manually aligning vocabularies
(for instance between different standards used in the libraries community or
for people and provenance related vocabularies within the W3C Social Web and
Provenance XGs).
4. The Web allows you to share mappings by publishing them as RDF. Thus many
different people and groups may provide small contributions (= hints) that
help to solve the problem in the long run.

My thinking on the topic was strongly influenced by the pay-as-you-go data
integration ideas developed by Alon Halevy and other people in the
dataspaces community. A cool paper on the topic is in my opinion:

Web-Scale Data Integration: You can afford to Pay as You Go. Madhavan, J.;
Cohen, S.; Dong, X.; Halevy, A.; Jeffery, S.; Ko, D.; Yu, C., CIDR (2007)
http://research.yahoo.com/files/paygo.pdf

describing a system that applies schema clustering in order to mine mappings
from Google Base and web table data and presents ideas on how you can deal
with the uncertainty that you introduce using ranking algorithms.

Other interesting papers in the area are:

Das Sarma, A., Dong, X., Halevy, A.: Bootstrapping pay-as-you-go data
integration
systems. Proceedings of the Conference on Management of Data, SIGMOD (2008)

Vaz Salles, M.A., Dittrich, J., Karakashian, S.K., Girard, O.R., Blunschi,
L.: iTrails: Pay-as-you-go Information Integration in Dataspaces. In:
Conference of Very Large Data Bases (VLDB 2007), 663-674 (2007)

Franklin, M.J., Halevy, A.Y., Maier, D.: From databases to dataspaces: A new
abstraction
for information management. SIGMOD Record 34(4), pp. 27–33 (2005)

Hedeler, C., et al.: Dimensions of Dataspaces. In: Proceedings of the 26th
British National
Conference on Databases, pp. 55-66 (2009)

These guys always have the idea that mappings are added to a dataspace by
administrators or mined using a single, specific method.

What I think is interesting in the Web of Linked Data setting is that
mappings can be created and published by different parties to a single
global dataspace. Meaning that the necessary effort to create the mappings
can be divided between different parties. So pay-as-you-go might evolve into
somebody-pay-as-you-go :-)
But of course also meaning that the quality of mappings is becoming
increasingly uncertain and that the information consumer needs to assess the
quality of mappings and decide which ones it wants to use.

We are currently exploring this problem space and will present a paper about
publishing and discovering mappings on the Web of Linked Data at the COLD
workshop at ISWC 2010.

http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/BizerSchultz-COLD-R2R-Paper.pdf

Central ideas of the paper are that:
1. you identify mappings with URIs so that they can be interlinked from
vocabulary definitions or voiD dataset descriptions and so that client
applications as well as Web of Data search engines can discover them.
2. A client application which discovers data that is represented using terms
that are unknown to the application may search the Web for mappings, apply a
quality evaluation heuristic to decide which alternative mappings to use and
then apply the chosen
mappings to translate data to its local schema. 
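
The R2R mapping language itself is described in the paper; purely as an
illustration of the final translation step, the sketch below applies a
discovered property-level mapping to incoming data, rewriting it into the
application's local vocabulary. The namespaces, terms and the reduction of a
mapping to a simple predicate substitution are assumptions for illustration
and are not R2R syntax.

    # Minimal sketch: translate data from an unknown source vocabulary into the
    # application's local vocabulary using a discovered property-level mapping.
    # Namespaces and terms are illustrative assumptions; this is not R2R syntax.
    from rdflib import Graph, Literal, Namespace, URIRef

    SRC = Namespace("http://example.org/source-vocab#")    # assumed source vocabulary
    LOCAL = Namespace("http://example.org/local-vocab#")   # assumed local vocabulary

    # Incoming data that uses the source term src:labelText, unknown to the client.
    incoming = Graph()
    incoming.add((URIRef("http://example.org/item/1"), SRC.labelText, Literal("An item")))

    # A discovered mapping, reduced here to a simple predicate substitution table.
    mapping = {SRC.labelText: LOCAL.title}

    # Apply the mapping: copy every triple, rewriting mapped predicates.
    translated = Graph()
    for s, p, o in incoming:
        translated.add((s, mapping.get(p, p), o))

    print(translated.serialize(format="turtle"))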

> This also strikes me as an opportunity: someone could usefully build a
> service (perhaps built on facilities in Sindice) that aggregated
> schema information and provides tools for expressing simple mappings
> and equivalencies. It could fill a dual role: r

Re: Low Quality Data (was before Re: ANN: LOD Cloud - Statistics and compliance with best practices)

2010-10-22 Thread Chris Bizer
Hi Juan,

 

 

Martin and all,

 

Can somebody point me to papers or maybe give their definition of low-quality
data when it comes to LOD? What are the criteria for data to be considered
low quality?

 

An overview of the literature on data quality can be found in my PhD thesis,
including the different definitions of the term and the like.

 

See:

 

http://www.diss.fu-berlin.de/diss/servlets/MCRFileNodeServlet/FUDISS_derivate_2736/02_Chapter2-Information-Quality.pdf?hosts=

also

http://www.diss.fu-berlin.de/2007/217/indexe.html

 

All this is from 2008. Thus, I guess there will also be newer stuff around,
but the text should properly reflect the state-of-the-art back then.

 

Cheers,

 

Chris

 

 

Thanks


Juan Sequeda
+1-575-SEQ-UEDA
www.juansequeda.com



On Fri, Oct 22, 2010 at 9:01 AM, Martin Hepp
 wrote:

The Web of documents is an open system built on people agreeing on standards
and best practices.
Open system means in this context that everybody can publish content and
that there are no restrictions on the quality of the content.
This is in my opinion one of the central facts that made the Web successful.

+100


The same is true for the Web of Data. There obviously cannot be any
restrictions on what people can/should publish (including, different
opinions on a topic, but also including pure SPAM). As on the classic Web,
it is a job of the information/data consumer to figure out which data it
wants to believe and use (definition of information quality = usefulness of
information, which is a subjective thing).
+100

 

The fact that there is obviously a lot of low quality data on the current
Web should not encourage us to publish masses of low-quality data and then
celebrate ourselves for having achieved a lot. The current Web tolerates
buggy markup, broken links, and questionable content of all types. But I
hope everybody agrees that the Web is successful because of this tolerance,
not because of the buggy content itself. Quite to the contrary, the Web has
been broadly adopted because of the lots of commonly agreed high-quality
contents.

If you continue to live the linked data landfill style it will fall back on
you, reputation-wise, funding-wise, and career-wise. Some rules hold in
ecosystems of all kinds and sizes.

Best

Martin

 



Re: ANN: LOD Cloud - Statistics and compliance with best practices

2010-10-22 Thread Chris Bizer
Hi Martin,

> The fact that there is obviously a lot of low quality data on the
> current Web should not encourage us to publish masses of low-quality
> data and then celebrate ourselves for having achieved a lot. The
> current Web tolerates buggy markup, broken links, and questionable
> content of all types. But I hope everybody agrees that the Web is
> successful because of this tolerance, not because of the buggy content
> itself. Quite to the contrary, the Web has been broadly adopted
> because of the lots of commonly agreed high-quality contents.

Sure, where is the problem? 

The same holds for the Web of Data: There is a lot of high quality content
and a lot of low quality content.
Which means - as on the classic Web - that the data consumer needs to decide
which content it wants to use.

If the Web has proved anything, it is that having a completely open
architecture is a crucial factor for being able to succeed on a global scale.
The Web of Linked Data also aims at global scale. Thus, I will keep on
betting on open solutions without curation or any other bottleneck.

> If you continue to live the linked data landfill style it will fall
> back on you, reputation-wise, funding-wise, and career-wise. Some
> rules hold in ecosystems of all kinds and sizes.

Sorry, you are leaving the grounds of scientific discussion here and I will
thus not comment.

Best,

Chris


> Best
> 
> Martin





Re: ANN: LOD Cloud - Statistics and compliance with best practices

2010-10-22 Thread Chris Bizer
Hi Denny,

thank you for your smart and insightful comments.

> I also find it a shame, that this thread has been hijacked, especially
since the
> original topic was so interesting. The original email by Anja was not
about the
> LOD cloud, but rather about -- as the title of the thread still suggests
-- the
> compliance of LOD with some best practices. Instead of the question "is X
in
> the diagram", I would much rather see a discussion on "are the selected
> quality criteria good criteria? why are some of them so little followed?
how
> can we improve the situation?" 

Absolutely. Opening up the discussion on these topics is exactly the reason
why we compiled the statistics.

In order to guide the discussion back to this topic, maybe it is useful to
repost the original link:

http://www4.wiwiss.fu-berlin.de/lodcloud/state/

A quick initial comment concerning the term "quality criteria". I think it
is essential to distinguish between:

1. The quality of the way data is published, meaning to what extent the
publishers comply with best practices (a possible set of best practices is
listed in the document)
2. The quality of the data itself. I think Enrico's comment was going into
this direction.

The Web of documents is an open system built on people agreeing on standards
and best practices.
Open system means in this context that everybody can publish content and
that there are no restrictions on the quality of the content.
This is in my opinion one of the central facts that made the Web successful.

The same is true for the Web of Data. There obviously cannot be any
restrictions on what people can/should publish (including, different
opinions on a topic, but also including pure SPAM). As on the classic Web,
it is a job of the information/data consumer to figure out which data it
wants to believe and use (definition of information quality = usefulness of
information, which is a subjective thing). 

Thus it also does not make sense to discuss the "objective quality" of the
data that should be included in the LOD cloud (objective quality just does
not exist), and it makes much more sense to discuss the major issues that we
are still having in regard to compliance with publishing best practices.

> Anja has pointed to a wealth of openly
> available numbers (no pun intended), that have not been discussed at all.
For
> example, only 7.5% of the data source provide a mapping of "proprietary
> vocabulary terms" to "other vocabulary terms". For anyone building
> applications to work with LOD, this is a real problem.

Yes, this is also the figure that scared me most.

> but in order to figure out what really needs to be done, and
> how the criteria for good data on the Semantic Web need to look like, we
> need to get back to Anja's original questions. I think that is a question
we
> may try to tackle in Shanghai in some form, I at least would find that an
> interesting topic.

Same with me. 
Shanghai was also the reason for the timing of the post.

Cheers,

Chris

> -----Original Message-----
> From: semantic-web-requ...@w3.org [mailto:semantic-web-requ...@w3.org]
> On Behalf Of Denny Vrandecic
> Sent: Friday, 22 October 2010 08:44
> To: Martin Hepp
> Cc: Kingsley Idehen; public-lod; Enrico Motta; Chris Bizer; Thomas Steiner;
> Semantic Web; Anja Jentzsch; semanticweb; Giovanni Tummarello; Mathieu
> d'Aquin
> Subject: Re: ANN: LOD Cloud - Statistics and compliance with best
> practices
> 
> I usually dislike to comment on such discussions, as I don't find them
> particularly productive,  but 1) since the number of people pointing me to
> this thread is growing, 2) it contains some wrong statements, and 3) I
feel
> that this thread has been hijacked from a topic that I consider productive
and
> important, I hope you won't mind me giving a comment. I wanted to keep it
> brief, but I failed.
> 
> Let's start with the wrong statements:
> 
> First, although I take responsibility as a co-creator for Linked Open
Numbers,
> I surely cannot take full credit for it. The dataset was a shared effort
by a
> number of people in Karlsruhe over a few days, and thus calling the whole
> thing "Denny's numbers dataset" is simply wrong due to the effort spent by
> my colleagues on it. It is fine to call it "Karlsruhe's numbers dataset"
or simply
> Linked Open Numbers, but providing me with the sole attribution is too
> much of an honor.
> 
> Second, although it is claimed that Linked Open Numbers are "by design and
> known to everybody in the core community, not data but noise", being one
> of the co-designers of the system I have to disagree. It is "noise by
design".
> One of my motivations for LON was to raise a few points for d

Re: ANN: LOD Cloud - Statistics and compliance with best practices

2010-10-21 Thread Chris Bizer
Hi Martin, Thomas and Kingsley,

> >> First, I think it is pretty funny that you list Denny's April's fool
dataset of
> creating triples for numbers as an acceptable part of the cloud,

Why? 

As I said, we are including all datasets which fulfill the minimal technical
requirements.
As Denny's dataset does this, it is included. The same would of course be
true for BestBuy and other GoodRelations datasets if they were
connected by RDF links to other datasets in the cloud.

> >> The fundamental mistake of what you say is that linked open e-commerce
> data is not "a dataset" but a wealth of smaller datasets. Asking me to
create
> CKAN entries for each store or business in the world that provides
> GoodRelations data is as if Google was asking any site owner in the world
to
> register his or her site manually via CKAN.
> >>
> >> That is 1990s style and does not have anything to do with a "Web" of
data.

I agree with you that it would be much better if somebody set up a
crawler, properly crawled the Web of Data and then provided a catalog of all
datasets. As long as nobody does this, I think it is useful to have the
manually maintained CKAN catalog as a first step. 

An interesting step in this direction is the profiling work done by Felix
Naumann's group for the BTC dataset. See
http://www.cs.vu.nl/~pmika/swc/submissions/swc2010_submission_3.pdf

> >> Is HTML + RDFa with hash fragments, available via HTTP GET
> "dereferencable" for you? E.g.

Absolutely!

> >> To be frank, I think the bubbles diagram fundamentally misses the point
in
> the sense that the power of linked data is in integrating a huge amount of
> small, specific data sources, and not in linking a manually maintained
blend of
> ca. 100 monolithic datasets.

Valid point.  I agree with you that the power of the Linked Data
architecture is that it provides for building a single global dataspace
which of course may contain small as well as big data sources.
 
The  goal of the LOD diagram is not to visualize any small chunk of RDF on
the Web, as this would be impossible for obvious reasons - including the
size of your screen. 

We restrict the diagram to bigger datasets, hoping that these may be
especially relevant to data consumers.

Of course, you may disagree with this restriction.

From Thomas:
> > How about handling GoodRelations the same way as FOAF, representing it
> > as a somewhat existing bubble without exactly specifying where it
> > links to and from where inbound links come from

We also don't do this for FOAF anymore in the new version of the diagram.

From Thomas:
> > In the end, the idea of a Web catalogue was mostly abandoned at some
> > point due to being unmanageable, maybe the same happens to the Web
> > /data/ "catalogue", aka. LOD cloud (the metaphor doesn't work
> > perfectly, but you get the point).

Yes. But I personally think that the Yahoo catalog was rather useful in the
early days of the Web.

In the same way, I think that the CKAN catalog is rather useful in the
current development stage of the Web of Data and I'm looking forward to the
time, when the Web of Data has grown to a point where such a catalog becomes
unmanageable.

But again: I agree that crawling the Web of Data and then deriving a dataset
catalog as well as meta-data about the datasets directly from the crawled
data would be clearly preferable and would also scale way better.

Thus: could somebody please start a crawler and build such a catalog?

As long as nobody does this, I will keep on using CKAN.

Cheers,

Chris
 

> -----Original Message-----
> From: semantic-web-requ...@w3.org [mailto:semantic-web-requ...@w3.org]
> On Behalf Of Kingsley Idehen
> Sent: Wednesday, 20 October 2010 20:30
> To: Thomas Steiner
> Cc: Martin Hepp; Chris Bizer; Semantic Web; public-lod@w3.org; Anja
> Jentzsch; semantic...@yahoogroups.com
> Subject: Re: ANN: LOD Cloud - Statistics and compliance with best
> practices
> 
> On 10/20/10 2:13 PM, Thomas Steiner wrote:
> > Hi all,
> >
> > How about handling GoodRelations the same way as FOAF, representing it
> > as a somewhat existing bubble without exactly specifying where it
> > links to and from where inbound links come from (on the road right
> > now, so can't check for sure whether it is already done this way)? The
> > individual datasets are too small to be entered manually into CKAN (+1
> > for Martin's arguments here).
> > In the end, the idea of a Web catalogue was mostly abandoned at some
> > point due to being unmanageable, maybe the same happens to the Web
> > /data/ "catalogue", aka. LOD cloud (the metaphor doesn't work
> > perfectly, but you get the poin

Semantic Web Challenge @ ISWC2010 (Submission deadline reminder)

2010-09-23 Thread Chris Bizer
Dear all,

this is a reminder that the submission deadline for the Semantic Web
Challenge 2010 is quickly approaching. The submission deadline is

Next Friday, October 1st, 2010, 12 a.m. (midnight) CET

The Semantic Web Challenge 2010 is collocated with the 9th International
Semantic Web Conference (ISWC2010) in Shanghai, China. As last year, the
challenge consists of two tracks: the Open Track and the Billion Triples
Track, which requires participants to make use of the data set that has been
crawled from the public Semantic Web. The data set consists of 3.2 billion
triples this year and can be downloaded from the challenge's website.   

The Call for Participation is found below. More information about the
Challenge is provided at

http://challenge.semanticweb.org/

We are looking forward to your submissions which, we hope, will again make
the Semantic Web Challenge one of the most exciting events at ISWC.

Best regards,

Diana Maynard and Chris Bizer



--

Call for Participation for the 

8th Semantic Web Challenge 

at the 9th International Semantic Web Conference ISWC 2010 
Shanghai, China, November 7-11, 2010 

http://challenge.semanticweb.org/

--

Introduction

Submissions are now invited for the 8th annual Semantic Web Challenge, the
premier event for demonstrating practical progress towards achieving the
vision of the Semantic Web. The central idea of the Semantic Web is to
extend the current human-readable Web by encoding some of the semantics of
resources in a machine-processable form. Moving beyond syntax opens the door
to more advanced applications and functionality on the Web. Computers will
be better able to search, process, integrate and present the content of
these resources in a meaningful, intelligent manner. 

As the core technological building blocks are now in place, the next
challenge is to demonstrate the benefits of semantic technologies by
developing integrated, easy to use applications that can provide new levels
of Web functionality for end users on the Web or within enterprise settings.
Applications submitted should give evidence of clear practical value that
goes above and beyond what is possible with conventional web technologies
alone. 

As in previous years, the Semantic Web Challenge 2010 will consist of two
tracks: the Open Track and the Billion Triples Track. The key difference
between the two tracks is that the Billion Triples Track requires the
participants to make use of the data set (consisting of 3.2 billion triples
this year) that has been crawled from the Web and is provided by the
organizers. The Open Track has no such restrictions. As before, the
Challenge is open to everyone from industry and academia. The authors of the
best applications will be awarded prizes and featured prominently at special
sessions during the conference. 

The overall goal of this event is to advance our understanding of how
Semantic Web technologies can be exploited to produce useful applications
for the Web. Semantic Web applications should integrate, combine, and deduce
information from various sources to assist users in performing specific
tasks. 

---
Challenge Criteria

The Challenge is defined in terms of minimum requirements and additional
desirable features that submissions should exhibit. The minimum requirements
and the additional desirable features are listed below per track. 

Open Track

Minimal requirements

1. The application has to be an end-user application, i.e. an application
that provides a practical value to general Web users or, if this is not the
case, at least to domain experts.
2. The information sources used
*   should be under diverse ownership or control,
*   should be heterogeneous (syntactically, structurally, and semantically), and
*   should contain substantial quantities of real world data (i.e. not toy
examples).
3. The meaning of data has to play a central role.
*   Meaning must be represented using Semantic Web technologies.
*   Data must be manipulated/processed in interesting ways to derive useful
information and
*   this semantic information processing has to play a central role in
achieving things that alternative technologies cannot do as well, or at all.

Additional Desirable Features 

In addition to the above minimum requirements, we note other desirable
features that will be used as criteria to evaluate submissions. 

1. The application provides an attractive and functional Web interface (for
human users) 
2. The application should be scalable (in terms of the amount of data used
and in terms of distributed components working together). Ideally, the
application should use all data that is currently published on the Semantic
Web. 
3. Rigorous evaluations have taken place that demonstrate the benefits of
semantic technologies, or validate the results obtained. 
4. Novelty, in applying semantic technology to a domain or task that

New LOD ESW wikipage about Data Licensing

2010-09-19 Thread Chris Bizer
Hi all,

as the Web of Linked Data is moving towards more serious applications,
putting published data under a proper license is becoming more and more
important. If no license is specified, people cannot use published data
within any serious applications.

Thus, I have started a new LOD ESW wiki page that collects information about

1. existing data licenses 
2. best practices on how to annotate Linked data with licensing
meta-information

The page is found at 

http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/DataLicensing#Data_Licensing

I have added the things I know about. But I'm sure there are many other
important resources about the topic.

So if you know about any, please add them to the wiki so that other people
can find the relevant pointers.

Thank you very much in advance.

Best,

Chris





Re: Next version of the LOD cloud diagram. Please provide input, so that your dataset is included.

2010-09-05 Thread Chris Bizer
Hi Alan,

> I have just spent some time evaluating one source and reported to you 
> the result. Perhaps you might act on this investment in time and thank 
> me for doing so. You might find that the result was myself and more 
> people doing such quality control.

Sorry that my reply yesterday might have been a bit too harsh.

I have looked up the CAS license (http://www.cas.org/legal/infopolicy.html)
and added a reference to the description of the CAS dataset at

http://ckan.net/package/bio2rdf-cas

Please also note that CKAN provides a rating function for the datasets and
also provides for commenting and discussing the datasets.

Maybe people could use these features as a start to collect quality-related
meta-information about the datasets.

CKAN also provides a link to the http://www.isitopendata.org/ service, which
might be used for license inquiries.

I agree with you that the quality of Linked Data published on the Web is
crucial, but we also have to take into account that much of the data in the
LOD cloud is currently still published by research projects in order to
demonstrate the technologies.

As the Web of Data is evolving and more and more actual owners of the
datasets start to provide them as Linked Data, I hope that the quality will
also increase and the datasets will be kept current. Encouraging
developments in this direction are currently happening in the libraries,
eGovernment, and eCommerce domains. 

On the other hand, the Web is an open system and we will thus always see
people publishing low-quality, wrong and misleading data. Google handles
this fact rather successfully using PageRank. As the Web of Data provides
more structure than the classic Web, I think we might even be able to apply
more sophisticated data-quality assessment heuristics to decide which data
we want to use in our applications and which to ignore. Some of these
methods are listed in [1].

Best, 

Chris 

[1] Christian Bizer, Richard Cyganiak: Quality-driven information filtering
using the WIQA policy framework. Journal of Web Semantics: Science, Services
and Agents on the World Wide Web, Volume 7, Issue 1, January 2009, Pages
1-10.
http://dx.doi.org/10.1016/j.websem.2008.02.005


-----Original Message-----
From: Alan Ruttenberg [mailto:alanruttenb...@gmail.com]
Sent: Saturday, 4 September 2010 22:20
To: Chris Bizer
Cc: Anja Jentzsch; public-lod@w3.org; Leigh Dodds; Jonathan Gray
Subject: Re: Next version of the LOD cloud diagram. Please provide input, so
that your dataset is included.

On Sat, Sep 4, 2010 at 3:43 PM, Chris Bizer  wrote:
> So rather than to criticize the work that other people do on collecting
> meta-information about the datasets in the LOD cloud

Did you read what I wrote? I made no comment on the adequacy of
metainformation. In fact I *used* that metainformation to point out
that the data source in question did not satisfy the "open" provision
of linked *open* data. In addition I criticized the *inclusion* of the
data set in the *lod cloud diagram* because of this lack of openness
and because the actual content of that resource didn't resemble any
data in the resource that it was derived from (a registry of
information about chemical compounds), suggesting that it would hurt
the LOD effort as inclusion would be a kind of "false advertising".

-Alan




Re: Next version of the LOD cloud diagram. Please provide input, so that your dataset is included.

2010-09-05 Thread Chris Bizer
Hi Tim,

> Swoogle has such metadata for the documents it has indexed.  
> Perhaps we can extract and publish statistics for the key LOD datasets.

This would be great!

Chris

-----Original Message-----
From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org]
On Behalf Of Tim Finin
Sent: Saturday, 4 September 2010 22:19
To: public-lod@w3.org
Subject: Re: Next version of the LOD cloud diagram. Please provide
input, so that your dataset is included.

On 9/4/10 4:01 PM, Chris Bizer wrote:
> But I guess there are also limits to the meta-data that people can gather
> manually. So the best would be if somebody would run a crawler and extract
> meta-data about vocabulary usage and other usage pattern directly from the
> LOD datasets. Nobody has done this yet but hopefully somebody will soon
> start doing this.

Swoogle has such metadata for the documents it has indexed.  Perhaps we
can extract and publish statistics for the key LOD datasets.





Re: Next version of the LOD cloud diagram. Please provide input, so that your dataset is included.

2010-09-04 Thread Chris Bizer
Hi Dan,

> 
> This is great! Glad to see this being updated :)
>
> One thing I would love in the next revision is for FOAF to also be
> presented as a vocabulary, rather than as if it were itself a distinct
> dataset. While there are databases that expose as FOAF (LiveJournal
> etc.), and also a reasonable number of independently published 'FOAF
> files', the technical core of FOAF is really the vocabulary and the
> habit of linking things together. Having a FOAF 'blob' is great and
> all, but it doesn't help people understand that FOAF is used as a
> vocabulary by various of the other blobs too. 

Yes, we also felt that having a blob is a bit misleading and were thus
thinking about using a cloud icon for FOAF and SIOC to reflect the fact that
the blob actually consists of many separate files on many different servers.


Besides, we have started to tag datasets in CKAN with the vocabularies that
they use. So, ideally all datasets that use FOAF should be tagged with
format-foaf, and people can use this data via the CKAN API to draw any
visualization of the LOD cloud they like.
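
For example, a script along the following lines could list all datasets
carrying such a tag via the CKAN API. The API route, parameters and response
shape used here are assumptions, so please check the API documentation of the
CKAN instance you query.

    # Minimal sketch: list CKAN packages carrying a vocabulary tag such as format-foaf.
    # The API route, parameters and response shape are assumptions.
    import json
    import urllib.parse
    import urllib.request

    CKAN_SEARCH = "http://ckan.net/api/search/package"   # assumed search endpoint
    params = urllib.parse.urlencode({"tags": "format-foaf", "limit": 100})

    with urllib.request.urlopen(CKAN_SEARCH + "?" + params) as response:
        result = json.load(response)

    # Assumed response shape: a JSON object with a "results" list of package names.
    for name in result.get("results", []):
        print(name)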

> And beyond FOAF, I'm
> wondering how we can visually represent the use of eg. Music Ontology,
> or Dublin Core, or Creative Commons vocabularies across different
> regions of the cloud. Maybe (later :) someone could make a view where
> each blob is a pie-chart showing which vocabularies it uses?

Interesting idea. I would also love to see this.

Maybe we can give it a try; otherwise, of course, everybody is invited to get
the data from the CKAN API and visualize it in any way they find
interesting.

> As a vocabulary manager, it is pretty hard to understand the costs and
> benefits of possible changes to a widely deployed RDF vocabulary. I'm
> sure I'm not alone in this; Tom (cc:'d) I expect would vouch the same
> regarding the Dublin Core terms. So if there could be some view of the
> new cloud diagram that showed us which blobs (er, datasets) used which
> vocabulary (and which terms), that would be really wonderful. On the
> Dublin Core side, it would be fascinating to see which datasets are
> using http://purl.org/dc/elements/1.1/ and which are using
> http://purl.org/dc/terms/ (and which are using both). Similarly with
> FOAF, I'd like to understand common deployment patterns better.  I
> expect other vocab managers and dataset publishersare in a similar
> situation, and would appreciate a map of the wider territory, so they
> know how to fit in with trends and conventions, or what missing pieces
> of vocabulary might need more work...

Yes, having data about usage patterns would be great. 

But I guess there are also limits to the meta-data that people can gather
manually. So the best would be if somebody ran a crawler and extracted
meta-data about vocabulary usage and other usage patterns directly from the
LOD datasets. Nobody has done this yet, but hopefully somebody will soon
start doing this.
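
Leaving the crawling itself aside, the extraction step could start as simply
as counting which namespaces occur in predicate position in an N-Triples
dump. The file name and the namespace-splitting heuristic below are
illustrative assumptions.

    # Minimal sketch: count vocabulary (predicate namespace) usage in an N-Triples dump.
    # The dump file name and the namespace heuristic are illustrative assumptions.
    import collections
    import re

    # Match a URI or blank node subject followed by the predicate URI.
    triple_pattern = re.compile(r'^\s*(?:<[^>]*>|_:\S+)\s+<([^>]*)>')
    counts = collections.Counter()

    with open("crawl-dump.nt", encoding="utf-8") as dump:   # assumed input file
        for line in dump:
            match = triple_pattern.match(line)
            if not match:
                continue
            predicate = match.group(1)
            # Heuristic: the namespace is everything up to the last '#' or '/'.
            cut = max(predicate.rfind("#"), predicate.rfind("/"))
            counts[predicate[:cut + 1]] += 1

    for namespace, count in counts.most_common(20):
        print(count, namespace)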

Cheers,

Chris


> Thanks for any thoughts,
>
> Dan




Re: Next version of the LOD cloud diagram. Please provide input, so that your dataset is included.

2010-09-04 Thread Chris Bizer
Hi Alan,

> I think you should consider having some better quality control

and

> Yes, unfortunate. A similar audit should be done for the sets 
> that are named on the LOD (also "open") cloud.

LOD is an open community effort to which everybody can contribute.

So rather than criticizing the work that other people do on collecting
meta-information about the datasets in the LOD cloud, you are more than
welcome to quality-control 20 billion triples.

Best,

Chris
  

-----Original Message-----
From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org]
On Behalf Of Alan Ruttenberg
Sent: Saturday, 4 September 2010 18:47
To: Anja Jentzsch
Cc: public-lod@w3.org; Leigh Dodds; Chris Bizer; Jonathan Gray
Subject: Re: Next version of the LOD cloud diagram. Please provide input, so
that your dataset is included.

On Sat, Sep 4, 2010 at 8:35 AM, Anja Jentzsch  wrote:
> Hi Alan,
>
> CKAN is a repository for all kinds of datasets. Even if datasets are not
open or only for non-commercial use, they can be listed and information on
licensing can be noted (Other - Closed, e.g.). This is still a valuable
information.

Hello Anja,

My comment was not a commentary on CKAN; it was a comment on a specific
data set and its relation to the LOD cloud - please have a closer
read.

However, now that you mention it, the opening line on the CKAN website
says: "CKAN is a registry of open data and content packages." The
words "open data and content" are linked to
http://www.opendefinition.org/ which explains what open means (it does
not mean closed).

So one of two things should be fixed with CKAN - either the statement
on the front page should be changed to make it clear that it also
registers closed data, or the closed data entries should be expunged.

> If no license is specified or we did not find the license information,
CKAN lists the datasets as "not open".

Same comment re: having CKAN present a consistent view of what it does.

> Leigh Dodds had a closer look at the licenses of the LOD datasets some
time ago [1]. It is sad but true that only about 23% of all datasets come
along with a clearly defined license.

Yes, unfortunate. A similar audit should be done for the sets that are
named on the LOD (also "open") cloud.

> Hopefully data publishers will more clearly state the licenses along with
their datasets to encourage people to use their data.

Here we agree, and part of my work is doing exactly that.

Regards,
Alan

>
> Cheers,
> Anja
>
> [1]
http://iswc2009.semanticweb.org/wiki/index.php/ISWC_2009_Tutorials/Legal_and_Social_Frameworks_for_Sharing_Data_on_the_Web#Slides
>
> On 03.09.2010 20:43, Alan Ruttenberg wrote:
>> I think you should consider having some better quality control and
>> standards around this, as I feel it is somewhat misleading. For
>> example (and this is one of several), consider CAS which is named in
>> the diagram. I don't consider the contents of that set to include any
>> data. Here is an example:
>>
>> http://cu.bio2rdf.org/cas:921-60-8
>>
>> Subject
>> http://bio2rdf.org/cas:921-60-8
>>
>> Predicate     Object
>> http://bio2rdf.org/bio2rdf_resource:url      http://bio2rdf.org/html/cas:921-60-8
>> (Non-RDF URI)
>> http://www.w3.org/2002/07/owl#sameAs  http://cas.bio2rdf.org/cas:921-60-8
>> (External link)
>>
>> This is content free.
>>
>> In addition, the documentation of that set says it is not open:
>> http://ckan.net/package/bio2rdf-cas
>>
>> Although this URI might be used to link somehow, in my opinion it is
>> misleading to call this collection a linked open *data* set. Further,
>> including it will do damage to LOD reputation if anyone actually looks
>> past that diagram to see what is really there.
>>
>> Sincerely,
>>
>> Alan Ruttenberg
>>
>>
>> On Fri, Sep 3, 2010 at 2:00 PM, Jonathan Gray
 wrote:
>>> FYI, we blogged this here:
>>>
>>>
>>> http://blog.okfn.org/2010/09/03/next-version-of-the-linked-open-data-cloud-based-on-ckan/
>>>
>>> All are, of course, most welcome to join ckan-discuss list if there
>>> are any specific suggestions for features we should add:
>>>
>>>  http://lists.okfn.org/mailman/listinfo/ckan-discuss
>>>
>>> We will be continuing to develop CKAN's support for LOD/semantic web
>>> technologies over the coming months (and years)! ;-)
>>>
>>> On Fri, Sep 3, 2010 at 5:03 PM, Leigh Dodds
 wrote:
>>>> Hi Chris, Anja
>>>>
>>>> On 3 September 2010 15:17, Chris Bizer  wrote:
>>>>> In theory, the list is automatically updated with data from CKAN.
>>>>>
>>>

AW: Next version of the LOD cloud diagram. Please provide input, so that your dataset is included.

2010-09-03 Thread Chris Bizer
Hi Leigh,

In theory, the list is automatically updated with data from CKAN.

But as the CKAN server is overloaded today, the list is currently corrupted
and only shows a fraction of the datasets.

We hope that the issue will be solved within the next few hours!

Cheers,

Chris
 

> -Original Message-
> From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On
> behalf of Leigh Dodds
> Sent: Friday, September 3, 2010 16:10
> To: Anja Jentzsch
> Cc: public-lod@w3.org
> Subject: Re: Next version of the LOD cloud diagram. Please provide input,
> so that your dataset is included.
> 
> Hi,
> 
> > The list of datasets about which we have already collected information
> > is be found here:
> >
> > http://www4.wiwiss.fu-berlin.de/lodcloud/
> 
> Is that page manually maintained or is it derived from the data in CKAN?
> 
> For example I've just added the missing data to my NASA dataset,
> including notes on how it links to dbpedia. This should ensure there are
> enough links to get it onto the diagram. However I'm not seeing the
> page update, so I assume it's manual.
> 
> Just want to be clear on the process, i.e. will all CKAN updates
> automatically get rolled in?
> 
> Cheers,
> 
> L.
> --
> Leigh Dodds
> Programme Manager, Talis Platform
> Talis
> leigh.do...@talis.com
> http://www.talis.com




Re: New LOD Cloud Updates

2010-09-03 Thread Chris Bizer
Hi Kingsley,

 

> need to think more in terms of Galaxies re. Linked Open Data circa. 2010
:-)

 

Sure, just updated the tag list according to your proposals.

 

http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation#CKAN_tags

 

Hope that we find enough colors for this when drawing the cloud. Otherwise,
or if some tags are not used, we will merge categories again.

 

Cheers,

 

Chris

 

 

From: Kingsley Idehen [mailto:kide...@openlinksw.com] 
Sent: Friday, September 3, 2010 14:06
To: Chris Bizer
Cc: bio2...@googlegroups.com; public-lod@w3.org
Subject: Re: AW: New LOD Cloud Updates

 

On 9/3/10 2:59 AM, Chris Bizer wrote: 

Hi Egon,
 
> How are data sets divided over the various categories?

 

We are tagging the datasets with the following tags on CKAN in order to
assign them the categories:

 

*   media 
*   geographic 
*   lifesciences 
*   publications 
*   government 
*   usergeneratedcontent 
*   crossdomain 

 

This has been done for some of the datasets, but not all.

 

So, if you would like your dataset to appear in a specific category and have a
specific color in the upcoming cloud diagram, please assign the
corresponding tag on CKAN to your dataset.


What about eCommerce? ICECat, ProductDB, BestBuy, and other components of
the GoodRelations solar system. We need to think more in terms of Galaxies
re. Linked Open Data circa. 2010 :-)

I would also suggest "socialweb" for FOAF solar system. Ditto "ontologies"
or "dictionaries" or "schemas" for the likes of: OpenCYC, SUMO, UMBEL, Yago,
FAO etc.. there is a lot of linkage occurring in the TBox realm that's
really valuable to LOD. Same thing for thesauri, which covers the burgeoning
SKOS solar system.

Kingsley



 

Cheers,

 

Chris

 

 

From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On behalf
of Kingsley Idehen
Sent: Thursday, September 2, 2010 23:32
To: bio2...@googlegroups.com; public-lod@w3.org
Subject: Re: New LOD Cloud Updates

 

On 9/2/10 5:05 PM, Egon Willighagen wrote: 

Hi Kingsley,
 
On Thu, Sep 2, 2010 at 10:06 PM, Kingsley Idehen
<mailto:kide...@openlinksw.com>  wrote:

Note: http://www4.wiwiss.fu-berlin.de/lodcloud/

 
Nice page!
 
I see that I have some links to make, though I was already aware of
that :) But thanx for making it painfully clear :)
 
How are data sets divided over the various categories?
 
Egon
 

Anja,

Please pick up the issue above. 

Note, that PDB is no less than 13 Billion Triples. The Bio2RDF project folks
have been notified (hence reverse cc..) about this LOD cloud update.

Also the GoodRelations based LOC cloud is quite massive and although it
doesn't link specifically to DBpedia it does have cross links between ICECat
and Productdb.org (for instance). 

Also note that URIBurner's URIs provide links from LOC to many places in the
LOD cloud (esp. Geonames, OpenCalais, AlchemyAPI, and Zemanta). Ted and
others at OpenLink will deal with these numbers anyway.





-- 
 
Regards,
 
Kingsley Idehen   
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
<http://www.openlinksw.com/blog/%7Ekidehen> 
Twitter/Identi.ca: kidehen 
 
 
 
 






-- 
 
Regards,
 
Kingsley Idehen 
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 
 
 
 
 


Re: Next version of the LOD cloud diagram. Please provide input, so that your dataset is included.

2010-09-03 Thread Chris Bizer
Hi Ted,

> But please ... this time, will there be any effort to make visible
> the clustering within the LOD Cloud?  This seems to me one of the
> best ways to encourage data set publishers to link out -- and that
> *is* important to grow the utility of the *overall* data set.

Hmm, yes, maybe we should have a nice tidy-looking version for slides and, in
addition, a second, more educational version ;-)

A good thing about the CKAN collection is that everybody can access the data
via the API and then convert it to an SVG graphic showing any aspect of the
data people are interested in. For instance, I would also like to see an
educational version of the cloud showing which datasets are properly licensed
and which are not.
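
As a toy example of such a custom view (the dataset list below is made up,
and this is in no way the official drawing tool), one could render datasets
as SVG bubbles coloured by whether a license is declared:

import math

# Hypothetical input: (name, number of triples, has_license)
datasets = [
    ("dbpedia", 479000000, True),
    ("geonames", 93000000, False),
    ("musicbrainz", 60000000, True),
]

def bubble_svg(items, width=600, height=200):
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d">'
             % (width, height)]
    x = 100
    for name, triples, licensed in items:
        r = max(10, math.log10(triples) * 6)   # radius scaled by dataset size
        fill = "#8c8" if licensed else "#c88"  # green = licensed, red = not
        parts.append('<circle cx="%d" cy="%d" r="%.0f" fill="%s"/>'
                     % (x, height // 2, r, fill))
        parts.append('<text x="%d" y="%d" text-anchor="middle" font-size="10">%s</text>'
                     % (x, height // 2 + 4, name))
        x += 180
    parts.append("</svg>")
    return "\n".join(parts)

if __name__ == "__main__":
    with open("lod-bubbles.svg", "w") as f:
        f.write(bubble_svg(datasets))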

Best,

Chris


-Original Message-
From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On behalf
of Ted Thibodeau Jr
Sent: Thursday, September 2, 2010 22:42
To: Anja Jentzsch
Cc: public-lod@w3.org
Subject: Re: Next version of the LOD cloud diagram. Please provide input, so
that your dataset is included.


On Sep 2, 2010, at 02:10 PM, Anja Jentzsch wrote:

> Hi all,
> 
> we are in the process of drawing the next version of the LOD cloud
diagram. This time it is likely to contain around 180 datasets altogether
having a size of around 20 billion RDF triples.



Cool!

But please ... this time, will there be any effort to make visible
the clustering within the LOD Cloud?  This seems to me one of the
best ways to encourage data set publishers to link out -- and that
*is* important to grow the utility of the *overall* data set.

To date, the only graphic I've seen which shows just how little
overall interconnectedness there is (was) in the LOD Cloud is my 
own ... which someone has long since removed from display on the 
ESW wiki page with the Cloud-like graphic, and which is certainly 
well outdated, but which is still found here --

   http://virtuoso.openlinksw.com/images/dbpedia-lod-cloud.html

Now, granted, mine didn't make the bubbles into a pretty cloud-like 
overall shape -- but it did reveal that most data sets (e.g., flickr
wrappr, Magnatune, Audioscrobbler) were only connecting to one or 
two others -- and I think that's important to see, just as it is 
to see which data sets several or many others connected to (e.g., 
DBpedia, Geonames, Musicbrainz), and which sets connected out to 
several or many others (e.g., Revyu, Linked MDB, the not-really-a-
data-set cloud of FOAF Profiles)...

Be seeing you,

Ted




--
A: Yes.  http://www.guckes.net/faq/attribution.html
| Q: Are you sure?
| | A: Because it reverses the logical flow of conversation.
| | | Q: Why is top posting frowned upon?

Ted Thibodeau, Jr.   //   voice +1-781-273-0900 x32
Evangelism & Support //mailto:tthibod...@openlinksw.com
 //  http://twitter.com/TallTed
OpenLink Software, Inc.  //  http://www.openlinksw.com/
10 Burlington Mall Road, Suite 265, Burlington MA 01803
 http://www.openlinksw.com/weblogs/uda/
OpenLink Blogs  http://www.openlinksw.com/weblogs/virtuoso/
   http://www.openlinksw.com/blog/~kidehen/
Universal Data Access and Virtual Database Technology Providers








AW: New LOD Cloud Updates

2010-09-03 Thread Chris Bizer
Hi Egon,
 
> How are data sets divided over the various categories?

 

We are tagging the datasets with the following tags on CKAN in order to
assign them the categories:

 

*   media 
*   geographic 
*   lifesciences 
*   publications 
*   government 
*   usergeneratedcontent 
*   crossdomain 

 

This has been done for some of the datasets, but not all.

 

So, if you would like your dataset to appear in a specific category and have a
specific color in the upcoming cloud diagram, please assign the
corresponding tag on CKAN to your dataset.

 

Cheers,

 

Chris

 

 

From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On behalf
of Kingsley Idehen
Sent: Thursday, September 2, 2010 23:32
To: bio2...@googlegroups.com; public-lod@w3.org
Subject: Re: New LOD Cloud Updates

 

On 9/2/10 5:05 PM, Egon Willighagen wrote: 

Hi Kingsley,
 
On Thu, Sep 2, 2010 at 10:06 PM, Kingsley Idehen
  wrote:

Note: http://www4.wiwiss.fu-berlin.de/lodcloud/

 
Nice page!
 
I see that I have some links to make, though I was already aware of
that :) But thanx for making it painfully clear :)
 
How are data sets divided over the various categories?
 
Egon
 

Anja,

Please pick up the issue above. 

Note, that PDB is no less than 13 Billion Triples. The Bio2RDF project folks
have been notified (hence reverse cc..) about this LOD cloud update.

Also the GoodRelations based LOC cloud is quite massive and although it
doesn't link specifically to DBpedia it does have cross links between ICECat
and Productdb.org (for instance). 

Also note that URIBurner's URIs provide links from LOC to many places in the
LOD cloud (esp. Geonames, OpenCalais, AlchemyAPI, and Zemanta). Ted and
others at OpenLink will deal with these numbers anyway.




-- 
 
Regards,
 
Kingsley Idehen   
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 
 
 
 
 


2nd CfP: Semantic Web Challenge @ ISWC 2010

2010-08-09 Thread Chris Bizer
Dear all,

this is a reminder that the submission deadline for the Semantic Web
Challenge 2010 is slowly approaching. The submission deadline is

October 1st, 2010, 12 a.m. (midnight) CET

The Semantic Web Challenge 2010 is collocated with the 9th International
Semantic Web Conference (ISWC2010) in Shanghai, China. As last year, the
challenge consists of two tracks: the Open Track and the Billion Triples
Track, which requires participants to make use of the data set that has been
crawled from the public Semantic Web. The data set consists of 3.2 billion
triples this year and can be downloaded from the challenge's website.   
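
For orientation, working with a data set of this size usually means
streaming. Here is a minimal Python sketch (assuming the dump is distributed
as gzipped N-Quads/N-Triples chunk files, which is an assumption about the
packaging) that counts statements without loading anything into memory:

import glob
import gzip

def count_statements(pattern="btc-2010-chunk-*.gz"):
    # The file name pattern is hypothetical; adjust it to the actual dump.
    total = 0
    for path in glob.glob(pattern):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    total += 1
    return total

if __name__ == "__main__":
    print("statements:", count_statements())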

The Call for Participation is found below. More information about the
Challenge is provided at

http://challenge.semanticweb.org/

We are looking forward to your submissions, which we hope will make the
Semantic Web Challenge again one of the most exciting events at ISWC.

Best regards,

Diana and Chris



--

Call for Participation for the 

8th Semantic Web Challenge 

at the 9th International Semantic Web Conference ISWC 2010 
Shanghai, China, November 7-11, 2010 

http://challenge.semanticweb.org/

--

Introduction

Submissions are now invited for the 8th annual Semantic Web Challenge, the
premier event for demonstrating practical progress towards achieving the
vision of the Semantic Web. The central idea of the Semantic Web is to
extend the current human-readable Web by encoding some of the semantics of
resources in a machine-processable form. Moving beyond syntax opens the door
to more advanced applications and functionality on the Web. Computers will
be better able to search, process, integrate and present the content of
these resources in a meaningful, intelligent manner. 

As the core technological building blocks are now in place, the next
challenge is to demonstrate the benefits of semantic technologies by
developing integrated, easy to use applications that can provide new levels
of Web functionality for end users on the Web or within enterprise settings.
Applications submitted should give evidence of clear practical value that
goes above and beyond what is possible with conventional web technologies
alone. 

As in previous years, the Semantic Web Challenge 2010 will consist of two
tracks: the Open Track and the Billion Triples Track. The key difference
between the two tracks is that the Billion Triples Track requires the
participants to make use of the data set (consisting of 3.2 billion triples
this year) that has been crawled from the Web and is provided by the
organizers. The Open Track has no such restrictions. As before, the
Challenge is open to everyone from industry and academia. The authors of the
best applications will be awarded prizes and featured prominently at special
sessions during the conference. 

The overall goal of this event is to advance our understanding of how
Semantic Web technologies can be exploited to produce useful applications
for the Web. Semantic Web applications should integrate, combine, and deduce
information from various sources to assist users in performing specific
tasks. 

---
Challenge Criteria

The Challenge is defined in terms of minimum requirements and additional
desirable features that submissions should exhibit. The minimum requirements
and the additional desirable features are listed below per track. 

Open Track

Minimal requirements

1. The application has to be an end-user application, i.e. an application
that provides a practical value to general Web users or, if this is not the
case, at least to domain experts. 
2. The information sources used should be under diverse ownership or control,
should be heterogeneous (syntactically, structurally, and semantically), and
should contain substantial quantities of real world data (i.e. not toy
examples). The meaning of data has to play a central role. 
3. Meaning must be represented using Semantic Web technologies. 
4. Data must be manipulated/processed in interesting ways to derive useful
information and this semantic information processing has to play a central
role in achieving things that alternative technologies cannot do as well, or
at all; 

Additional Desirable Features 

In addition to the above minimum requirements, we note other desirable
features that will be used as criteria to evaluate submissions. 

1. The application provides an attractive and functional Web interface (for
human users) 
2. The application should be scalable (in terms of the amount of data used
and in terms of distributed components working together). Ideally, the
application should use all data that is currently published on the Semantic
Web. 
3. Rigorous evaluations have taken place that demonstrate the benefits of
semantic technologies, or validate the results obtained. 
4. Novelty, in applying semantic technology to a domain or task that has
not been considered before 

Open Position: Semantic Web Developer (JAVA, PHP, JS, 6 months fulltime in Berlin, m/f)

2010-07-02 Thread Chris Bizer
Hi all,

The MediaEvent Services GmbH, one of our spin-offs, is looking for a
Semantic Web Developer (m/f) in Berlin for a joint project with the
Web-based Systems Group at Freie Unviersität Berlin and ontoprise GmbH,
Karlsruhe, within the scope of Vulcan Inc.'s Project Halo.

The SMW-LDE project extends the SMW+ Semantic Enterprise Wiki with a Linked
Data import workflow, enabling the use of data from the global Linked Data
Web within SMW+. The technology will be initially deployed in an integrated
portal for biomedical data.

Your tasks:
- Integration and extension of Java and PHP components that perform import,
transformation and identity resolution of Linked Data
- Development of JavaScript-based editors
- Import and interlinking of biomedical datasets

Your profile:
- Solid Java and PHP skills; optionally Scala
- Prior experience with JavaScript and common frameworks is desirable
- Diploma in computer science, related apprenticeship or several years of
work experience
- Prior biomedical knowledge is a plus

Experience with Semantic Web and Linked Data technologies is not imperative,
as we will provide in-depth training.
The MediaEvent Services GmbH offers an open, professional and informal
working environment, experienced colleagues and team play. Candidates must
be reliable and possess good English skills.

The position is initially limited until 31.01.2011; with high chances of
extension.

Project partners:

Founded in 1999, MediaEvent Services develops innovative systems in the
digital media field for customers such as Leica Microsystems and Hugo Boss.

The Web-based Systems Group at Freie Universität Berlin explores technical
and economic questions concerning the development of global, decentralized
information environments. Our current research focus are Linked Data
technologies for extending the World Wide Web with a global data commons. 

ontoprise is a leading independent software vendor for industry-proven
Semantic Web infrastructure technologies and products used to support
dynamic semantic information integration and information management
processes at the enterprise level.

Vulcan Inc. creates and advances a variety of world-class endeavors and high
impact initiatives that change and improve the way we live, learn, do
business, and experience the world.

How to apply:

Please send your applications in PDF format to c.bec...@mes-info.de,
focusing on previous experience in the fields.

Contact
Christian Becker
MediaEvent Services GmbH & Co. KG
Stendaler Straße 4
10559 Berlin
Tel. +49 (6441) 870 87-22
eMail c.bec...@mes-info.de
Web http://mediaeventservices.com

Have a nice weekend,

Chris

--
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
ch...@bizer.de






ANN: Silk - Link Discovery Framework Version 2.0 released.

2010-07-02 Thread Chris Bizer
Hi all,

we are happy to announce the second version of the Silk - Link Discovery
Framework for the Web of Data.

The Web of Data is built upon two simple ideas: Employ the RDF data model to
publish structured data on the Web and to set explicit RDF links between
entities within different data sources. While there are more and more tools
available for publishing Linked Data on the Web, there is still a lack of
scalable tools that support data publishers in setting RDF links to other
data sources on the Web. 

As the Web of Data is growing quickly, we thought the community would be in
need of an easy-to-use link generation tool which scales to the billion
triples use cases starting to arise on the Web of Data.

Therefore, we ported the Silk framework from Python to Scala and
redesigned the internal data processing workflow so that Silk can handle
larger linking tasks. 

The new Silk 2.0 framework is about 20 times faster than the original
Python-based Silk 0.2 framework. On a Core2 Duo machine with 2GB RAM, Silk
2.0 computes around 180 million comparisons per hour. 

Other new features of Silk 2.0 include:

1. A blocking directive which allows users to reduce the number of
comparisons at the cost of recall, if necessary (a generic sketch of the idea
follows below the list). 
2. Support of the OAEI Alignment format as additional output format.
3. Slight redesign of the Silk-LSL syntax in order to make the language
easier to use. 
4. The Silk-LSL parser produces better error messages which makes debugging
linking specifications less cumbersome. 
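
To make the blocking idea in point 1 concrete, here is a generic Python
sketch. It is neither Silk's actual code nor Silk-LSL syntax, and the data
and similarity threshold are made up: only entities that share a blocking key
are compared, which cuts the number of comparisons at the cost of some
recall.

from collections import defaultdict
from difflib import SequenceMatcher

def blocking_key(label):
    # Crude key: first three lowercase characters of the label.
    return label.strip().lower()[:3]

def candidate_pairs(source, target):
    # Index the target entities by blocking key, then only compare entities
    # from the source against targets that fall into the same block.
    index = defaultdict(list)
    for entity, label in target:
        index[blocking_key(label)].append((entity, label))
    for entity, label in source:
        for other, other_label in index.get(blocking_key(label), []):
            yield (entity, label), (other, other_label)

def link(source, target, threshold=0.9):
    for (s, s_label), (t, t_label) in candidate_pairs(source, target):
        if SequenceMatcher(None, s_label, t_label).ratio() >= threshold:
            yield s, "http://www.w3.org/2002/07/owl#sameAs", t

if __name__ == "__main__":
    a = [("http://example.org/a/Berlin", "Berlin")]          # hypothetical data
    b = [("http://example.org/b/city-berlin", "Berlin"),
         ("http://example.org/b/city-bermuda", "Bermuda")]
    for triple in link(a, b):
        print(triple)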

We will keep on improving Silk and plan to have the next release in August.
This release will include

1.  Parallelization based on Hadoop. This will enable Silk to be
deployed on EC2
2.  Generating links from a stream of incoming RDF data items. This will
allow Silk to be used in conjunction with Linked Data crawlers like
LDspider. 

More information about the Silk framework, the Silk-LSL language
specification, as well as several examples that demonstrate how Silk is used
to set links between different data sources in the LOD cloud can be found at

http://www4.wiwiss.fu-berlin.de/bizer/silk/

The Silk framework is provided under the terms of the Apache License,
Version 2.0 and can be downloaded from

http://sourceforge.net/projects/silk2/

The Silk 2.0 User Manual and Language Reference can be found at

http://www4.wiwiss.fu-berlin.de/bizer/silk/spec/


Lots of thanks to

1. Robert Isele and Anja Jentzsch (both Freie Universität Berlin) who
reimplemented Silk in Scala, introduced the blocking features and redesigned
the Silk-LSL language.
2. Julius Volz (Google) who designed and implemented the original Python
version of the Silk Framework.
3. Vulcan Inc. (http://www.vulcan.com/) which enabled us to do this work by
sponsoring Silk as part of its Project Halo (www.projecthalo.com). 

Happy linking,

Chris Bizer, Robert Isele and Anja Jentzsch


--
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
ch...@bizer.de





Semantic Web Challenge @ ISWC 2010 - Call for Participation

2010-05-31 Thread Chris Bizer
Dear all,

we are happy to announce the Semantic Web Challenge 2010!

The Semantic Web Challenge 2010 is collocated with the 9th International
Semantic Web Conference (ISWC2010) in Shanghai, China. As last year, the
challenge consists of two tracks: the Open Track and the Billion Triples
Track, which requires participants to make use of the data set that has been
crawled from the public Semantic Web. The data set consists of 3.2 billion
triples this year and can be downloaded from the challenge's website.   

The Call for Participation is found below. More information about the
Challenge is provided at

http://challenge.semanticweb.org/

We are looking forward to your submissions, which we hope will make the
Semantic Web Challenge again one of the most exciting events at ISWC.

Best regards,

Diana and Chris


--

Call for Participation for the 

8th Semantic Web Challenge 

at the 9th International Semantic Web Conference ISWC 2010 
Shanghai, China, November 7-11, 2010 

http://challenge.semanticweb.org/

--

Introduction

Submissions are now invited for the 8th annual Semantic Web Challenge, the
premier event for demonstrating practical progress towards achieving the
vision of the Semantic Web. The central idea of the Semantic Web is to
extend the current human-readable Web by encoding some of the semantics of
resources in a machine-processable form. Moving beyond syntax opens the door
to more advanced applications and functionality on the Web. Computers will
be better able to search, process, integrate and present the content of
these resources in a meaningful, intelligent manner. 

As the core technological building blocks are now in place, the next
challenge is to demonstrate the benefits of semantic technologies by
developing integrated, easy to use applications that can provide new levels
of Web functionality for end users on the Web or within enterprise settings.
Applications submitted should give evidence of clear practical value that
goes above and beyond what is possible with conventional web technologies
alone. 

As in previous years, the Semantic Web Challenge 2010 will consist of two
tracks: the Open Track and the Billion Triples Track. The key difference
between the two tracks is that the Billion Triples Track requires the
participants to make use of the data set (consisting of 3.2 billion triples
this year) that has been crawled from the Web and is provided by the
organizers. The Open Track has no such restrictions. As before, the
Challenge is open to everyone from industry and academia. The authors of the
best applications will be awarded prizes and featured prominently at special
sessions during the conference. 

The overall goal of this event is to advance our understanding of how
Semantic Web technologies can be exploited to produce useful applications
for the Web. Semantic Web applications should integrate, combine, and deduce
information from various sources to assist users in performing specific
tasks. 

---
Challenge Criteria

The Challenge is defined in terms of minimum requirements and additional
desirable features that submissions should exhibit. The minimum requirements
and the additional desirable features are listed below per track. 

Open Track

Minimal requirements

1. The application has to be an end-user application, i.e. an application
that provides a practical value to general Web users or, if this is not the
case, at least to domain experts. 
2. The information sources used should be under diverse ownership or control,
should be heterogeneous (syntactically, structurally, and semantically), and
should contain substantial quantities of real world data (i.e. not toy
examples). The meaning of data has to play a central role. 
3. Meaning must be represented using Semantic Web technologies. 
4. Data must be manipulated/processed in interesting ways to derive useful
information and this semantic information processing has to play a central
role in achieving things that alternative technologies cannot do as well, or
at all; 

Additional Desirable Features 

In addition to the above minimum requirements, we note other desirable
features that will be used as criteria to evaluate submissions. 

1. The application provides an attractive and functional Web interface (for
human users) 
2. The application should be scalable (in terms of the amount of data used
and in terms of distributed components working together). Ideally, the
application should use all data that is currently published on the Semantic
Web. 
3. Rigorous evaluations have taken place that demonstrate the benefits of
semantic technologies, or validate the results obtained. 
4. Novelty, in applying semantic technology to a domain or task that has
not been considered before 
5. Functionality is different from or goes beyond pure information retrieval

6. The application has clear commercia

Please report bugs to be fixed for the DBpedia 3.5.1 release

2010-04-15 Thread Chris Bizer
Hi all,

> Great stuff, this is also why we are going to leave the current DBpedia
> 3.5 instance to stew for a while (until end of this week or a little later).
> 
> DBpedia users:
> Now is the time to identify problems with the DBpedia 3.5 dataset dumps.
> We don't want to continue reloading DBpedia (Static Edition and then
> recalibrating DBpedia-Live) based on faulty datasets related matters, we
> do have other operational priorities etc..

Yes, the testing by the community has exposed enough small and medium bugs in 
the datasets so that we are going to extract a new fixed 3.5.1. release next 
week.

In my opinion the bugs do not impair Robert's and Anja's great achievement of 
porting the extraction framework from PHP to Scala. If you rewrite more than 
10,000 lines of code for something as complex as a multilingual Wikipedia 
extraction, I think it is normal that some minor bugs remain even after their 
thorough testing.

So, if you have discovered additional bugs and want them fixed, please
report them to the DBpedia bug tracker by Friday EOB:

http://sourceforge.net/tracker/?group_id=190976


Cheers,

Chris
 

> -Original Message-
> From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On behalf
> of Kingsley Idehen
> Sent: Thursday, April 15, 2010 15:44
> To: Andy Seaborne
> Cc: public-lod@w3.org; dbpedia-discussion
> Subject: Re: DBpedia hosting burden
> 
> Andy Seaborne wrote:
> > I ran the files from
> > http://www.openjena.org/~afs/DBPedia35-parse-log-2010-04-15.txt
> > through an N-Triples parser with checking:
> >
> > The report is here (it's 25K lines long):
> >
> > http://www.openjena.org/~afs/DBPedia35-parse-log-2010-04-15.txt
> >
> > It covers both strict errors and warnings of ill-advised forms.
> >
> > A few examples:
> >
> > Bad IRI: <=?(''[[Nepenthes>
> > Bad IRI: 
> >
> > Bad lexical forms for the value space:
> > "1967-02-31"^^http://www.w3.org/2001/XMLSchema#date
> > (there is no February the 31st)
> >
> >
> > Warning of well known ports of other protocols:
> > http://stream1.securenetsystems.net:443
> >
> > Warning about explicit about port 80:
> >
> > http://bibliotecadigitalhispanica.bne.es:80/
> >
> > and use of . and .. in absolute URIs which are all from the standard
> > list of IRI warnings.
> >
> > Bad IRI:  Code:
> > 8/NON_INITIAL_DOT_SEGMENT in PATH: The path contains a segment /../
> > not at the beginning of a relative reference, or it contains a /./
> > These should be removed.
> >
> > Andy
> >
> > Software used:
> >
> > The IRI checker, by Jeremy Carroll, is available from
> > http://www.openjena.org/iri/ and Maven.
> >
> > The lexical form checking is done by Apache Xerces.
> >
> > The N-triples parser is the one from TDB v0.8.5 which bundles the
> > above two together.
> >
> >
> > On 15/04/2010 9:54 AM, Malte Kiesel wrote:
> >> Ivan Mikhailov wrote:
> >>
> >>> If I were The Emperor of LOD I'd ask all grand dukes of datasources to
> >>> put fresh dumps at some torrent with control of UL/DL ratio :)
> >>
> >> Last time I checked (which was quite a while ago though), loading
> >> DBpedia in a normal triple store such as Jena TDB didn't work very well
> >> due to many issues with the DBpedia RDF (e.g., problems with the URIs of
> >> external links scraped from Wikipedia).
> >>
> >> I don't know whether this is a bug in TDB or DBpedia but I guess this is
> >> one of the problems causing people to use DBpedia online only - even if,
> >> due to performance reasons, running it locally would be far better.
> >>
> >> Regards
> >> Malte
> >>
> >
> >
> Andy,
> 
> Great stuff, this is also why we are going to leave the current DBpedia
> 3.5 instance to stew for a while (until end of this week or a little later).
> 
> DBpedia users:
> Now is the time to identify problems with the DBpedia 3.5 dataset dumps.
> We don't want to continue reloading DBpedia (Static Edition and then
> recalibrating DBpedia-Live) based on faulty datasets related matters, we
> do have other operational priorities etc..
> 
> 
> --
> 
> Regards,
> 
> Kingsley Idehen
> President & CEO
> OpenLink Software
> Web: http://www.openlinksw.com
> Weblog: http://www.openlinksw.com/blog/~kidehen
> Twitter/Identi.ca: kidehen
> 
> 
> 
> 





ANN: DBpedia 3.5 released

2010-04-12 Thread Chris Bizer
 Leipzig) for providing the
knowledge base via the DBpedia download server at Universität Leipzig. 

* Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the
knowledge base into the Virtuoso instance that serves the Linked Data view
and SPARQL endpoint. 

The whole DBpedia team is very thankful to three companies which enabled us
to do all this by supporting and sponsoring the DBpedia project:

* Neofonie GmbH (http://www.neofonie.de/index.jsp), a Berlin-based company
offering leading technologies in the area of Web search, social media and
mobile applications.

* Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan Inc.
creates and advances a variety of world-class endeavors and high impact
initiatives that change and improve the way we live, learn, do business
(http://www.vulcan.com/).

* OpenLink Software (http://www.openlinksw.com/). OpenLink Software develops
the Virtuoso Universal Server, an innovative enterprise grade server that
cost-effectively delivers an unrivaled platform for Data Access, Integration
and Management. 

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new DBpedia knowledge base! 

Cheers, 

Chris Bizer


--
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
ch...@bizer.de




OPEN POSITION: Move to Berlin, work on DBpedia (1 year full-time contract)

2010-03-29 Thread Chris Bizer
Hi all,

DBpedia [1] is a community effort to extract structured information from
Wikipedia and to make this information available on the Web. DBpedia allows
you to ask sophisticated queries against Wikipedia knowledge [2]. DBpedia
also plays a central role as an interlinking hub in the emerging Web of Data
[3].
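
For example, such a query can be issued against the public endpoint with a
few lines of Python. This is a sketch only: it assumes the SPARQLWrapper
library is installed and that the class URI used below exists in the DBpedia
ontology.

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing ?label WHERE {
  ?thing a <http://dbpedia.org/ontology/City> ;   # assumed class URI
         rdfs:label ?label .
  FILTER (lang(?label) = "en")
} LIMIT 5
"""

def run():
    # Send the query to the public DBpedia endpoint and print the bindings.
    endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
    endpoint.setQuery(QUERY)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()
    for row in results["results"]["bindings"]:
        print(row["thing"]["value"], "-", row["label"]["value"])

if __name__ == "__main__":
    run()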

The DBpedia Team at Freie Universität Berlin [4] is looking for a
developer/researcher who wants to contribute to the further development of
the DBpedia information extraction framework, investigate approaches to
annotate free-text with DBpedia URIs and participate in the various Linked
Data efforts currently advanced by our team.

Candidates should have 
+ good programming skills in Java; in addition, Scala and PHP are helpful. 
+ a university degree, preferably in computer science or information systems.


Previous knowledge of Semantic Web Technologies (RDF, SPARQL, Linked Data)
and experience with information extraction and/or named entity recognition
techniques are a plus.

Contract start date: 15 May 2010
Duration: 1 year
Salary: around 40.000 Euro/year (German BAT IIa)

You will be part of an innovative and cordial team and enjoy flexible work
hours. After the year, chances are high that you will be able to choose
between longer-term positions at Freie Universität Berlin and at neofonie
GmbH.

Please contact Chris Bizer via email (ch...@bizer.de) until 15 April 2010
for additional details and include information about your skills and
experience.

The whole DBpedia team is very thankful to neofonie GmbH [5] for
contributing to the development of DBpedia by financing this position.
neofonie is a Berlin-based company offering leading technologies in the area
of Web search, social media and mobile applications.

Cheers,

Chris


[1] http://dbpedia.org/
[2] http://wiki.dbpedia.org/FacetedSearch
[3] http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[4] http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/
[5] http://www.neofonie.de/

--
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
ch...@bizer.de




Invitation to contribute to DBpedia by improving the infobox mappings + New Scala-based Extraction Framework

2010-03-12 Thread Chris Bizer
Hi all,

in order to extract high quality data from Wikipedia, the DBpedia extraction
framework relies on infobox-to-ontology mappings which define how Wikipedia
infobox templates are mapped to classes of the DBpedia ontology.
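
Conceptually, a mapping ties infobox keys to ontology classes and properties.
The following Python sketch illustrates the idea only; it is not the DBpedia
mapping language or the extraction framework, and the template and property
names are hypothetical.

# Conceptual sketch: apply a template-to-ontology mapping to extracted
# infobox key/value pairs and produce RDF-style triples.
INFOBOX_MAPPINGS = {
    "Infobox settlement": {                      # hypothetical template name
        "class": "http://dbpedia.org/ontology/Place",
        "properties": {
            "name": "http://xmlns.com/foaf/0.1/name",
            "population_total": "http://dbpedia.org/ontology/populationTotal",
        },
    },
}

def map_infobox(resource_uri, template, values):
    mapping = INFOBOX_MAPPINGS.get(template)
    if mapping is None:
        return []
    triples = [(resource_uri,
                "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
                mapping["class"])]
    for key, value in values.items():
        prop = mapping["properties"].get(key)
        if prop is not None:
            triples.append((resource_uri, prop, value))
    return triples

if __name__ == "__main__":
    for t in map_infobox("http://dbpedia.org/resource/Berlin",
                         "Infobox settlement",
                         {"name": "Berlin", "population_total": "3431700"}):
        print(t)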

Up to now, these mappings were defined only by the DBpedia team and as
Wikipedia is huge and contains lots of different infobox templates, we were
only able to define mappings for a small subset of all Wikipedia infoboxes
and also only managed to map a subset of the properties of these infoboxes.

In order to enable the DBpedia user community to contribute to improving the
coverage and the quality of the mappings, we have set up a public wiki at 

http://mappings.dbpedia.org/index.php/Main_Page 

which contains: 

1. all mappings that are currently used by the DBpedia extraction framework
2. the definition of the DBpedia ontology and
3. documentation for the DBpedia mapping language as well as step-by-step
guides on how to extend and refine mappings and the ontology.

So if you are using DBpedia data and you were always annoyed that
DBpedia did not properly cover the infobox template that is most important
to you, you are highly invited to extend the mappings and the ontology in
the wiki. Your edits will be used for the next DBpedia release expected to
be published in the first week of April.

The process of contributing to the ontology and the mappings is as follows:

1.  You familiarize yourself with the DBpedia mapping language by reading
the documentation in the wiki.
2.  In order to prevent random SPAM, the wiki is read-only and new editors
need to be confirmed by a member of the DBpedia team (currently Anja
Jentzsch does the clearing). Therefore, please create an account in the wiki
for yourself. After this, Anja will give you editing rights and you can edit
the mappings as well as the ontology.
3. For contributing to the next DBpedia release, you can edit until Sunday,
March 21. After this, we will check the mappings and the ontology definition
in the Wiki for consistency and then use both for the next DBpedia release.

So, we are starting kind of a social experiment on whether the DBpedia user
community is willing to contribute to the improvement of DBpedia and on how
the DBpedia ontology develops through community contributions :-)

Please excuse that it is currently still rather cumbersome to edit the
mappings and the ontology. We are currently working on a visual editor for
the mappings as well as a validation service, which will check edits to the
mappings and test the new mappings against example pages from Wikipedia. We
hope that we will be able to deploy these tools in the next two months, but
still wanted to release the wiki as early as possible in order to already
allow community contributions to the DBpedia 3.5 release.

If you have questions about the wiki and the mapping language, please ask
them on the DBpedia mailing list where Anja and Robert will answer them.

What else is happening around DBpedia?

In order to speed up the data extraction process and to lay a solid
foundation for the DBpedia Live extraction, we have ported the DBpedia
extraction framework from PHP to Scala/Java. The new framework extracts
exactly the same types of data from Wikipedia as the old framework, but
now processes a single page in 13 milliseconds instead of 200 milliseconds.
In addition, the new framework can extract data from tables
within articles and can handle multiple infobox templates per article. The
new framework is available under GPL license in the DBpedia SVN and is
documented at http://wiki.dbpedia.org/Documentation.

The whole DBpedia team is very thankful to two companies which enabled us to
do all this by sponsoring the DBpedia project:

1. Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan
Inc. creates and advances a variety of world-class endeavors and high impact
initiatives that change and improve the way we live, learn, do business
(http://www.vulcan.com/).
2.  Neofonie GmbH, a Berlin-based company offering leading technologies in
the area of Web search, social media and mobile applications
(http://www.neofonie.de/index.jsp).

Thank you a lot for your support!

I personally would also like to thank:

1.  Anja Jentzsch, Robert Isele, and Christopher Sahnwaldt for all their
great work on implementing the new extraction framework and for setting up
the mapping wiki.
2.  Andreas Lange and Sidney Bofah for correcting and extending the mappings
in the Wiki.

Cheers, 

Chris 


--
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
ch...@bizer.de





CfP: Linked Data on the Web (LDOW2010) Workshop at WWW2010

2009-12-15 Thread Chris Bizer
Hi all,

 

we are happy to announce that there will be a 3rd edition of the Linked Data
on the Web workshop at WWW2010 in Raleigh.

 

After the Web of Data has grown for another year, we think that it will be
exciting to bring the community together at LDOW2010 and discuss what we
have achieved so far and what will be the most important topics for the year
to come.

 

The LDOW2010 website is found at http://events.linkeddata.org/ldow2010/

 

The Call for Papers is found below.

 

We are looking forward to see you in Raleigh.

 

Cheers,

 

Chris Bizer

Tom Heath

Tim Berners-Lee

Michael Hausenblas

 

 

 

==  Call for Papers  ===

 

Linked Data on the Web (LDOW2010) Workshop

 

at WWW2010

 

=

 

April 27th or 28th, 2010

Raleigh, USA

 

=

Objectives

 

The Web is increasingly understood as a global information space consisting
not just of linked documents, but also of linked data. More than just a
vision, the resulting Web of Data has been brought into being by the
maturing of the Semantic Web technology stack, and by the publication of
large datasets according to the principles of Linked Data. To date, the Web
of Data has grown to a size of roughly 13.1 billion RDF triples, with
contributions coming increasingly from companies, government and public
sector projects, as well as from individual Web enthusiasts. In addition to
publishing and interlinking datasets, there is intensive work on Linked Data
browsers, Web of Data search engines and other applications that consume
Linked Data from the Web.

 

LDOW2010 follows the successful LDOW2008 workshop at WWW2008 in Beijing and
the LDOW2009 workshop at WWW2009 in Madrid. As the publication of Linked
Data on the Web continues apace, the need becomes more pressing for
principled research in the areas of user interfaces for the Web of Data as
well as on issues of quality, trust and provenance in Linked Data. We also
expect to see a number of submissions related to current areas of high
Linked Data activity, such as government transparency, life sciences and the
media industry. The goal of this workshop is to provide a forum for exposing
high quality, novel research and applications in these (and related) areas.
In addition, by bringing together researchers in this field, we expect the
event to further shape the ongoing Linked Data research agenda.

 

=

Topics of Interest

 

Topics of interest for the workshop include, but are not limited to, the
following:

 

1. Linked Data Application Architectures 

   * crawling, caching and querying Linked Data 

   * dataset dynamics and synchronization 

   * Linked Data mining 

 

2. Data Linking and Data Fusion 

   * linking algorithms and heuristics, identity resolution 

   * Web data integration and data fusion 

   * link maintenance 

   * performance of linking infrastructures/algorithms on Web data 

 

3. Quality, Trust and Provenance in Linked Data 

   * tracking provenance and usage of Linked Data 

   * evaluating quality and trustworthiness of Linked Data 

   * profiling of Linked Data sources 

 

4. User Interfaces for the Web of Data 

   * approaches to visualizing and interacting with distributed Web data 

   * Linked Data browsers and search engines 

 

5. Data Publishing 

   * tools for publishing large data sources as Linked Data on the Web (e.g.
relational databases, XML repositories) 

   * embedding data into classic Web documents (e.g. RDFa, Microformats) 

   * describing data on the Web (e.g. VoiD, Semantic Site Map) 

   * licensing issues in Linked Data publishing 

 

6. Business models for Linked Data publishing and consumption 

 

 

=

Submissions

 

We seek three kinds of submissions:

 

1. Full technical papers: up to 10 pages in ACM format 

2. Short technical and position papers: up to 5 pages in ACM format 

3. Demo description: up to 2 pages in ACM format 

 

Submissions must be formatted using the WWW2010 templates available at
http://www2010.org/www/authors/submissions/formatting-guidelines/.

Submissions will be peer reviewed by three independent reviewers. Accepted
papers will be presented at the workshop and included in the workshop
proceedings.

 

Please submit your paper via EasyChair at
http://www.easychair.org/conferences/?conf=ldow2010

 

=

Important Dates

 

Submission deadline: 15th February 2010, 23.59 Hawaii time 

Notification of acceptance: 8th March 2010 

Camera-ready versions of accepted papers: 21st March 2010 

Workshop date: 27th or 28th April 2010 

 

=

Organising Committee

 

Christian Bizer, Freie Universität Berlin, Germany 

Tom Heath, Talis Information Ltd, UK 

Tim Berners-Lee, MIT CSAIL, USA 

Michael Hausenblas, DERI, NUI Galway, Ireland

Re: RDF Update Feeds + URI time travel on HTTP-level

2009-11-20 Thread Chris Bizer
Hi Michael, Georgi and all,

just to complete the list of proposals, here another one from Herbert Van de
Sompel from the Open Archives Initiative.

Memento: Time Travel for the Web
http://arxiv.org/abs/0911.1112

The idea of Memento is to use HTTP content negotiation in the datetime
dimension. By using a newly introduced X-Accept-Datetime HTTP header they
add a temporal dimension to URIs. The result is a framework in which
archived resources can seamlessly be reached via the URI of their original.
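
In practice, a client would simply send the extra header. Here is a minimal
Python sketch (the X-Accept-Datetime header name is taken from the Memento
proposal cited above; whether a given server honours it, and the target
resource and datetime used, are assumptions):

import urllib.request

def fetch_at(uri, http_date):
    # Ask for the state of the resource as it was at the given datetime.
    request = urllib.request.Request(uri, headers={
        "X-Accept-Datetime": http_date,
        "Accept": "application/rdf+xml",
    })
    with urllib.request.urlopen(request) as response:
        return response.status, response.headers.get("Content-Type")

if __name__ == "__main__":
    print(fetch_at("http://dbpedia.org/resource/Berlin",
                   "Tue, 17 Nov 2009 00:00:00 GMT"))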

Sounds cool to me. Does anybody have an opinion on whether this violates
general Web architecture somewhere?
Is anybody aware of other proposals that work on the HTTP level?

Have a nice weekend,

Chris



> -Original Message-
> From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On
> behalf of Georgi Kobilarov
> Sent: Friday, November 20, 2009 18:48
> To: 'Michael Hausenblas'
> Cc: Linked Data community
> Subject: RE: RDF Update Feeds
> 
> Hi Michael,
> 
> nice write-up on the wiki! But I think the vocabulary you're proposing is
> too generally descriptive. Dataset publishers, once offering update
> feeds, should not only tell that/if their datasets are "dynamic", but
> instead how dynamic they are.
> 
> Could be very simple by expressing: "Pull our update-stream once per
> seconds/minute/hour in order to be *enough* up-to-date".
> 
> Makes sense?
> 
> Cheers,
> Georgi
> 
> --
> Georgi Kobilarov
> www.georgikobilarov.com
> 
> > -Original Message-
> > From: Michael Hausenblas [mailto:michael.hausenb...@deri.org]
> > Sent: Friday, November 20, 2009 4:01 PM
> > To: Georgi Kobilarov
> > Cc: Linked Data community
> > Subject: Re: RDF Update Feeds
> >
> >
> > Georgi, All,
> >
> > I like the discussion, and as it seems to be a recurrent pattern as
> > pointed
> > out by Yves (which might be a sign that we need to invest some more
> > time
> > into it) I've tried to sum up a bit and started a straw-man proposal
> > for a
> > more coarse-grained solution [1].
> >
> > Looking forward to hearing what you think ...
> >
> > Cheers,
> >   Michael
> >
> > [1] http://esw.w3.org/topic/DatasetDynamics
> >
> > --
> > Dr. Michael Hausenblas
> > LiDRC - Linked Data Research Centre
> > DERI - Digital Enterprise Research Institute
> > NUIG - National University of Ireland, Galway
> > Ireland, Europe
> > Tel. +353 91 495730
> > http://linkeddata.deri.ie/
> > http://sw-app.org/about.html
> >
> >
> >
> > > From: Georgi Kobilarov 
> > > Date: Tue, 17 Nov 2009 16:45:46 +0100
> > > To: Linked Data community 
> > > Subject: RDF Update Feeds
> > > Resent-From: Linked Data community 
> > > Resent-Date: Tue, 17 Nov 2009 15:46:30 +
> > >
> > > Hi all,
> > >
> > > I'd like to start a discussion about a topic that I think is getting
> > > increasingly important: RDF update feeds.
> > >
> > > The linked data project is starting to move away from releases of
> > large data
> > > dumps towards incremental updates. But how can services consuming rdf
> > data
> > > from linked data sources get notified about changes? Is anyone aware
> > of
> > > activities to standardize such rdf update feeds, or at least aware of
> > > projects already providing any kind of update feed at all? And
> > related to
> > > that: How do we deal with RDF diffs?
> > >
> > > Cheers,
> > > Georgi
> > >
> > > --
> > > Georgi Kobilarov
> > > www.georgikobilarov.com
> > >
> > >
> > >





ANN: DBpedia 3.4 released

2009-11-11 Thread Chris Bizer
Hi all,

we are happy to announce the release of DBpedia 3.4. The new release is
based on Wikipedia dumps dating from September 2009. 

The new DBpedia data set describes more than 2.9 million things, including
282,000 persons, 339,000 places, 88,000 music albums, 44,000 films, 15,000
video games, 119,000 organizations, 130,000 species and 4400 diseases. The
DBpedia data set now features labels and abstracts for these things in 91
different languages; 807,000 links to images and 3,840,000 links to external
web pages; 4,878,100 external links into other RDF datasets, 415,000
Wikipedia categories, and 75,000 YAGO categories. The data set consists of
479 million pieces of information (RDF triples) out of which 190 million
were extracted from the English edition of Wikipedia and 289 million were
extracted from other language editions. 

The new release provides the following improvements and changes compared to
the DBpedia 3.3 release:

1. the data set has been extracted from more recent Wikipedia dumps.
2. the data set now provides labels, abstracts and infobox data in 91
different languages.
3. we provide two different versions of the DBpedia Infobox Ontology (loose
and strict) in order to meet different application requirements. Please
refer to http://wiki.dbpedia.org/Datasets#h18-11 for details.
4. as Wikipedia has moved to dual-licensing, we also dual-license DBpedia.
The DBpedia 3.4 data set is licensed under the terms of the Creative Commons
Attribution-ShareAlike 3.0 license and the GNU Free Documentation License.
5. the mapping-based infobox data extractor has been improved and now
normalizes units of measurement.
6. various bug fixes and improvements throughout the code base. Please refer
to the change log for the complete list http://wiki.dbpedia.org/Changelog

You can download the new DBpedia dataset from
http://wiki.dbpedia.org/Downloads34. As usual, the dataset is also available
as Linked Data and via the DBpedia SPARQL endpoint.

Lots of thanks to

* Anja Jentzsch, Christopher Sahnwaldt, Robert Isele, and Paul Kreis (all
Freie Universität Berlin) for improving the DBpedia extraction framework and
for extracting the new data set.
* Jens Lehmann and Sören Auer (both Universität Leipzig) for providing the new
data set via the DBpedia download server at Universität Leipzig.
* Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the
dataset into the Virtuoso instance that serves the Linked Data view and
SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether
for providing the server infrastructure for DBpedia.
* neofonie GmbH (http://www.neofonie.de/index.jsp) for supporting the
DBpedia project by paying Christopher Sahnwaldt.

The next steps for the DBpedia project will be to

1. synchronize Wikipedia and DBpedia by deploying the DBpedia live
extraction which updates the DBpedia knowledge base immediately when a
Wikipedia article changes. 
2. enable the DBpedia user community to edit and maintain the DBpedia
ontology and the infobox mappings that are used by the extraction framework
in a public Wiki. 
3. increase the quality of the extracted data by improving and fine-tuning
the extraction code.

All this hopefully will happen soon.

More information about DBpedia is found at http://dbpedia.org/About


Have fun with the new data set!

Cheers

Chris Bizer


--
Chris Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
ch...@bizer.de





AW: Linked Data Gathering at ISWC2009?

2009-10-22 Thread Chris Bizer
Hi Juan,

Tom and I have been overly busy with writing EC grants over the last weeks,
so the ISWC linked data gathering somehow slipped through.

It would be great if you and Olaf would organize the gathering.

The poster session on Tuesday goes until late, and a lot of Linked Data
related applications will be presented at that time as part of the Semantic
Web Challenge.

So what about having the gathering Wednesday evening?

Cheers,

Chris


> -Original Message-
> From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On
> behalf of Juan Sequeda
> Sent: Thursday, October 22, 2009 08:34
> To: public-lod@w3.org
> Subject: Linked Data Gathering at ISWC2009?
> 
> Hi everybody!
> We are a bit surprised that nobody has proposed a Linked Data
> gathering at ISWC2009.
> 
> Olaf and I would commit to organize a gathering this year. The
> possible dates are either Oct 26 (the first day of the conference) or Oct
> 27 (after the poster session).
> 
> What do you all think?
> 
> (A quick reminder: don't forget about the Linked Data-a-thon. More
> info at www.linkeddata-a-thon.com)
> 
> Juan Sequeda
> www.juansequeda.com




Semantic Web Challenge 2009: Submission deadline approaching!

2009-09-08 Thread Chris Bizer
Hello everybody,

Peter and I want to invite you again to submit to the Semantic Web Challenge 
2009 and remind you that the submission deadline is slowly approaching:

The deadline is October 1, 2009. 

The submission form is already open and you can submit your challenge entry at

http://challenge.semanticweb.org/

If you have any questions concerning the challenge, please let us know.

We are looking forward to your submissions and to showcase the current 
state-of-the-art in Semantic Web applications through the Semantic Web 
Challenge!

Kind regards,

Peter Mika and Chris Bizer






Call for Participation

7th Semantic Web Challenge - Open Track and Billion Triples Track

at the

8th International Semantic Web Conference (ISWC 2009)

Chantilly, Virginia, USA
October 25-29, 2009

http://challenge.semanticweb.org/



We invite submissions to the seventh annual Semantic Web Challenge, the 
premier event for demonstrating practical progress towards achieving the 
vision of the Semantic Web.

The central idea of the Semantic Web is to extend the current human-readable 
Web by encoding some of the semantics of resources in a machine-processable 
form. Moving beyond syntax opens the door to more advanced applications and 
functionality on the Web. Computers will be better able to search, process, 
integrate and present the content of these resources in a meaningful, 
intelligent manner.

As the core technological building blocks are now in place, the next challenge 
is to show off the benefits of semantic technologies by developing integrated, 
easy to use applications that can provide new levels of Web functionality for 
end users on the Web or within enterprise settings. Applications submitted 
should demonstrate clear practical value that goes above and beyond what is 
possible with conventional web technologies alone.

The Semantic Web Challenge of 2009 will consist of two tracks: the Open Track 
and the Billion Triples Track. The key difference between the two tracks is 
that the Billion Triples Track requires the participants to make use of the 
data set — a billion triples — that has been crawled from the Web and is 
provided by the organizers. The Open Track has no such restrictions.

As before, the Challenge is open to everyone from academia and industry. The 
authors of the best applications will be awarded prizes and featured 
prominently at special sessions during the conference.


GOALS
-
The overall goal of this event is to advance our understanding of how semantic 
technologies can be exploited to produce useful applications for the Web. 
Semantic Web applications should integrate, combine, and deduce information 
from various sources to assist users in performing specific tasks.

The specific goal of the Billion Triples Track is to demonstrate the 
scalability of applications as well as capability to deal with the specifics of 
data that has been crawled from the public Web.

We stress that the goal of this is not to be a benchmarking effort between 
triple stores, but rather to demonstrate applications that can work on Web 
scale using realistic Web-quality data.


Minimal Requirements

Submissions for the Semantic Web Challenge must meet the following minimum 
requirements:

For the Open Track:
~~~

1. The meaning of data has to play a central role.
* Meaning must be represented using formal descriptions.
* Data must be manipulated/processed in interesting ways to derive useful 
information, and
* this semantic information processing has to play a central role in achieving 
things that alternative technologies cannot do as well, or at all.
2. The information sources used
* should be under diverse ownership or control,
* should be heterogeneous (syntactically, structurally, and semantically), and
* should contain substantial quantities of real-world data (i.e.
not toy examples).
3. The application has to be an end-user application, i.e. an application that 
provides practical value to domain experts.

Although we expect that most applications will use RDF, RDF Schema, or OWL this 
is not a requirement. What is more important is that whatever semantic 
technology is used, it plays a central role in achieving interesting new levels 
of functionality or performance.

It is required that all applications assume an open world, i.e. that the 
information is never complete.

Additional Desirable Features
-
In addition to the above minimum requirements, we note other desirable features 
that will be used as criteria to evaluate submissions.
- The application provides an attractive and functional Web interface (for 
human users)
- The application should be scalable (in terms of the amount of data used and 
in terms of distributed components working together).
Ideally, the application should use

AW: [Dbpedia-discussion] Fwd: Your message to Dbpedia-discussion awaits moderator approval

2009-08-11 Thread Chris Bizer

Hi Kingsley, Pat and all,

> Chris/Anja: I believe this data set was touched on your end, right?

Yes, Anja will fix the file and will send an updated version.

Pat Hayes wrote:
>> This website should be taken down immediately, before it does serious 
>> harm. It is irresponsible to publish such off-the-wall equivalentClass 
>> assertions.

Pat: Your comment seems to imply that you see the Semantic Web as something
consistent that can be broken by individual information providers publishing
false information. If this is the case, the Semantic Web will never fly!

Everything on the Web is a claim by somebody. There are no facts, there is
no truth, there are only opinions.

Semantic Web applications must take this into account and therefore always
assess data quality and trustworthiness before they do something with the
data. If you build applications that break once somebody publishes false
information, you are obviously doomed.
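
One (of many) simple safeguards is to keep each source's claims in their own
named graph and to only act on graphs your application has decided to trust.
A minimal sketch of this idea, assuming Python with rdflib (the graph names
and the trust policy below are made up for illustration):

# Track who claims what via named graphs and filter by a trust policy.
from rdflib import ConjunctiveGraph, Literal, URIRef

RDFS_LABEL = URIRef("http://www.w3.org/2000/01/rdf-schema#label")
TRUSTED = {URIRef("urn:graph:dbpedia")}  # hypothetical trust policy

store = ConjunctiveGraph()
dbpedia_claims = store.get_context(URIRef("urn:graph:dbpedia"))
unknown_claims = store.get_context(URIRef("urn:graph:unknown-source"))

berlin = URIRef("http://dbpedia.org/resource/Berlin")
dbpedia_claims.add((berlin, RDFS_LABEL, Literal("Berlin")))
unknown_claims.add((berlin, RDFS_LABEL, Literal("definitely not Berlin")))

# Only act on claims whose named graph we trust.
for graph in store.contexts():
    if graph.identifier in TRUSTED:
        for s, p, o in graph:
            print(graph.identifier, s, p, o)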

As I thought this would be generally understood, I'm very surprised by your
comment.

Cheers,

Chris


> -Ursprüngliche Nachricht-
> Von: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] Im
Auftrag
> von Kingsley Idehen
> Gesendet: Montag, 10. August 2009 23:29
> An: Kavitha Srinivas
> Cc: Tim Finin; Anja Jentzsch; public-lod@w3.org; dbpedia-
> discuss...@lists.sourceforge.net; Chris Bizer
> Betreff: Re: [Dbpedia-discussion] Fwd: Your message to Dbpedia-discussion
> awaits moderator approval
> 
> Kavitha Srinivas wrote:
> > I will fix the URIs.. I believe the equivalenceClass assertions were
> > added in by someone at OpenLink (I just sent the raw file with the
> > conditional probabilities for each pair of types that were above the
> > .80 threshold).  So can whoever uploaded the file fix the property to
> > what Tim suggested?
> Hmm,  I didn't touch the file, neither did anyone else at OpenLink. I
> just downloaded what was uploaded at:
> http://wiki.dbpedia.org/Downloads33, any based on my own personal best
> practices, put the data in a separate Named Graph :-)
> 
> Chris/Anja: I believe this data set was touched on your end, right?
> Please make the fixes in line with the findings from the conversation on
> this thread. Once corrected, I or someone else will reload.
> 
> Kingsley
> 
> > Thanks!
> > Kavitha
> > On Aug 10, 2009, at 5:03 PM, Kingsley Idehen wrote:
> >
> >> Kavitha Srinivas wrote:
> >>> Agree completely -- which is why I sent a base file which had the
> >>> conditional probabilities, the mapping, and the values to be able to
> >>> compute marginals.
> >>> About the URIs, I should have added in my email that because
> >>> freebase types are not URIs, and have types such as /people/person,
> >>> we added a base URI: http://freebase.com to the types.  Sorry I
> >>> missed mentioning that...
> >>> Kavitha
> >> Kavitha,
> >>
> >> If you apply the proper URIs, and then apply fixes to the mappings
> >> (from prior suggestions) we are set.  You can send me another dump
> >> and I will go one step further and put some sample SPARQL queries
> >> together which demonstrate how we can have many world views on the
> >> Web of Linked Data without anyone getting hurt in the process :-)
> >>
> >> Kingsley
> >>>
> >>> On Aug 10, 2009, at 4:42 PM, Tim Finin wrote:
> >>>
> >>>> Kavitha Srinivas wrote:
> >>>>> I understand what you are saying -- but some of this reflects the
> >>>>> way types are associated with freebase instances.  The types are
> >>>>> more like 'tags' in the sense that there is no hierarchy, but each
> >>>>> instance is annotated with multiple types.  So an artist would in
> >>>>> fact be annotated with person reliably (and probably less
> >>>>> consistently with /music/artist).  Similar issues with Uyhurs,
> >>>>> murdered children etc.  The issue is differences in modeling
> >>>>> granularity as well.  Perhaps a better thing to look at are types
> >>>>> where the YAGO types map to Wordnet (this is usually at a coarser
> >>>>> level of granularity).
> >>>>
> >>>> One way to approach this problem is to use a framework to mix logical
> >>>> constraints with probabilistic ones.  My colleague Yun Peng has been
> >>>> exploring integrating data backed by OWL ontologies with Bayesian
> >>>> information,
> >>>> with applications for ontology mapping.  See [1] for recent papers
> >&

AW: "How to Publish Linked Data" vs "Dereferencing HTTP URIs"

2009-07-09 Thread Chris Bizer
Hi Christopher,

> Q1) Should HtPLDotW be updated to remove references to
>  "Dereferencing HTTP URIs" in favor of "Cool URIs for the
>  Semantic Web"

Yes. The HtPLDotW document is from 2007 and urgently needs updating.
We are planning to do this during the summer, so expect a new version in
fall.

Up till then you might consider
http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf as
a more recent reference.
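
If it helps for the lightning talk, here is a minimal sketch of watching the
303 redirect in action, using only Python's standard library (the DBpedia URI
is just an example of a non-information resource):

# Dereference a non-information-resource URI and inspect the redirect
# without following it automatically.
import http.client

conn = http.client.HTTPConnection("dbpedia.org")
conn.request("GET", "/resource/Berlin",
             headers={"Accept": "application/rdf+xml"})
response = conn.getresponse()
print(response.status)                 # expected: 303 See Other
print(response.getheader("Location"))  # the describing document
conn.close()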

Cheers,

Chris


-Ursprüngliche Nachricht-
Von: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] Im Auftrag
von Christopher St John
Gesendet: Freitag, 10. Juli 2009 07:07
An: public-lod@w3.org
Betreff: "How to Publish Linked Data" vs "Dereferencing HTTP URIs"

I'm putting together a quick presentation on 303s and
Linked Data for the local Dallas Semantic Web Meetup
(it's part of a series of 10-minute lightning presentations)

I've got a couple of questions. They start out nitpicky
and pedantic (but I have some more entertaining ones
for later)

So, to start out with nitpicking: I've been reviewing the
various references, and noticed that

 "Dereferencing HTTP URIs"
 http://www.w3.org/2001/tag/doc/httpRange-14/HttpRange-14.html

appears in "How to Publish Linked Data on the Web".
But the latest version of "Dereferencing" is empty, with
a note that indicates that it's been abandoned in favor of:

 "Cool URIs for the Semantic Web"
 
https://gnowsis.opendfki.de/repos/gnowsis/papers/2006_11_concepturi/html/coo
luris_sweo_note.html

HtPLDotW also points to the 2006/11 version of CUftSW (albeit
at a different URL), but that version is an old draft. The newest
version is at:

 http://www.w3.org/TR/cooluris/

which is better (being at w3.org) but has language that indicates
that it is "just" a Note, and is not expected to become a
Recommendation.

So...

 Q1) Should HtPLDotW be updated to remove references to
 "Dereferencing HTTP URIs" in favor of "Cool URIs for the
 Semantic Web"

 Q2) The "Note" vs "Recommendation" thing is formal spec
 speak and may not mean what it appears to mean.
 Can someone comment? The wording "This is a draft
 document and may be updated, replaced or obsoleted
 by other documents at any time" in a Linked Data foundation
 document is fine if you're just experimenting, but could
 be alarming if you're considering, say, writing a commercial
 tool...

I did a (relatively quick) archive search but I could have
easily missed a discussion somewhere, apologies if this
has already been gone over. And thanks for your patience
with the geeky spec details.

-cks

-- 
Christopher St. John
c...@praxisbridge.com
http://praxisbridge.com
http://artofsystems.blogspot.com




Fusion Tables: Google's approach to sharing data on the Web

2009-07-03 Thread Chris Bizer
 

Hi all,

 

I’m regularly following Alon Halevy’s blog as I really like his thoughts on
dataspaces [1].

 

Today, I discovered this post about Google Fusion Tables

 

 

http://alonhalevy.blogspot.com/2009/06/fusion-tables-third-piece-of-puzzle.html

 

“The main goal of Fusion Tables is to make it easier for people to create,
manage and share on structured data on the Web. Fusion Tables is a new kind
of data management system that focuses on features that enable
collaboration. […] In a nutshell, Fusion Tables enables you to upload
tabular data (up to 100MB per table) from spreadsheets and CSV files. You
can filter and aggregate the data and visualize it in several ways, such as
maps and time lines. The system will try to recognize columns that represent
geographical locations and suggest appropriate visualizations. To
collaborate, you can share a table with a select set of collaborators or
make it public. One of the reasons to collaborate is to enable fusing data
from multiple tables, which is a simple yet powerful form of data
integration. If you have a table about water resources in the countries of
the world, and I have data about the incidence of malaria in various
countries, we can fuse our data on the country column, and see our data side
by side.”

 

See also

 

Google announcement

http://googleresearch.blogspot.com/2009/06/google-fusion-tables.html

Water data example

http://www.circleofblue.org/waternews/2009/world/google-brings-water-data-to-life/

 

Taking this together with Google Squared and the recent announcement that
Google is going to crawl microformats and RDFa,

it starts to look like the folks at Google are working in the same direction
as the Linking Open Data community, but as usual a bit more centralized and
less webish.

 

Cheers,

 

Chris

 

 

[1] http://www.cs.berkeley.edu/~franklin/Papers/dataspaceSR.pdf

 

--

Prof. Dr. Christian Bizer

Web-based Systems Group

Freie Universität Berlin

+49 30 838 55509

  http://www.bizer.de

  ch...@bizer.de

 



AW: Wikipedia relicensed: consequences for DBpedia and downstream?

2009-06-16 Thread Chris Bizer
Hi Dan,

we will dual-license the next DBpedia release under CC-BY-SA and GFDL.

We would even be willing to go for a more liberal license (for instance CC-BY),
if anybody with a legal background would assure us that we are allowed to do
so under US and European law.

Cheers,

Chris


-Ursprüngliche Nachricht-
Von: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] Im Auftrag
von Dan Brickley
Gesendet: Dienstag, 16. Juni 2009 09:38
An: public-lod@w3.org
Betreff: Wikipedia relicensed: consequences for DBpedia and downstream?

http://meta.wikimedia.org/wiki/Licensing_update/Implementation
[[
As per the licensing update vote result and subsequent Wikimedia 
Foundation Board resolution, any content on Wikimedia Foundation 
projects currently available under GFDL 1.2 with the possibility of 
upgrading to a later version will be made available additionally under 
Creative Commons Attribution/Share-Alike 3.0 Unported.

Specifically with regard to text, after this update, only dual-licensed 
content or CC-BY-SA-compatible content can be added to the projects, and 
GFDL-only submissions will no longer be accepted. In other words, 
CC-BY-SA will be the primary Wikimedia license for text, and GFDL will 
be retained as a secondary license.
]]

According to http://wiki.dbpedia.org/Datasets#h18-18 DBpedia is 
available under 
http://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_Li
cense

Will it also be made available under 
http://creativecommons.org/licenses/by-sa/3.0/ ? ("Attribution-Share 
Alike 3.0 Unported")

What do these distinctions mean in practice when we're dealing with 
mergable data rather than documents?

"Share Alike — If you alter, transform, or build upon this work, you may 
distribute the resulting work only under the same, similar or a 
compatible license."

... seems rather strong (eg. for intranet triplestore use).

Is anyone here not not a lawyer?

cheers,

Dan




AW: Chronicling America and Linked Data

2009-05-26 Thread Chris Bizer
Hi Ed,

sounds like a great new source of live Linked Data which is directly served
by the organization producing the data and not by university projects or
engaged individuals, as is still the case with many data sources in the
cloud.

Things are moving :-)

and I'm looking forward to the first applications that mash up Chronicling
America data with DBpedia and Geonames.

Also an important signal for the libraries and digital archives community
that you found OAI-ORE to be extremely useful.

Keep up the great work!

Cheers,

Chris


> -Ursprüngliche Nachricht-
> Von: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] Im
> Auftrag von Ed Summers
> Gesendet: Dienstag, 26. Mai 2009 17:20
> An: public-lod@w3.org
> Betreff: Chronicling America and Linked Data
> 
> There is a new pool of linked-data up at the Library of Congress in
> the Chronicling America application [1]. Chronicling America is the
> web view on data collected for the National Digital Newspaper Program
> (NDNP). NDNP is a 20-year joint project of the National Endowment for
> the Humanities and the Library of Congress to digitize and aggregate
> historic newspapers in the United States.
> 
> Right now there are close to a million digitized newspaper pages
> available, and information about 140,000 newspaper titles...all of
> which have individual web views, for example:
> 
>  Newspaper Title: San Francisco Call [2]
>  Issue: San Francisco Call, 1895-03-05 [3]
>  Page: San Francisco Call, 1895-03-05, page sequence 1 [4]
> 
> If you view source on them you should be able to auto-discover the
> application/rdf+xml representations that bundle up information about
> the newspaper titles, issues and pages. You can also browse around
> using a linked data viewer like uriburner [5].
> 
> The implementation is a moving target, but you'll see we've cherry
> picked a few vocabularies: Dublin Core [6], Bibliographic Ontology
> [7], FOAF [8], and Object Reuse and Exchange (OAI-ORE) [9]. ORE in
> particular was extremely useful to us, since we wanted to enable the
> application's repository function, by exposing the digital objects
> (image files, ocr/xml files, pdfs) that make up the individual Page
> resources. For example:
> 
> <http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/seq-1#page>
> ore:aggregates
>   <http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/seq-1.jp2>,
>   <http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/seq-1.pdf>,
>   <http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/seq-1/ocr.txt>,
>   <http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/seq-1/ocr.xml>,
>   <http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/seq-1/thumbnail.jpg>
> .
> 
> The idea is to enable the harvesting of these repository objects out
> of the Chronicling American webapp. The only links out we have so far
> are from Newspaper Titles to the geographic regions that they are
> "about", and languages. So for example:
> 
> 
> dcterms:coverage
> ,
>  ;
> dcterms:language  .
> 
> Just these minimal links provide a huge amount of data enrichment to
> our original data. We also needed to create a handful of new
> vocabulary terms, which we made available as RDFa [10]. I would be
> interested in any feedback you have. Also, please feel free to fire up
> linked-data bots to crawl the space.
> 
> //Ed
> 
> [1] http://chroniclingamerica.loc.gov
> [2] http://chroniclingamerica.loc.gov/lccn/sn85066387/
> [3] http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/
> [4] http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/seq-1/
> [5] http://linkeddata.uriburner.com/about/html/http/chroniclingamerica.loc.gov/lccn/sn84026749%23title
> [6] http://dublincore.org/
> [7] http://bibliontology.com/
> [8] http://xmlns.com/foaf/spec/
> [9] http://www.openarchives.org/ore/1.0/vocabulary.html
> [10] http://chroniclingamerica.loc.gov/terms/
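
As a quick illustration of the auto-discovery Ed describes above, a rough
sketch of fetching one of the example pages, finding the application/rdf+xml
<link> element, and loading the RDF with rdflib (the library choice and the
exact <link> markup are assumptions on my side):

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

from rdflib import Graph


class RDFLinkFinder(HTMLParser):
    """Collects href values of <link> elements advertising application/rdf+xml."""

    def __init__(self):
        super().__init__()
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("type") == "application/rdf+xml":
            self.rdf_urls.append(a.get("href"))


page_url = "http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/seq-1/"
finder = RDFLinkFinder()
finder.feed(urlopen(page_url).read().decode("utf-8"))

graph = Graph()
for href in finder.rdf_urls:
    graph.parse(urljoin(page_url, href), format="xml")

print(len(graph), "triples describing", page_url)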




AW: fw: Google starts supporting RDFa -- 'rich snippets'

2009-05-13 Thread Chris Bizer
Hi Peter,

Don't know. In an O'Reilly post about Google's RDFa support, Guha says that they draw, 
and plan to draw, from existing vocabularies. 

"And we're not going to do this all by ourselves. As it is, we are drawing from 
several sources. We're drawing from microformats. We're drawing from vCard. And 
there are other places that you will see. And there's other people who know 
more about their topics than we could possibly know. And we'll draw on all of 
these things. So to come back and answer your question, we hope that the scope 
of this will be substantially more than the scope of all the particular data 
types that work today by microformats."

See http://radar.oreilly.com/2009/05/google-adds-microformat-parsin.html


Cheers

Chris


> -Ursprüngliche Nachricht-
> Von: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] Im
> Auftrag von Peter Ansell
> Gesendet: Mittwoch, 13. Mai 2009 13:35
> An: Chris Bizer
> Cc: public-lod@w3.org
> Betreff: Re: fw: Google starts supporting RDFa -- 'rich snippets'
> 
> Unlike Yahoo SearchMonkey, Google has chosen to mock up their own
> ontologies instead of recognising existing vocabularies.
> 
> Cheers,
> 
> Peter
> 
> 2009/5/13 Chris Bizer :
> > Very nice.  After Yahoo SearchMonkey has been around for a while,
> things are
> > now also moving at Google.
> >
> >
> >
> > See:
> > http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-
> snippets.html
> >
> >
> >
> > And Ivan’s comment on it:
> >
> > http://ivan-herman.name/2009/05/13/rdfa-google/
> >
> >
> >
> > Cheers,
> >
> >
> >
> > Chris
> >
> >
> >
> >
> >
> > Von: public-semweb-lifesci-requ...@w3.org
> > [mailto:public-semweb-lifesci-requ...@w3.org] Im Auftrag von Matthias
> > Samwald
> > Gesendet: Mittwoch, 13. Mai 2009 08:48
> > An: public-semweb-lifesci
> > Betreff: Google starts supporting RDFa -- 'rich snippets'
> >
> >
> >
> > Quite preliminary, but still noteworthy. See
> > http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-
> snippets.html
> >
> >
> >
> > They are also searching for new  vocabularies and data sources that
> they can
> > potentially support, I guess they will soon support the popular
> vocabularies
> > (FOAF, SIOC etc.) that are also supported by Yahoo Search Monkey [1].
> Maybe
> > we (the HCLS IG) could come up with a biomedical demo scenario based
> on RDFa
> > and propose that to Google?
> >
> >
> >
> > [1]
> http://developer.yahoo.com/searchmonkey/smguide/profile_vocab.html
> >
> >
> >
> > Cheers,
> > Matthias Samwald
> >
> >
> >
> > DERI Galway, Ireland
> > http://deri.ie/
> >
> >
> >
> > Konrad Lorenz Institute for Evolution & Cognition Research, Austria
> > http://kli.ac.at/
> >
> >




fw: Google starts supporting RDFa -- 'rich snippets'

2009-05-13 Thread Chris Bizer
Very nice.  After Yahoo SearchMonkey has been around for a while, things are
now also moving at Google.

 

See:

http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets.html

 

And Ivan's comment on it: 

http://ivan-herman.name/2009/05/13/rdfa-google/

 

Cheers,

 

Chris

 

 

Von: public-semweb-lifesci-requ...@w3.org
[mailto:public-semweb-lifesci-requ...@w3.org] Im Auftrag von Matthias
Samwald
Gesendet: Mittwoch, 13. Mai 2009 08:48
An: public-semweb-lifesci
Betreff: Google starts supporting RDFa -- 'rich snippets'

 

Quite preliminary, but still noteworthy. See
http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets.html

 

They are also searching for new  vocabularies and data sources that they can
potentially support, I guess they will soon support the popular vocabularies
(FOAF, SIOC etc.) that are also supported by Yahoo Search Monkey [1]. Maybe
we (the HCLS IG) could come up with a biomedical demo scenario based on RDFa
and propose that to Google? 

 

[1] http://developer.yahoo.com/searchmonkey/smguide/profile_vocab.html

 

Cheers,
Matthias Samwald

 

DERI Galway, Ireland
http://deri.ie/

 

Konrad Lorenz Institute for Evolution & Cognition Research, Austria
http://kli.ac.at/

 



CfP: 7th Semantic Web Challenge - Open Track and Billion Triples Track

2009-04-28 Thread Chris Bizer
 combination with 
static information
- The results should be as accurate as possible (e.g. use a ranking of results 
according to context)
- There is support for multiple languages and accessibility on a range of 
devices

For the Billion Triples Track:
~~
The primary goal of the Billion Triples Track is to demonstrate applications 
that can work on Web scale using realistic Web-quality data.  The organizers 
therefore provide a dataset of roughly a billion triples that has been crawled 
from the Web and has to be used by the applications.  The functionality of the 
applications can involve anything from helping people figure out what is in 
the dataset via browsing, visualization, profiling, etc.; it could include 
inferencing that adds information not directly queryable in the original 
dataset; etc.

Submissions for the Billion Triples Track must meet the following minimum 
requirements:

1. The tool or application has to make use of at least a significant portion of 
the data provided by the organizers.
2. The tool or application is allowed to use other data that can be linked to 
the target dataset, but there is still an expectation that the primary focus 
will be on the data provided.
3. The tool or application does not have to be specifically an end-user 
application, as defined for the Open Track Challenge, but usability is a 
concern.  The key goal is to demonstrate an interaction with the large data-set 
driven by a user or an application.  However, given the scale of this 
challenge, solutions that can be justified as leading to such applications, or 
as crucial to the success of future applications, will be considered.

It is desired that all applications assume an open world, i.e. that the 
information is never complete.  However, applications that can show useful ways 
to "close the world" for sections of the dataset will be considered.

Additional Desirable Features
-
In addition to the above minimum requirements, we note other desirable features 
that will be used as criteria to evaluate submissions.
-  The application should do more than simply store/retrieve large numbers of 
triples
-  The application or tool(s) should be scalable (in terms of  the amount  of 
data used and in terms of distributed components working together)
-  The application or tool(s) should show the use of the very large, mixed 
quality data set
-  The application should either function in real-time or, if pre-computation 
is needed, have a real-time realization (but we will take a wide view of "real 
time" depending on the scale of what is done)

How to participate
--
Visit http://challenge.semanticweb.org in order to participate and register for 
the Semantic Web Challenge by submitting the required information as well as a 
link to the application on the online registration form. The form will be open 
until October 1, 2009, 12am CET. 

The requirements of this entry are:

1) Abstract: no more than 200 words.
2) Description: The description will show details of the system including why 
the system is innovative, which features or functions the system provides, what 
design choices were made and what lessons were learned. Papers should not 
exceed eight pages and must be formatted according to the same guidelines as 
the papers in the Research Track (see http://iswc2009.semanticweb.org/).
3) Web access: The application should be accessible via the web. If the 
application is not publicly accessible, passwords should be provided. We also 
ask you to provide (short) instructions on how to start and use the application.

Descriptions will be published in the form of an online proceedings.

Prizes
--

A cash prize will be awarded to the winners, along with publicity for their 
work: 
1st prize: 1,000 € 
2nd prize: 500 € 
3rd prize: 250 € 

The winners will also be asked to give a live demonstration of their 
application at the ISWC 2009 conference. The best applications will also have a 
chance to appear as full papers in the Journal of Web Semantics.

In the event that one of the tracks receives fewer than a minimal number of 
submissions, the organizers reserve the right to merge the two tracks of the 
competition.

IMPORTANT DATES
--
October 1, 2009: Submissions due
October 25-29, 2009: ISWC 2009 Technical Program

SWC Co-Chairs
-
Chris Bizer (Freie Universität Berlin)
Peter Mika (Yahoo! Research Barcelona)


Contact:

Peter Mika 
Yahoo! Research Barcelona 
Avinguda Diagonal 177, 8th floor 
Barcelona, 08018 
Catalunya, Spain 
(Phone) +34 93 183-8846 
(Fax) + 34 93 183-8901 
Email: pmika at yahoo-inc.com 
Web: http://www.cs.vu.nl/~pmika/


Cheers,

Chris Bizer and Peter Mika






Madrid/WWW2009 Linked Data Community Gathering

2009-04-01 Thread Chris Bizer
Hi all,

 

there will be again a Linked Data Community Gathering at WWW2009 in Madrid
on 20th April 2009, after the LDOW2009 workshop
http://events.linkeddata.org/ldow2009/.

 

The gathering will take place in a nearby restaurant/pub and will include a
mixture of beer, food and lightning talks. 

We are meeting at the lobby of the conference center at 7pm. 

 

If you want to join in, please add your name to the list at

 

http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/
MadridGathering

 

See you in Madrid and let's hope that the gathering will be as entertaining
and inspiring as our last meeting in Karlsruhe :-)

 

Cheers,

 

Chris

 

 

 

 

 



AW: sanity checking the LOD Cloud statistics - Please add the statistics for your dataset to the Wiki

2009-04-01 Thread Chris Bizer
Hi Ted,

good that you raise this topic. 

The statistics were added to the wiki by Anja and reflect her
knowledge/guesses about the size of the datasets and the numbers of links
between them. And of course, some of her guesses might be wrong.  

In an ideal world, these statistics would be provided by Semantic Web search
engines that crawl the cloud and calculate the statistics afterwards based
on what they actually got from the Web. Alternatively, all dataset providers
could publish Void descriptions of their datasets which could also be used
to generate the statistics.

But as the search engines have not yet reached this point and as Void is
also not used by all data providers, we thought it would be useful to put
these statistics as a starting point into the Wiki so that people
(especially data set publishers) can update them and we can use them when we
draw the LOD cloud the next time.

Yesterday I updated the statistics about outgoing links connecting DBpedia
with other datasets.

If everybody on this list would do the same for the data sources they
maintain/use, I think we will get a much more accurate LOD diagram the next
time we draw it.

So, please: Take 5 minutes and quickly add the actual statistics about your
datasets to

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics
(size of your dataset)

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/LinkStatistics
(number of links connecting your dataset with other datasets)
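
If you are unsure how to get these numbers, here is a rough sketch of
computing them against your own SPARQL endpoint (the endpoint URL and the
DBpedia namespace are just examples, and the COUNT aggregate assumes your
store supports it, as for instance Virtuoso does):

from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/sparql")  # your endpoint
endpoint.setReturnFormat(JSON)

def count(query):
    endpoint.setQuery(query)
    result = endpoint.query().convert()
    return int(result["results"]["bindings"][0]["n"]["value"])

# Overall size of the dataset.
triples = count("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")

# Outgoing links, here: all triples pointing into the DBpedia namespace.
links = count("""
    SELECT (COUNT(*) AS ?n)
    WHERE { ?s ?p ?o . FILTER regex(str(?o), "^http://dbpedia.org/resource/") }
""")

print("triples:", triples, "- links to DBpedia:", links)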

Thanks a lot in advance!

Cheers

Chris




> -Ursprüngliche Nachricht-
> Von: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] Im
> Auftrag von Ted Thibodeau Jr
> Gesendet: Mittwoch, 1. April 2009 08:06
> An: public-lod@w3.org
> Betreff: sanity checking the LOD Cloud statistics
> 
> Hello, all --
> 
> I've had a few minutes to start working to update my version [1] of the
> LOD Cloud diagram [2], which means I got to start looking at the Data
> Set Statistics [3] and Link Statistics [4] pages.
> 
> I have found a number of apparent discrepancies.  I'm not sure where
> these
> came from, but I think they need attention and correction.
> 
> [3] gave some round, and some exact values.  It's not at all clear
> whether
> these values were originally intended to reflect triple-counts in the
> data
> set, URIs minted there (i.e., Entities named there), or something else
> entirely.  I think the page holds a mix of these, which makes them
> rather
> troublesome as a source of comparison between data sets.
> 
> [4] had few exact values, which appear to have been incorrectly added
> there,
> and apparently means to use only 3 "counts" for the inter-set linkages
> --
> "> 100", "> 1000" "> 100.000".  Clearly, the last means more-than-one-
> hundred-thousand -- because the first clearly means more-than-one-
> hundred --
> but this was not obvious at first glance, given my US-training that the
> period is used for the decimal, not for the thousands delimiter.
> 
> First thing, therefor, I suggest that all period-delimiters on [4]
> change
> to comma-delimiters, to match the first page.  (I've actually made this
> change, but incorrect values may well remain -- please read on.)
> 
> I think it also makes sense to add "> 10,000", and "> 1,000,000" to the
> values here.  Just looking at the DBpedia "actual counts" which were on
> the page, it's clear that a log-scale comparing the interlinkage levels
> presents a better picture than the three arbitrarily chosen levels.
> (Again, I've started using these as relevant.)
> 
> 
> Now to the discrepancies.  From [3], I got this line --
> 
>    BBC Playcount Data  10,000
> 
> At first read, I thought that meant 10,000 triples.  But [4] indicated
> these external link counts for BBC Playcount Data --
> 
> BBC Programmes > 100.000
>   Musicbrainz> 100.000
> 
> I don't see a way for 10,000 triples to include 200,000 external links.
> That means that the first count must be of Entities.  But going to the
> BBC Playcount home page [5], I found --
> 
> Triple count1,954,786
> Distinct BBC Programmes resources   6,863
> Distinct Musicbrainz resources  7,055
> 
> An obvious missing number here is a count of minted URIs -- that is, of
> BBC Playcount resources/entities -- but I also learned that BBC
> Playcount
> URIs are not pointers-to-values, but values-in-themselves.  The count
> is
> *embedded* in the URI (and thus, if a count changes, the URI changes!)
> --
> 
> A playcount URI in this service looks like:
> 
>    http://dbtune.org/bbc/playcount/<id>_<count>
> 
> Where <id> is the id of the episode or the brand, as in /programmes BBC
> catalogue, and <count> is a number between 0 and the number of playcounts
> for the episode or the brand.
> 
> If we accept this URI construction

Linked Data on the Web (LDOW2009) workshop papers online.

2009-03-17 Thread Chris Bizer
Hi all,

 

we  are happy to announce that the papers of this year's Linked Data on the
Web workshop are online now and can be accessed at

 

http://events.linkeddata.org/ldow2009/

 

Looking at the program, we think that LDOW2009 is going to be again an
exciting event.

 

Congratulations to the authors and lots of thanks to the members of the LDOW
program committee for all their tough reviewing work!

 

We are looking forward to see you in Madrid.

 

Cheers,

 

Chris Bizer, Tom Heath, Tim Berners-Lee, Kingsley Idehen 

(LDOW 2009 Organizing Committee)

 



New LOD Cloud

2009-03-05 Thread Chris Bizer
Not to forget the new colored-by-topic version of the diagram at

http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05_colored.png

Cheers

Chris


> -Ursprüngliche Nachricht-
> Von: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] Im
> Auftrag von Anja Jentzsch
> Gesendet: Donnerstag, 5. März 2009 16:56
> An: public-lod@w3.org
> Betreff: Re: New LOD Cloud - Please send us links to missing data
> sources
> 
> Hi all,
> 
> thanks for all your input.
> 
> The LOD Cloud as of March 2009 is final and online.
> 
> You can find it over at
> http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpen
> Data
> along with a colored by topic version and various formats.
> 
> I will update the dataset table and put a linkage table on the dataset
> page later today. It would be extremely useful keeping these tables up
> to date.
> 
> Anja
> 
> Anja Jentzsch schrieb:
> > Hi all,
> >
> > we are currently updating the LOD cloud. Find the draft here:
> > http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-02-27.png
> >
> > We have already added:
> >
> > 1. the RKBExplorer cloud
> > 2. the Bio2RDF cloud
> > 3. the LODD cloud
> > 4. GeoSpecies
> > 5. LIBRIS
> >
> > Statistics on triple and link count (as well as target sources) are
> > missing for the following sources:
> > - Faviki
> > - RDFohloh
> > - OpenCalais
> > - LIBRIS
> >
> > Did we forget any new data sources or links between data sources?
> >
> > Keep in mind: A data source qualifies for the cloud, if the data is
> > available via dereferencable URIs and if the data source is
> interlinked
> > with at least one other source (meaning it references URIs within the
> > namespace of the other source).
> >
> > Anja
> >




AW: ANN: Silk - Link Discovery Framework for the Web of Data released.

2009-03-02 Thread Chris Bizer
Hi Stephane,

 

I would say:

 

Silk is about discovering data links (finding out that two data sources talk
about the same real world entity, or that there is a specific other semantic
relation between entities in different data sources).

VoiD is about describing (providing meta-information about) the links that
you have discovered.

 

So Silk and Void play nicely together and a workflow for a data publisher
could be:

 

1.   Publish his dataset.

2.   Use Silk to discover links between his data source and other data
sources on the Web.

3.   Publish these data links together with a Void description on the
Web.

 

In order to support people in using Void, we are thinking about extending
Silk with the ability to output a basic Void description about the
discovered linkset.
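
Such a description does not need to be long. A minimal sketch (hand-written
here, not Silk output) of what a basic Void description of a discovered
linkset could look like, using rdflib and placeholder dataset URIs:

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL, RDF

VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
g.bind("void", VOID)
g.bind("owl", OWL)

# Placeholder URIs for the linkset and the two datasets it connects.
linkset = URIRef("http://example.org/void/mydataset-dbpedia-links")
g.add((linkset, RDF.type, VOID.Linkset))
g.add((linkset, VOID.linkPredicate, OWL.sameAs))
g.add((linkset, VOID.subjectsTarget, URIRef("http://example.org/void/mydataset")))
g.add((linkset, VOID.objectsTarget, URIRef("http://example.org/void/dbpedia")))

print(g.serialize(format="turtle"))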

 

Cheers,

 

Chris

 

 

Von: Stephane Fellah [mailto:fella...@gmail.com] 
Gesendet: Montag, 2. März 2009 18:58
An: Chris Bizer
Cc: public-lod@w3.org; Semantic Web;
dbpedia-discuss...@lists.sourceforge.net
Betreff: Re: ANN: Silk - Link Discovery Framework for the Web of Data
released.

 

Chris,

 

I welcome this initiative. Could you explain how your approach differs from
the VoiD initiative: http://semanticweb.org/wiki/VoiD

 

Best regards

Stephane Fellah

 

On Mon, Mar 2, 2009 at 10:14 AM, Chris Bizer  wrote:

Hi all,

 

we are happy to announce the initial public release of Silk, a link
discovery framework for the Web of Data.

 

The Web of Data is built upon two simple ideas: Employ the RDF data model to
publish structured data on the Web and to set explicit RDF links between
entities within different data sources. While there are more and more tools
available for publishing Linked Data on the Web, there is still a lack of
tools that support data publishers in setting RDF links to other data
sources on the Web. With the Silk - Link Discovery Framework, we hope to
contribute to filling this gap.

 

Using the declarative Silk – Link Specification Language (Silk-LSL), data
publishers can specify which types of RDF links should be discovered between
data sources and which conditions data items must fulfill in order to be
interlinked. These link conditions can apply different similarity metrics to
multiple properties of an entity or related entities which are addressed
using a path-based selector language. The resulting similarity scores can be
weighted and combined using various similarity aggregation functions. Silk
accesses data sources via the SPARQL protocol and can thus be used to
discover links between local and remote data sources.

 

The main features of the Silk framework are:

- it supports the generation of owl:sameAs links as well as other types of
RDF links. 

- it provides a flexible, declarative language for specifying link
conditions. 

- it can be employed in distributed environments without having to replicate
datasets locally. 

- it can be used in situations where terms from different vocabularies are
mixed and where no consistent RDFS or OWL schemata exist. 

- it implements various caching, indexing and entity pre-selection methods
to increase performance and reduce network load.

 

More information about Silk, the Silk-LSL language specification, as well as
several examples that demonstrate how Silk is used to set links between
different data sources in the LOD cloud is found at:

 

http://www4.wiwiss.fu-berlin.de/bizer/silk/

 

The Silk framework is provided under the terms of the BSD license and can be
downloaded from

 

http://code.google.com/p/silk/

 

Happy linking,

 

Julius Volz, Christian Bizer

 

 

 

 



ANN: Silk - Link Discovery Framework for the Web of Data released.

2009-03-02 Thread Chris Bizer
Hi all,

 

we are happy to announce the initial public release of Silk, a link
discovery framework for the Web of Data.

 

The Web of Data is built upon two simple ideas: Employ the RDF data model to
publish structured data on the Web and to set explicit RDF links between
entities within different data sources. While there are more and more tools
available for publishing Linked Data on the Web, there is still a lack of
tools that support data publishers in setting RDF links to other data
sources on the Web. With the Silk - Link Discovery Framework, we hope to
contribute to filling this gap.

 

Using the declarative Silk - Link Specification Language (Silk-LSL), data
publishers can specify which types of RDF links should be discovered between
data sources and which conditions data items must fulfill in order to be
interlinked. These link conditions can apply different similarity metrics to
multiple properties of an entity or related entities which are addressed
using a path-based selector language. The resulting similarity scores can be
weighted and combined using various similarity aggregation functions. Silk
accesses data sources via the SPARQL protocol and can thus be used to
discover links between local and remote data sources.
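
To give a flavour of the underlying idea (this is neither Silk-LSL syntax nor
the Silk implementation, just a toy sketch in Python with invented records):
compare selected properties with similarity metrics, aggregate the weighted
scores, and emit an owl:sameAs link for every pair above a threshold.

from difflib import SequenceMatcher

source = [{"uri": "http://example.org/a/Berlin",
           "label": "Berlin", "founded": "1237"}]
target = [{"uri": "http://example.org/b/Berlin_Germany",
           "label": "Berlin, Germany", "founded": "1237"},
          {"uri": "http://example.org/b/Bern",
           "label": "Bern", "founded": "1191"}]

def string_similarity(a, b):
    """One simple string metric; Silk lets you choose among several."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def aggregate(scores_and_weights):
    """Weighted average -- one possible aggregation function."""
    total = sum(w for _, w in scores_and_weights)
    return sum(s * w for s, w in scores_and_weights) / total

THRESHOLD = 0.65
for s in source:
    for t in target:
        score = aggregate([
            (string_similarity(s["label"], t["label"]), 0.7),
            (1.0 if s["founded"] == t["founded"] else 0.0, 0.3),
        ])
        if score >= THRESHOLD:
            print(f"<{s['uri']}> <http://www.w3.org/2002/07/owl#sameAs> <{t['uri']}> .")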

 

The main features of the Silk framework are:

- it supports the generation of owl:sameAs links as well as other types of
RDF links. 

- it provides a flexible, declarative language for specifying link
conditions. 

- it can be employed in distributed environments without having to replicate
datasets locally. 

- it can be used in situations where terms from different vocabularies are
mixed and where no consistent RDFS or OWL schemata exist. 

- it implements various caching, indexing and entity pre-selection methods
to increase performance and reduce network load.

 

More information about Silk, the Silk-LSL language specification, as well as
several examples that demonstrate how Silk is used to set links between
different data sources in the LOD cloud is found at:

 

 
http://www4.wiwiss.fu-berlin.de/bizer/silk/

 

The Silk framework is provided under the terms of the BSD license and can be
downloaded from

 

  http://code.google.com/p/silk/

 

Happy linking,

 

Julius Volz, Christian Bizer

 

 

 



AW: New LOD Cloud - Please send us links to missing data sources

2009-02-28 Thread Chris Bizer
Hi Kingsley,

> You have MySpace and Flickr Wrappers but still don't include all the 
> Virtuoso Sponger Cartridges (which are wrappers) to the Cloud [1] ?

The last time I checked the cartridges, I had the impression that they were
not very much interlinked with the rest of the LOD cloud and about half of
them were down.

The links to DBpedia were rather strange. For instance, the first link I
found owl:sameAs'ed a Yahoo Finance balance sheet with the DBpedia bed sheet,
which, even for me as a big supporter of owl:sameAs links, is a bit too much
of a semantic gap.

Did you improve the quality of the external links in the meantime and do the
cartridges regularly deliver data?

Another problem is that many of the sources are not really Open Data as
various license restrictions apply.

Cheers

Chris


-Ursprüngliche Nachricht-
Von: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] Im Auftrag
von Kingsley Idehen
Gesendet: Samstag, 28. Februar 2009 00:18
An: Anja Jentzsch
Cc: public-lod@w3.org
Betreff: Re: New LOD Cloud - Please send us links to missing data sources

Anja Jentzsch wrote:
> Hi all,
>
> we are currently updating the LOD cloud. Find the draft here: 
> http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-02-27.png
>
> We have already added:
>
> 1. the RKBExplorer cloud
> 2. the Bio2RDF cloud
> 3. the LODD cloud
> 4. GeoSpecies
> 5. LIBRIS
>
> Statistics on triple and link count (as well as target sources) are 
> missing for the following sources:
> - Faviki
> - RDFohloh
> - OpenCalais
> - LIBRIS
>
> Did we forget any new data sources or links between data sources?
>
> Keep in mind: A data source qualifies for the cloud, if the data is 
> available via dereferencable URIs and if the data source is 
> interlinked with at least one other source (meaning it references URIs 
> within the namespace of the other source).
>
> Anja
>
>
Anja,

You have MySpace and Flickr Wrappers but still don't include all the 
Virtuoso Sponger Cartridges (which are wrappers) to the Cloud [1] ?

Also, the LODD data sets page should be linked to: 
http://esw.w3.org/topic/DataSetRDFDumps, so we can track down the dumps 
with ease re. the Virtuoso LOD hosting instance.

links:

1. http://virtuoso.openlinksw.com/images/sponger-cloud.html

-- 


Regards,

Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software Web: http://www.openlinksw.com








ANN: D2R Server and D2RQ V0.6 released.

2009-02-19 Thread Chris Bizer
Hi all,

we are happy to announce the release of D2R Server and D2RQ Version 0.6 and
recommend all users to replace old installations with the new release.

The new release features:

1. significantly better performance due to an improved SPARQL-to-SQL
rewriting algorithm. First experiments with the Berlin SPARQL Benchmark
showed a factor 7 speedup.
2. D2R Server now supports dereferencing vocabulary URIs and the publication
of vocabulary mappings such as owl:equivalentClass.
3. Oracle and PostgreSQL support. Besides MySQL, we tested the new release
with Oracle and PostgreSQL as underlying RDBMS using the Berlin SPARQL
Benchmark qualification test.
4. lots of minor and major bug fixes.

More information about the tools is found on the

1. D2RQ Platform website: http://www4.wiwiss.fu-berlin.de/bizer/d2rq/
2. D2R Server website: http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/

The new releases can be downloaded from Sourceforge

http://sourceforge.net/projects/d2rq-map/
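
To try the new release from code, a minimal sketch of querying a D2R Server
instance from Python with SPARQLWrapper (the endpoint URL assumes a local
server on D2R's default port 2020; adapt the query to your mapped database):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:2020/sparql")  # assumed local D2R Server
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?s ?label
    WHERE { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label }
    LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["s"]["value"], row["label"]["value"])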

Lots of thanks for their magnificent work to:

1. Andreas Langegger and Herwig Leimer (Johannes Kepler Universität Linz)
for optimizing the SPARQL-to-SQL rewriting algorithm.
2. Christian Becker (Freie Universität Berlin) for his work on vocabulary
serving in D2R Server, fixing various bugs, testing V0.6 with Oracle and
PostgreSQL, and putting the V0.6 release together. 
3. Richard Cyganiak (DERI Galway / Freie Universität Berlin) for
coordinating the work around the release, fixing lots of bugs, and
supporting the user community on the D2RQ mailing list for the last 2 years.


Please send feedback about the new release to the D2RQ and D2R Server
mailing list 

https://lists.sourceforge.net/lists/listinfo/d2rq-map-devel

Have fun with D2RQ and D2R Server!

Cheers,

Chris

--
Chris Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
ch...@bizer.de





CFP: 5th Workshop on Scripting for the Semantic Web (SFSW09), co-located with ESWC09

2009-01-22 Thread Chris Bizer
--

CALL FOR PAPERS

--

 

5th Workshop on Scripting and Development for the Semantic Web - colocated
with the 6th European Semantic Web Conference May/June, 2009, Crete, Greece

 

http://semanticscripting.org/SFSW2009/

 

 

--

Objectives

--

 

On the current Semantic Web there is an ever increasing need for
lightweight, flexible solutions for publishing, presenting,
transforming, integrating and generally manipulating data, in order to support
and make use of the increasing number of deployed open linked datasets and
publicly available semantic applications.

Communication and architectural standards such as AJAX, REST, JSON already
cater to this need for flexible, lightweight solutions, and they are well
supported by scripting languages such as PHP, JavaScript, Ruby, Python,
Perl, JSP and ActionScript.

 

This workshop is concerned with the exchange of tools, experiences and
technologies for the development of such lightweight tools, especially
focusing on the use of scripting languages. Last year's workshop focused on
the creation of Semantic Web data through social interactions as well as
applications that integrate socially-created data across communities.

Keeping in step with the increasing number of semantically enabled
websites for public consumption, this year's focus is bringing Semantic Web
applications to the mainstream: everything from improving the user
experience for browsing and accessing data, through integrating with
existing non-semantic services, to quickly and cheaply porting such services
to a Semantic Web architecture.

 

The workshop will follow the tradition and include a scripting challenge
which will award an industry sponsored prize to the most innovative
scripting application.

 

--

Topics of Interest

--

 

Topics of interest include, but are not limited to:

 

Infrastructure:

- Lightweight Semantic Web frameworks and APIs

- Lightweight implementations of RDF repositories, query languages, and

  reasoning engines

- Semantic Web publishing and data syndication frameworks

- Approaches to crawling Web data and querying distributed data on the

  Web

 

Applications:

- Lightweight and flexible Semantic Web applications

- Approaches to RDF-izing existing applications, such as RDFa,

  microformats, or GRDDL

- Mashups that provide RDF views on Web 2.0 data sources such as Google,

  Yahoo, Amazon, or eBay

- Wikis, weblogs, data syndication and content management applications

  using RDF

- RDF/OWL editors and authoring environments

- Scripting applications for visualizing Web data

- Semantic Web Mining and Social Network Analysis

- Mashups that demonstrate the novel capabilities of Semantic Web
technologies

 

Conceptual:

- Rapid development techniques for the Semantic Web

- Rapid migration of web-applications to the Semantic Web

- Employment of scripting language characteristics for Semantic Web

  development

- Scalability and benchmarks of Semantic Web scripting applications

 

--

Scripting Challenge

--

 

As in previous years, there will be a scripting challenge awarding a prize
for the most innovating small scripting application or mashup.

 

Details will appear soon on the workshop webpage and in a separate CFP

 

--

Submissions

--

We seek three kinds of submissions:

 

- Full papers - should not exceed 12 pages in length.

- Short papers - are expected up to 6 pages.

- Scripting Challenge Submissions - 2 page description of the

  application, ideally accompanied with the source code and a link to an

  online demo.

 

-

Important Dates

--

 

Submission deadline:

March 7, 2009

Notification of acceptance:

April 4, 2009

Camera-ready paper submission:

April 18, 2009

 

--

WORKSHOP CHAIRS

--

 

* Gunnar Aastrand Grimnes, DFKI Knowledge Management Lab, Germany

* Chris Bizer, Freie Universität Berlin, Germany

* Sören Auer, Universität Leipzig, Germany

 

--

Additional Information

--

 

Additional information can be found on the workshop webpage at:

 

http://semanticscripting.org/SFSW2009/

2nd CfP: Linked Data on the Web Workshop (LDOW2009) at WWW2009, Madrid, Spain

2009-01-14 Thread Chris Bizer

---
CALL FOR PAPERS
---

LINKED DATA ON THE WEB (LDOW2009)
Workshop at WWW2009, April 2009, Madrid, Spain

Submission Deadline: February 7th 2009

http://events.linkeddata.org/ldow2009/


OVERVIEW


The Web is increasingly understood as a global information space
consisting not just of linked documents, but also of linked data. More
than just a vision, the Web of Data has been brought into being by the
maturing of the Semantic Web technology stack, and by the publication
of an increasing number of datasets according to the principles of
Linked Data. Today, this emerging Web of Data includes data sets as
extensive and diverse as DBpedia, Geonames, US Census, EuroStat,
MusicBrainz, BBC Programmes, Flickr, DBLP, PubMed, UniProt, FOAF,
SIOC, OpenCyc, UMBEL and Yago. The availability of these and many
other data sets has paved the way for an increasing number of
applications that build on Linked Data, support services designed to
reduce the complexity of integrating heterogeneous data from
distributed sources, as well as new business opportunities for
start-up companies in this space.

Building on the success of last year's LDOW workshop at WWW2008 in
Beijing, the LDOW2009 workshop aims to provide a forum for presenting
the latest research on Linked Data and drive forward the research
agenda in this area. While last year's workshop focused on the
publication of Linked Data, this year's workshop will focus on Linked
Data application architectures, linking algorithms and Web data
fusion.


TOPICS OF INTEREST


Topics of interest for the workshop include, but are not limited to,
the following:

* Data Linking and Fusion
o linking algorithms and heuristics, identity resolution
o Web data integration and data fusion
o evaluating quality and trustworthiness of Linked Data

* Linked Data Application Architectures
o crawling, caching and querying Linked Data on the Web;
optimizations, performance
o Linked Data browsers, search engines
o applications that exploit distributed Web datasets

* Data Publishing
o tools for publishing large data sources as Linked Data on the
Web (e.g. relational databases, XML repositories)
o embedding data into classic Web documents (e.g. GRDDL, RDFa,
Microformats)
o licensing and provenance tracking issues in Linked Data publishing
o business models for Linked Data publishing and consumption


SUBMISSIONS


We seek three kinds of submissions:

* Full technical papers: up to 10 pages in ACM format
* Short technical and position papers: up to 5 pages in ACM format
* Demo description: up to 2 pages in ACM format

Submissions must be formatted according to the ACM SIG Proceedings
Templates. Submissions will be peer reviewed by three independent
reviewers. Accepted papers will be presented at the workshop and
included in the workshop proceedings.

Please submit your papers via EasyChair at 
http://www.easychair.org/conferences/?conf=ldow2009

Proceedings will be published online at CEUR-WS.


IMPORTANT DATES


* Submission deadline: February 7th 2009
* Notification of acceptance: February 23rd 2009
* Camera-ready versions of accepted papers: March 7th 2009


ORGANISING COMMITTEE


* Christian Bizer, Freie Universität Berlin, Germany
* Tom Heath, Talis Information Ltd., UK
* Tim Berners-Lee, W3C/MIT, USA
* Kingsley Idehen, OpenLink Software, USA


PROGRAMME COMMITTEE


* Alan Ruttenberg, Science Commons, USA 
* Andreas Harth, DERI, Ireland 
* Andy Seaborne, Hewlett Packard Labs, UK 
* Bernard Vatant, Mondeca, France 
* David Peterson, Boab Interactive, Australia 
* Denny Vrandecic, University of Karlsruhe, Germany 
* Eyal Oren, VU Amsterdam, Netherlands 
* Frédérick Giasson, Structured Dynamics, Canada 
* Georgi Kobilarov, Freie Universität Berlin, Germany 
* Giovanni Tummarello, DERI Galway, Ireland 
* Gong Cheng, Southeast University, China 
* Harith Alani, University of Southampton, UK 
* Harry Halpin, University of Edinburgh, UK 
* Hugh Glaser, University of Southampton, UK 
* Ian Davis, Talis, UK 
* Ivan Herman, World Wide Web Consortium, USA 
* Jamie Taylor, Metaweb, USA 
* Jim Hendler, RPI, USA 
* Jonathan Gray, Open Knowledge Foundation, UK 
* Jun Zhao, Oxford University, UK 
* Juan Sequeda, UT Austin, USA 
* Knud Möller, DERI Galway, Ireland 
* Mariano Consens, University of Toronto, Canada 
* Martin Hepp, Universität der Bundeswehr München, Germany 
* Mathieu d'Aquin, The Open University, UK 
* Michael Bergman, Structured Dynamics, USA 
* Michael Hausenblas, DERI, Ireland 
* Michiel Hildebrand, CWI, Netherlands 
* Mischa Tuffield, Garlik, UK 
* Oktie Hassanzadeh, University of Toronto, Canada 
* Olaf Hartig, Humboldt University Berlin, Germany 
* Orri Erling, OpenLink So

IJSWIS Special Issue on Linked Data - Deadline Extension

2009-01-06 Thread Chris Bizer

Hi all,

due to multiple requests, we extend the submission deadline for the IJSWIS
Special Issue on Linked Data from January 7, 2009 to January 26, 2009.

Detailed information about the special issue as well as the submission
process is found below and on the special issue webpage: 

http://www.ijswis.org/?q=node/29

Cheers,

Chris


Introduction
---

The Web is increasingly understood as a global information space, consisting
not just of linked documents, but also of linked data. In addition to the
maturing of the Semantic Web technology stack, a major catalyst in this
transition has been the application of the Linked Data principles [1],
hand-in-hand with the publication and dense, mutual interlinking of
large-scale data sets distributed across the Web [2]. This movement has
brought the vision of a “Web of Data” closer to realization than ever
before.
However, the emergence of Linked Data on a Web scale raises numerous novel
and significant research challenges that touch on both the “semantics” and
“Web” aspects of the Semantic Web vision. These challenges are diverse in
nature and rang from algorithmic approaches for linking and fusing Web data,
over those of providing user applications on top of distributed and
heterogeneous data sets, to social and business questions related to the
production and consumption of Linked Data. Building on successful events in
the field, such as the 1st Workshop on Linked Data on the Web (LDOW2008)
[3], the goal of this special issue is to solicit high quality, original
research contributions on all aspects of Linked Data, thereby capturing the
state of the art and stimulating further developments in this and related
areas.

Topics
--

Topics of interest for this IJSWIS special issue include, but are not
limited to:

+ Data Linking and Fusion
   - Identity resolution 
   - Linking algorithms and heuristics 
   - Data fusion and integration 

+ Linked Data Application Architectures
   - Crawling, caching, and querying Linked Data from the Web
   - Evaluating the quality, trustworthiness, and task-appropriateness
     of Web data
   - Reasoning with and over Web data
   - User-facing applications that exploit Linked Data
  - Linked Data browsers and analysis interfaces
  - Linked Data search engines and query interfaces
  - User interaction and interface issues in Linked Data applications
   - Publishing legacy data sources as Linked Data on the Web
   - Publishing user-generated content as Linked Data on the Web 
 
+ Business Models and Social Aspects
   - Business models for Linked Data publishing and consumption
   - Licensing and other legal issues in Linked Data publishing
   - Authority and provenance tracking
   - Privacy and the Web of Data


Submission Process
--

Submissions to this special issue should follow the journal's guidelines for
submission
(http://www.idea-group.com/journals/details.asp?ID=4625&v=guidelines). After
submitting a paper, please also inform the guest editors by email,
indicating the paper ID assigned by the submission system. Papers must be of
high quality and should clearly state the technical issue(s) being addressed
as related to Linked Data on the Web. Research papers should present a proof
of concept for any novel technique they are proposing. Wherever possible,
submissions should demonstrate the contribution of the research by reporting
on a systematic evaluation of the work. If a submission is based on a prior
publication in a workshop or conference, the journal submission must involve
substantial advance (a minimum of 30%) in conceptual terms as well as in
exposition (e.g., more comprehensive testing/evaluation/validation or
additional applications/usage). If this applies to your submission, please
explicitly reveal the relevant previous publications.

All papers must be submitted by January 26, 2009 (extended deadline). The
recommended length of submitted papers is between 5,500 and 8,000 words. All
papers are subject to peer review performed by at least three established
researchers drawn from a panel of experts selected for this special issue.
Accepted papers will undergo a second cycle of revision and reviewer
feedback. Please submit
manuscripts as a PDF file using the online submission system.

The International Journal on Semantic Web and Information Systems (IJSWIS)
is the first Semantic Web journal to be included in the Thomson ISI citation
index. More information on the journal can be found at
http://www.ijswis.org. 


Important Dates
---

+ NEW DEADLINE January 26, 2009: Submission deadline
+ February 27, 2009: Notification of acceptance
+ March 27, 2009: Camera-ready papers due
+ Second quarter of 2009: Publication

Special Issue Organizing Committee
--

+ Chris Bizer (Freie Universität Berlin, Germany)
+ Tom Heath (Talis Information Ltd, United Kingdom)
+ Martin Hepp (Universität der Bundeswehr München, Germany)


References
--

[1] Berners-Lee, Tim (2006) Design Issues: Linked Data.
http://www.w3.org/DesignIssues/LinkedData.html
[2] W3C SWEO Linking Open Data Project (2008)
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/
[3] 1st Workshop on Linked Data on the Web (LDOW2008), WWW2008, Beijing,
China http://events.linkeddata.org/ldow2008

2nd CfP: IJSWIS Special Issue on Linked Data

2008-11-27 Thread Chris Bizer
Committee
--

+ Chris Bizer (Freie Universität Berlin, Germany)
+ Tom Heath (Talis Information Ltd, United Kingdom)
+ Martin Hepp (Universität der Bundeswehr München, Germany)


References
---

[1] Berners-Lee, Tim (2006) Design Issues: Linked Data.   
http://www.w3.org/DesignIssues/LinkedData.html
[2] W3C SWEO Linking Open Data Project (2008) 
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/
[3] 1st Workshop on Linked Data on the Web (LDOW2008), WWW2008, Beijing,
China http://events.linkeddata.org/ldow2008

--
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
[EMAIL PROTECTED]





AW: Domain and range are useful Re: DBpedia 3.2 release, including DBpedia Ontology and RDF links to Freebase

2008-11-19 Thread Chris Bizer

Hi Dan and all,

it looks to me as if we are trying to solve a variety of different use cases
with a single solution and thus run into problems here.

There are three separate use cases that people participating in the
discussion seem to have in mind:

1. Visualization of the data
2. Consistency checking
3. Interlinking ontologies/schemata on the Web as basis for data integration


For visualization, range and domain constraints are somewhat useful (as TimBL
said), but this usefulness is very indirect.
For instance, even simple visualizations will need to put the large number
of DBpedia properties into a proper order and ideally would also support
views on different levels of detail. Both are tasks where range and domain
don't help much, but which are covered by other technologies like Fresnel
(http://www.w3.org/2005/04/fresnel-info/manual/). So for visualization, I
think it would be more useful if we started publishing Fresnel lenses
for each class in the DBpedia ontology.

As Jens said, the domains and ranges can be used for checking instance data
against the class definitions and thus for detecting inconsistencies (this
usage is not really covered by the RDFS specification, as Paul remarked, but
still many people do this). As Wikipedia contains a lot of inconsistencies
and as we don't want to reduce the amount of extracted information too much,
we decided to publish the loose instance dataset which also contains property
values that might violate the constraints. I say "might" as we only know for
sure that something is a person if the Wikipedia article contains a
person-related template. If it does not, the thing could be a person or not.
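
To illustrate the kind of constraint-style check meant here, a minimal
sketch using the rdflib Python library (the snippet of data and the
property/class URIs are made up for the example and are not taken from the
actual DBpedia dumps):

    # Sketch of a "constraint-style" domain check over instance data.
    # Note: strict RDFS semantics would *infer* the missing type rather
    # than flag a violation; this check treats rdfs:domain as a constraint.
    from rdflib import Graph
    from rdflib.namespace import RDF, RDFS

    g = Graph()
    g.parse(data="""
        @prefix dbo:  <http://dbpedia.org/ontology/> .
        @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
        @prefix ex:   <http://example.org/> .

        dbo:birthPlace rdfs:domain dbo:Person .
        ex:Berlin a dbo:Place .
        ex:Berlin dbo:birthPlace ex:Germany .    # subject not typed as dbo:Person
    """, format="turtle")

    for prop, domain in g.subject_objects(RDFS.domain):
        for s, o in g.subject_objects(prop):
            if (s, RDF.type, domain) not in g:
                print("possible violation:", s, "uses", prop, "but is not a", domain)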

Which raises the question: Is it better for DBpedia to keep the constraints
and publish instance data that might violate these constraints or is it
better to loosen the constraints and remove the inconsistencies this way? Or
keep things as they are, knowing that range and domain statements are hardly
used anyway by existing Semantic Web applications that work with data from
the public Web? (Are there any? FalconS?)

For the third use case of interlinking ontologies/schemata on the Web in
order to integrate instance data afterwards, it could be better to remove
the domain and range statements as this prevents inconsistencies when
ontologies/schemata are interlinked. On the other hand it is likely that the
trust layers of Web data integration frameworks will ignore the domain and
range statements anyway and concentrate more on owl:sameAs, subclass and
subproperty. Again, Falcons and Sindice and SWSE teams, do you use domain
and range statements when cleaning up the data that you crawled from the
Web?

I really like Hugh's idea of having a loose schema in general and adding
additional constraints as comments/optional constraints to the schema, so
that applications can decide whether they want to use them or not. But this
is sadly not supported by the RDF standards.

So, I'm still a bit undecided about leaving or removing the ranges and
domains. Maybe leave them, as they are likely not harmful and might be
useful for some use cases?

Cheers

Chris


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> On behalf of Dan Brickley
> Sent: Wednesday, 19 November 2008 14:09
> To: Pierre-Antoine Champin
> Cc: Paul Gearon; Semantic Web
> Subject: Re: Domain and range are useful Re: DBpedia 3.2 release,
> including DBpedia Ontology and RDF links to Freebase
> 
> 
> Pierre-Antoine Champin wrote:
> > Paul Gearon a écrit :
> >> While I'm here, I also noticed Tim Finin referring to "domain and
> range
> >> constraints". Personally, I don't see the word "constraint" as an
> >> appropriate description, since rdfs:domain and rdfs:range are not
> >> constraining in any way.
> >
> > They are constraining the set of interpretations that are models of
> your
> > knowledge base. Namely, you constrain Fido to be a person...
> >
> > But I grant you this is not exactly what most people expect from the
> > term "constraint"... I also had to do the kind of explainations you
> > describe...
> 
> 
> Yes, exactly.
> 
> In earlier (1998ish) versions of RDFS we called them 'constraint
> resources' (with the anticipation of using that concept to flag up new
> constructs from anticipated developments like DAML+OIL and OWL). This
> didn't really work, because anything that had a solid meaning was a
> constraint in this sense, so we removed that wording.
> 
> This is a very interesting discussion, wish I had time this week to
> jump
> in further.
> 
> I do recommend against using RDFS/OWL to express application/dataset
> constraints, while recognising that there's a real need for recording
> them in machine-friendly form. In the Dublin Core world, this topic is
> often discussed in terms of "application profiles", meaning that we
> want
> to say things about likely and expected data patterns, rather than
> doing
> what RDFS/OWL does and merely offering machine dictionary definitions
> of

AW: DBpedia 3.2 release, including DBpedia Ontology and RDF links to Freebase

2008-11-17 Thread Chris Bizer

Hi Hugh and Richard,

interesting discussion indeed. 

I think that the basic idea of the Semantic Web is that you reuse existing
terms or at least provide mappings from your terms to existing ones.

As DBpedia is often used as an interlinking hub between different datasets
on the Web, it should in my opinion clearly have a type b) ontology using
Richard's classification.

But what does this mean for WEB ontology languages?

Looking at the current discussion, I feel reassured that if you want to do
WEB stuff, you should not move beyond RDFS, or even aim lower and only use a
subset of RDFS (basically only rdf:type, rdfs:subClassOf and
rdfs:subPropertyOf) plus owl:sameAs. Anything beyond this seems to impose
overly tight restrictions, seems to be too complicated even for people with
fair Semantic Web knowledge, and seems to break immediately when people
start to set links between different schemata/ontologies.
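
To make the "RDFS subset plus owl:sameAs" idea concrete, a minimal sketch
using rdflib (the class, property and identity-link URIs are purely
illustrative and are not the actual DBpedia mappings):

    # Sketch of the lightweight style: only type statements, class/property
    # hierarchy links to an existing vocabulary, and identity links --
    # no domain/range restrictions. All URIs below are illustrative.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import FOAF, OWL, RDF, RDFS

    DBO = Namespace("http://dbpedia.org/ontology/")
    DBR = Namespace("http://dbpedia.org/resource/")

    g = Graph()
    g.add((DBO.Person, RDFS.subClassOf, FOAF.Person))
    g.add((DBO.birthName, RDFS.subPropertyOf, FOAF.name))
    g.add((DBR["Tim_Berners-Lee"], RDF.type, DBO.Person))
    g.add((DBR["Tim_Berners-Lee"], OWL.sameAs,
           URIRef("http://rdf.freebase.com/ns/en.tim_berners_lee")))

    print(g.serialize(format="turtle"))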

Dublin Core and FOAF went down this road. And maybe DBpedia should do the
same (meaning to remove most range and domain restrictions and only keep the
class and property hierarchy).

Can any of the ontology folks give me convincing use cases where the
current range and domain restrictions are useful?

(Validation does not count as WEB ontology languages are not designed for
validation and XML schema should be used instead if tight validation is
required).

If not, I would opt for removing the restrictions.

Cheers

Chris


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
behalf of Hugh Glaser
Sent: Monday, 17 November 2008 23:33
To: Richard Cyganiak
Cc: public-lod@w3.org; Semantic Web;
[EMAIL PROTECTED]
Subject: Re: DBpedia 3.2 release, including DBpedia Ontology and RDF links
to Freebase


Very nicely put, Richard.
We are opening up the discussion here of when to define one's own and when
to (re-)use from elsewhere.
I am a bit uncomfortable with the idea of "you should use a:b from c and d:e
from f and g:h from i..."
It makes for a fragmented view of my data, and might encourage me to use
things that do not capture exactly what I mean, as well as introducing
dependencies with things that might change, but over which I have no
control.
So far better to use ontologies of type (b) where appropriate, and define my
own of type (a), which will (hopefully) be nicely constructed, and easier to
understand as smallish artefacts that can be looked at as a whole.
Of course, this means we need to crack the infrastructure that does dynamic
ontology mapping, etc.
Mind you, unless we have the need, we are less likely to do so.
I also think that the comments about the restrictions being a characteristic
of the dataset for type (a), but more like comments on the world for type
(b) are pretty good.
Hugh

On 17/11/2008 20:09, "Richard Cyganiak" <[EMAIL PROTECTED]> wrote:



John,

Here's an observation from a bystander ...

On 17 Nov 2008, at 17:17, John Goodwin wrote:

> This is also a good example of where (IMHO) the domain was perhaps
> over specified. For example all sorts of things could have
> publishers, and not the ones listed here. I worry that if you reuse
> DBpedia "publisher" elsewhere you could get some undesired inferences.

But are the DBpedia classes *intended* for re-use elsewhere? Or do
they simply express restrictions that apply *within DBpedia*?

I think that in general it is useful to distinguish between two
different kinds of ontologies:

a) Ontologies that express restrictions that are present in a certain
dataset. They simply express what's there in the data. In this sense,
they are like database schemas: If "Publisher" has a range of
"Person", then it means that the publisher *in this particular
dataset* is always a person. That's not an assertion about the world,
it's an assertion about the dataset. These ontologies are usually not
very re-usable.

b) Ontologies that are intended as a "lingua franca" for data exchange
between different applications. They are designed for broad re-use,
and thus usually do not add many restrictions. In this sense, they are
more like controlled vocabularies of terms. Dublin Core is probably
the prototypical example, and FOAF is another good one. They usually
don't allow as many interesting inferences.

I think that these two kinds of ontologies have very different
requirements. Ontologies that are designed for one of these roles are
quite useless if used for the other job. Ontologies that have not been
designed for either of these two roles usually fail at both.

Returning to DBpedia, my impression is that the DBpedia ontology is
intended mostly for the first role. Maybe it should be understood more
as a schema for the DBpedia dataset, and not so much as a re-usable
set of terms for use outside of the Wikipedia context. (I might be
wrong, I was not involved in its creation.)

Richard





AW: ANN: DBpedia 3.2 release, including DBpedia Ontology and RDF links to Freebase

2008-11-17 Thread Chris Bizer

Hi Andreas,

we for sure want to do this, but also did not want to postpone the DBpedia
3.2 release any further.

So be assured that the upcoming public user interface for defining the
infobox-to-ontology mappings will include the possibility to reuse existing
classes and properties and that external classes and properties will be used
within the 3.3 release.

Defining the infobox-to-ontology mappings that we currently have was already
a lot of work (Anja thanks again), so please be patient with the
mappings/reuse of external ontologies.

Cheers

Chris

 

> -Original Message-
> From: Andreas Harth [mailto:[EMAIL PROTECTED]
> Sent: Monday, 17 November 2008 16:55
> To: Chris Bizer
> Cc: public-lod@w3.org; 'Semantic Web'; dbpedia-
> [EMAIL PROTECTED]; dbpedia-
> [EMAIL PROTECTED]
> Subject: Re: ANN: DBpedia 3.2 release, including DBpedia Ontology and
> RDF links to Freebase
> 
> Hi Chris,
> 
> Chris Bizer wrote:
> > 1. DBpedia Ontology
> >
> > DBpedia now features a shallow, cross-domain ontology, which has been
> > manually created based on the most commonly used infoboxes within
> Wikipedia
> great work!
> 
> One thing: what's the reason for creating your own classes rather
> than re-using or sub-classing existing ones (foaf:Person,
> geonames:Feature...)?  Same for properties (foaf:name, dc:date...).
> 
> Regards,
> Andreas.
> 
> --
> http://swse.deri.org/




ANN: DBpedia 3.2 release, including DBpedia Ontology and RDF links to Freebase

2008-11-17 Thread Chris Bizer

Hi all,

we are happy to announce the release of DBpedia version 3.2.  

The new knowledge base has been extracted from the October 2008 Wikipedia
dumps. Compared to the last release, the new knowledge base provides three
major improvements:


1. DBpedia Ontology

DBpedia now features a shallow, cross-domain ontology, which has been
manually created based on the most commonly used infoboxes within Wikipedia.
The ontology currently covers over 170 classes which form a subsumption
hierarchy and have 940 properties. The ontology is instantiated by a new
infobox data extraction method which is based on hand-generated mappings of
Wikipedia infoboxes to the DBpedia ontology. The mappings define
fine-grained rules on how to parse infobox values. The mappings also compensate
for weaknesses in the Wikipedia infobox system, like having different infoboxes
for the same class (currently 350 Wikipedia templates are mapped to 170
ontology classes), using different property names for the same property
(currently 2350 Wikipedia template properties are mapped to 940 ontology
properties), and not having clearly defined datatypes for property values.
Therefore, the instance data within the infobox ontology is much cleaner and
better structured than the infobox data within the DBpedia infobox dataset
that is generated using the old infobox extraction code. The DBpedia
ontology currently contains about 882,000 instances.

More information about the ontology is found at:
http://wiki.dbpedia.org/Ontology 
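
As a quick way to explore the new ontology, one can ask the SPARQL endpoint
for instances of one of its classes. A minimal sketch using the Python
SPARQLWrapper library (the Person class is just an example; any other
ontology class works the same way):

    # Minimal sketch: list a few instances of a DBpedia ontology class.
    # Assumes the SPARQLWrapper library is installed.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
        SELECT ?resource WHERE { ?resource a dbpedia-owl:Person } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["resource"]["value"])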


2. RDF Links to Freebase

Freebase is an open-license database which provides data about millions of
things from various domains. Freebase has recently released a Linked Data
interface to their content. As there is a big overlap between DBpedia and
Freebase, we have added 2.4 million RDF links to DBpedia pointing at the
corresponding things in Freebase. These links can be used to smush and fuse
data about a thing from DBpedia and Freebase.

For more information about the Freebase links see:
http://blog.dbpedia.org/2008/11/15/dbpedia-is-now-interlinked-with-freebase-
links-to-opencyc-updated/


3. Cleaner Abstracts

In the old DBpedia dataset, the abstracts for different languages sometimes
contained Wikipedia markup and other strange characters. For the
3.2 release, we have improved DBpedia's abstract extraction code which
results in much cleaner abstracts that can safely be displayed in user
interfaces. 


The new DBpedia release can be downloaded from:

http://wiki.dbpedia.org/Downloads32

and is also available via the DBpedia SPARQL endpoint at

http://dbpedia.org/sparql

and via DBpedia's Linked Data interface. Example URIs: 

http://dbpedia.org/resource/Berlin
http://dbpedia.org/page/Oliver_Stone
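
To see the Linked Data interface in action programmatically, here is a
minimal sketch using the Python requests library (the Accept header asks
for RDF/XML; the 303 redirect to the RDF data document is the usual Linked
Data behaviour):

    # Minimal sketch: dereference a DBpedia resource URI with content
    # negotiation. Assumes the 'requests' library is installed.
    import requests

    resp = requests.get(
        "http://dbpedia.org/resource/Berlin",
        headers={"Accept": "application/rdf+xml"},
        allow_redirects=True,   # the server typically answers with a 303 redirect
    )
    print(resp.url)                            # the RDF document actually served
    print(resp.headers.get("Content-Type"))    # should be an RDF media type
    print(resp.text[:200])                     # start of the RDF/XML description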

More information about DBpedia in general is found at:

http://wiki.dbpedia.org/About


Lots of thanks to everybody who contributed to the DBpedia 3.2 release! 

Especially:

1. Georgi Kobilarov (Freie Universität Berlin) who designed and implemented
the new infobox extraction framework. 
2. Anja Jentsch (Freie Universität Berlin) who contributed to implementing
the new extraction framework and wrote the infobox to ontology class
mappings. 
3. Paul Kreis (Freie Universität Berlin) who improved the datatype
extraction code. 
4. Andreas Schultz (Freie Universität Berlin) for generating the Freebase to
DBpedia RDF links.
5. Everybody at OpenLink Software for hosting DBpedia on a Virtuoso server
and for providing the statistics about the new DBpedia knowledge base.

Have fun with the new DBpedia knowledge base!

Cheers

Chris


--
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
[EMAIL PROTECTED]





OPEN POSITION: Move to Berlin, work on DBpedia

2008-10-22 Thread Chris Bizer



Hello,

we are happy to announce that Neofonie (http://www.neofonie.de), a Berlin
search engine company, has agreed to fund a researcher/developer to work on
DBpedia (http://dbpedia.org/About) for one year.

The researcher/developer will work half of his time directly together with
the DBpedia team at Freie Universität Berlin and will work the other half at
Neofonie.

At Freie Universität, he will contribute to the further development of the
DBpedia information extraction framework and will investigate approaches to
augment and fuse DBpedia with external Web data sources.

At Neofonie, he will develop innovative Wikipedia/DBpedia search and query
user interfaces based on a facet-browsing engine being developed by
Neofonie.

The ideal candidate should have:

+ a firm background in Computer Science and Semantic Web Technology
+ excellent programming skills (PHP, Java)
+ excellent knowledge of web-related languages and standards (RDF, SPARQL)
+ a sound understanding of human-computer interaction and interface design
+ experience with information extraction techniques

I think that the position offers a great opportunity to contribute to the
transfer of research results to early adopters within industry. You will be
given the chance to be part of two innovative but also cordial teams. After
the year, chances are high that you will be able to choose between
longer-term positions at Neofonie as well as at Freie Universität Berlin.

If you are interested in the position, please contact me via email for
additional details (including information about your skills and experience).

Cheers,

Chris

--
Prof. Dr. Chris Bizer
Freie Universität Berlin
Phone: +49 30 838 54057
Mail: [EMAIL PROTECTED]
Web: www.bizer.de






Open Archives Initiative has released OAI-ORE specification based on Linked Data principles

2008-10-19 Thread Chris Bizer
Hi all,

 

great news from the library and preprint server world: The Open Archives
Initiative (OAI, http://www.openarchives.org/) has released its new Object
Reuse and Exchange (OAI-ORE) specification for describing aggregations of
Web resources. Such aggregations can for instance be different versions of a
paper on a preprint server, the issues of a journal, the chapters of a book,
a collection of photos on flickr, or a series of blog posts.

 

By providing a way to describe such aggregations, the new OAI-ORE
specification aims at moving the library world closer to the Web and at
enabling Web clients, such as the crawlers of search engines like Google or
Yahoo or generic Web data browsers like Tabulator or Marbles, to do smarter
things with metadata about publications.

 

The OAI-ORE specification is built on the Linked Data and Cool URIs
principles, meaning that all objects of interest are identified with HTTP
URIs, these URIs are dereferencable to RDF descriptions, and it is thus
possible to interlink data between different repositories. Metadata about
aggregations is represented using a mix of well known vocabularies such as
Dublin Core or FOAF.

 

For more information about OAI-ORE please refer to:

 

1. ORE User Guide http://www.openarchives.org/ore/1.0/primer

2. ORE Specifications Table of Contents
http://www.openarchives.org/ore/1.0/toc

3. ORE Release Note
http://groups.google.com/group/oai-ore/browse_thread/thread/dccb1daef89fabf0

 

With its broad scope, the OAI-ORE specification clearly overlaps with
ongoing work around POWDER (http://www.w3.org/2007/powder/) and SIOC
(http://sioc-project.org/) and it will be interesting to see how things play
together.

 

The classic OAI metadata harvesting protocol (OAI-PMH) is used by hundreds
of libraries and archives to exchange metadata about more than 9 billion
documents and books. I think it is very promising from the Web perspective
that OAI dropped OAI-PMH's point-to-point data exchange paradigm in favor
of the open Web architecture in OAI-ORE. I also think that the deployment
of OAI-ORE within the libraries community could develop into a major step
forward for the Semantic Web as it might extend the Semantic Web with
comprehensive data about another domain.

 

Cheers,

 

Chris

 

 

 



CfP: IJSWIS Special Issue on Linked Data

2008-10-13 Thread Chris Bizer


Hello all,

the International Journal on Semantic Web and Information Systems
(http://www.ijswis.org/) seeks contributions to a special issue on 
Linked Data.

Special issue webpage: http://linkeddata.org/docs/ijswis-special-issue


Introduction
---

The Web is increasingly understood as a global information space, consisting
not just of linked documents, but also of linked data. In addition to the
maturing of the Semantic Web technology stack, a major catalyst in this
transition has been the application of the Linked Data principles [1],
hand-in-hand with the publication and dense, mutual interlinking of
large-scale data sets distributed across the Web [2]. This movement has
brought the vision of a “Web of Data” closer to realization than ever before.

However, the emergence of Linked Data on a Web scale raises numerous novel
and significant research challenges that touch on both the “semantics” and
“Web” aspects of the Semantic Web vision. These challenges are diverse in
nature and range from algorithmic approaches for linking and fusing Web data,
through the provision of user applications on top of distributed and
heterogeneous data sets, to social and business questions related to the
production and consumption of Linked Data. Building on successful events in
the field, such as the 1st Workshop on Linked Data on the Web (LDOW2008)
[3], the goal of this special issue is to solicit high quality, original
research contributions on all aspects of Linked Data, thereby capturing the
state of the art and stimulating further developments in this and related
areas.


Topics
--

Topics of interest for this IJSWIS special issue include, but are not
limited to:

+ Data Linking and Fusion
   - Identity resolution 
   - Linking algorithms and heuristics 
   - Data fusion and integration 

+ Linked Data Application Architectures
   - Crawling, caching, and querying Linked Data from the Web
   - Evaluating the quality, trustworthiness, and task-appropriateness
     of Web data
   - Reasoning with and over Web data
   - User-facing applications that exploit Linked Data
  - Linked Data browsers and analysis interfaces
  - Linked Data search engines and query interfaces
  - User interaction and interface issues in Linked Data applications
   - Publishing legacy data sources as Linked Data on the Web
   - Publishing user-generated content as Linked Data on the Web 
 
+ Business Models and Social Aspects
   - Business models for Linked Data publishing and consumption
   - Licensing and other legal issues in Linked Data publishing
   - Authority and provenance tracking
   - Privacy and the Web of Data


Submission Process
--

Submissions to this special issue should follow the journal's guidelines for
submission
(http://www.idea-group.com/journals/details.asp?ID=4625&v=guidelines). After
submitting a paper, please also inform the guest editors by email,
indicating the paper ID assigned by the submission system. Papers must be of
high quality and should clearly state the technical issue(s) being addressed
as related to Linked Data on the Web. Research papers should present a proof
of concept for any novel technique they are proposing. Wherever possible,
submissions should demonstrate the contribution of the research by reporting
on a systematic evaluation of the work. If a submission is based on a prior
publication in a workshop or conference, the journal submission must involve
substantial advance (a minimum of 30%) in conceptual terms as well as in
exposition (e.g., more comprehensive testing/evaluation/validation or
additional applications/usage). If this applies to your submission, please
explicitly reveal the relevant previous publications.

All papers must be submitted by January 7, 2009. The recommended length of
submitted papers is between 5,500 and 8,000 words. All papers are subject to
peer review performed by at least three established researchers drawn from a
panel of experts selected for this special issue. Accepted papers will
undergo a second cycle of revision and reviewer feedback. Please submit
manuscripts as a PDF file using the online submission system.

The International Journal on Semantic Web and Information Systems (IJSWIS)
is the first Semantic Web journal to be included in the Thomson ISI citation
index. More information on the journal can be found at
http://www.ijswis.org. 


Important Dates
---

+ January 7, 2009: Submission deadline
+ February 7, 2009: Notification of acceptance
+ March 7, 2009: Camera-ready papers due
+ Second quarter of 2009: Publication 


Special Issue Organizing Committee
--

Editor in Chief: 
+ Amit Sheth (Kno.e.sis Center, Wright State University, USA)

Guest Editors:
+ Chris Bizer (Freie Universität Berlin, Germany)
+ Tom Heath (Talis Information Ltd, United Kingdom)
+ Martin Hepp (Universität der Bundeswehr München, Germany)


References
---

[1] Berners-Lee, 

Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-30 Thread Chris Bizer


Hi Orri,

> It is my feeling that RDF has a dual role:  
> 1. interchange format:  This is like what XML does, except that RDF has
more semantics and expressivity.  
> 2: Database storage format for cases where data must be integrated and is
too heterogenous to easily 
> fall into one relational schema.  This is for example the case in the open
web conversation and social 
> space.  The first case is for mapping, the second for warehousing.   
> Aside this, there is potential for more expressive queries through  the
query language dealing with
> inferencing, like  subclass/subproperty/transitive etc.  These do not go
very well with SQL views.

I cannot agree more with what you say :-)

We are seeing the first RDF use case emerge within initiatives like the
Linking Open Data effort, where, besides being more expressive, RDF also
plays to its strength of providing data links between records in different
databases.

Talking with people from industry, I get the feeling that more and more
people also understand the second use case and that RDF is increasingly used
as a technology for something like "poor man's data integration". You don't
have to spend a lot of time and money on designing a comprehensive data
warehouse. You just throw data with different schemata from different
sources together and instantly get the benefit that you can browse and query
the data and that you have proper provenance tracking (using Named Graphs).
Depending on how much data integration you need, you then start to apply
some identity resolution and schema mapping techniques. We have been talking
to some pharma and media companies that have done data warehousing for years
and they all seem to be very interested in this quick and dirty approach.
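
A minimal sketch of what I mean, using one named graph per source so that
provenance is kept (rdflib's Dataset here; the source URIs and data are
illustrative):

    # "Poor man's data integration": one named graph per source, so every
    # statement keeps its provenance. Source URIs and data are illustrative.
    from rdflib import Dataset, Literal, URIRef
    from rdflib.namespace import FOAF

    person = URIRef("http://example.org/person/42")
    ds = Dataset()

    crm = ds.graph(URIRef("http://example.org/source/crm-export"))
    crm.add((person, FOAF.name, Literal("Alice Example")))

    crawl = ds.graph(URIRef("http://example.org/source/web-crawl"))
    crawl.add((person, FOAF.mbox, URIRef("mailto:alice@example.org")))

    # Query across all sources, reporting which graph each statement came from.
    for s, p, o, g in ds.quads((person, None, None, None)):
        print(g, p, o)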

For both use cases, inferencing is a nice add-on but not essential. Within
the first use case, inferencing usually does not work as data published by
various autonomous sources tends to be too dirty for reasoning engines.

Cheers,

Chris


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
behalf of Orri Erling
Sent: Tuesday, 30 September 2008 00:16
To: 'Seaborne, Andy'; 'Story Henry'
Cc: [EMAIL PROTECTED]
Subject: RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena
TDB, D2R Server, and MySQL


>From Henry Story:

>
>
> As a matter of interest, would it be possible to develop RDF stores
> that optimize the layout of the data by analyzing the queries to the
> database? A bit like a Java Just In Time compiler analyses the usage
> of the classes in order to decide how to optimize the compilation.

>From Andy Seaborne:

On a similar note, by mining the query logs it would be possible to create
parameterised queries and associated plan fragments without the client
needing to notify the server of the templates.  Coupled with automatically
calculating possible materialized views or other layout optimizations, the
poor, overworked client application writer doesn't get brought into
optimizing the server.

Andy

>
 
Orri here:

With the BSBM workload, using parametrized queries at a small scale saves
roughly 1/3 of the execution time.  It is possible to remember query plans
and to notice if the same query text is submitted with only changes in
literal values.  If the first query ran quickly, one may presume the query
with substitutions will also run quickly.  There are of course exceptions.
But detecting these will mean running most of the optimizer cost model and
will eliminate any benefit from caching.
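
As a small client-side illustration of the parameterized-query idea, a
minimal sketch using rdflib's prepareQuery; it only demonstrates reusing one
parsed query with changing literal bindings on the client, not the
server-side plan caching discussed above:

    # Parse the query once, then execute it with different bindings instead
    # of re-parsing a new query string for every literal value.
    from rdflib import Graph, Literal
    from rdflib.plugins.sparql import prepareQuery

    g = Graph()
    g.parse(data="""
        @prefix ex: <http://example.org/> .
        ex:berlin  ex:label "Berlin" .
        ex:germany ex:label "Germany" .
    """, format="turtle")

    q = prepareQuery("SELECT ?s WHERE { ?s ?p ?o . FILTER(?o = ?value) }")

    for value in (Literal("Berlin"), Literal("Germany")):
        for row in g.query(q, initBindings={"value": value}):
            print(value, "->", row.s)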


The other optimizations suggested have a larger upside but are far harder.  
I would say that if we have a predictable workload, then mapping 
relational to RDF is a lot easier than expecting the DBMS to figure out
materialized views to do the same.  If we do not have a predictable
workload, then making too many materialized views based on transient usage
patterns is a large downside because it grows the database, meaning less
working set.  The difference between in memory random access and a random
access with disk is about 5000 times.  Plus there is a high cost to making
the views, thus a high penalty for a wrong guess. And if it is hard enough
to figure out where a query plan goes wrong with a given schema, it is
harder still to figure it out with a schema that morphs by itself.

In the RDB world, for example Oracle recommends saving optimizer statistics
from the  test  environment and using these in the production environment
just so the optimizer does not get creative.  Now this is the  essence of
wisdom for OLTP but we are not talking OLTP with RDF. 

If there is a history of usage and this history is steady and the dba can
confirm it as being a representative sample, then automatic materializing
of joins is a real  possibility.  Doing this spontaneously would lead to
erratic response times, though.  For anything online, the accent is more on
predictable throughput than peak throughput.  

The BSBM query mix does lend itself 

AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-24 Thread Chris Bizer

Hi Kingsley and Paul,

Yes, I completely agree with you that different storage solutions fit
different use cases and that one of the main strengths of the RDF data model
is its flexibility and the possibility to mix different schemata.

Nevertheless, I think it is useful to give application developers an
indicator about what performance they can expect when they choose a specific
architecture, which is what the benchmark is trying to do.

We plan to run the benchmark again in January and it would be great to also
test Tucana/Kowari/Mulgara in this run.

As the performance of RDF stores is constantly improving, let's also hope
that by then the picture will no longer look so bad for them.

Cheers,

Chris


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf
of Kingsley Idehen
Sent: Wednesday, 24 September 2008 20:57
To: Paul Gearon
Cc: [EMAIL PROTECTED]; public-lod@w3.org
Subject: Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena
TDB, D2R Server, and MySQL


Paul Gearon wrote:
> On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <[EMAIL PROTECTED]> wrote:
>   
>> On 09/19/08/09/08 23:12 +0200, Orri Erling wrote:
>> 
 Has has there been any analysis on whether there is a *fundamental*
 reason for such performance difference? Or is it simply a question of
 "maturity"; in other words, relational db technology has been around
for a
 very long time and is very mature, whereas RDF implementations are
still
 quite recent, so this gap will surely narrow ...?
 
>>> This is a very complex subject.  I will offer some analysis below, but
>>> this I fear will only raise further questions.  This is not the end of
the
>>> road, far from it.
>>>   
>> As far as I understand, another issue is relevant: this benchmark is
>> somewhat unfair as the relational stores have one advantage compared to
the
>> native triple stores: the relational data structure is fixed (Products,
>> Producers, Reviews, etc with given columns), while the triple
representation
>> is generic (arbitrary s,p,o).
>> 
>
> This point has an effect on several levels.
>
> For instance, the flexibility afforded by triples means that objects
> stored in this structure require processing just to piece it all
> together, whereas the RDBMS has already encoded the structure into the
> table. Ironically, this is exactly the reason we
> (Tucana/Kowari/Mulgara) ended up building an RDF database instead of
> building on top of an RDBMS: The flexibility in table structure was
> less efficient that a system that just "knew" it only had to deal with
> 3 columns. Obviously the shape of the data (among other things)
> dictates what it is the better type of storage to use.
>
> A related point is that processing RDF to create an object means you
> have to move around a lot in the graph. This could mean a lot of
> seeking on disk, while an RDBMS will usually find the entire object in
> one place on the disk. And seeks kill performance.
>
> This leads to the operations used to build objects from an RDF store.
> A single object often requires the traversal of several statements,
> where the object of one statement becomes the subject of the next.
> Since the tables are typically represented as
> Subject/Predicate/Object, this means that the main table will be
> "joined" against itself. Even RDBMSs are notorious for not doing this
> efficiently.
>
> One of the problems with self-joins is that efficient operations like
> merge-joins (when they can be identified) will still result in lots of
> seeking, since simple iteration on both sides of the join means
> seeking around in the same data. Of course, there ARE ways to optimize
> some of this, but the various stores are only just starting to get to
> these optimizations now.
>
> Relational databases suffer similar problems, but joins are usually
> only required for complex structures between different tables, which
> can be stored on different spindles. Contrast this to RDF, which needs
> to do many of these joins for all but the simplest of data.
>
>   
>> One can question whether such flexibility is relevant in practice, and if
>> so, one may try to extract such structured patterns from data on-the-fly.
>> Still, it's important to note that we're comparing somewhat different
things
>> here between the relational and the triple representation of the
benchmark.
>> 
>
> This is why I think it is very important to consider the type of data
> being stored before choosing the type of storage to use. For some
> applications an RDBMS is going to win hands down every time. For other
> applications, an RDF store is definitely the way to go. Understanding
> the flexibility and performance constraints of each is important. This
> kind of benchmarking helps with that. It also helps identify where RDF
> databases need to pick up their act.
>
> Regards,
> Paul Gearon
>
>
>   
Paul,

You make valid points, the problem here is that the be

Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-17 Thread Chris Bizer


Hi all,

over the last weeks, we have extended the Berlin SPARQL Benchmark 
(BSBM) to a multi-client scenario, fine-tuned the benchmark dataset 
and the query mix, and implemented a SQL version of the benchmark in 
order to be able to compare SPARQL stores with classical SQL stores.


Today, we have released the results of running the BSBM Benchmark 
Version 2 against:


+ three RDF stores (Virtuoso Version 5.0.8, Sesame Version 2.2, Jena 
TDB Version 0.53) and
+ two relational database-to-RDF wrappers (D2R Server Version 0.4 and 
Virtuoso - RDF Views Version 5.0.8).


for datasets ranging from 250,000 triples to 100,000,000 triples.

In order to set the SPARQL query performance into context we also 
report the results of running the SQL version of the benchmark against 
two relational database management systems (MySQL 5.1.26 and 
Virtuoso - RDBMS Version 5.0.8).


A comparison of the performance for a single client working against 
the stores is found here:


http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html#comparison

A comparison of the performance for 1 to 16 clients simultaneously 
executing query mixes against the stores is found here:


http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html#multiResults

The complete benchmark results including the setup of the experiment 
and the configuration of the different stores is found here:


http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html

The current specification of the Berlin SPARQL Benchmark is found 
here:


http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/20080912/

It is interesting to see:

1. that relational database to RDF wrappers generally outperform RDF 
stores for larger dataset sizes.
2. that no store outperforms the others for all queries and dataset 
sizes.
3. that the query throughput still varies widely within the 
multi-client scenario.
4. that the fastest RDF store is still 7 times slower than a 
relational database.


Thanks a lot to

+ Eli Lilly and Company and especially Susie Stephens for making this 
work possible through a research grant.
+ Orri Erling, Andy Seaborne, Arjohn Kampman, Michael Schmidt, Richard 
Cyganiak, Ivan Mikhailov, Patrick van Kleef, and Christian Becker for 
their feedback on the benchmark design and their help with configuring 
the stores and running the benchmark experiment.


Without all your help it would not have been possible to conduct this
experiment.


We highly welcome feedback on the benchmark design and the results of 
the experiment.


Cheers,

Chris Bizer and Andreas Schultz

--
Prof. Dr. Chris Bizer
Freie Universität Berlin
Phone: +49 30 838 55509
Mail: [EMAIL PROTECTED]
Web: www.bizer.de 





AW: Linking heuristics

2008-09-17 Thread Chris Bizer
Hi Daniel,

 

some material about linking heuristics and their application is collected at

 

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/Equival
enceMining

 

 

If you or anybody else on the list knows of more, it would be great if
you could add it to the wiki page so that the community keeps having a
central starting point for this important topic.

 

Cheers

 

Chris

 

 

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf
of Daniel Schwabe
Sent: Wednesday, 17 September 2008 01:08
To: public-lod@w3.org
Subject: Linking heuristics

 

Hi all,

Has anybody compiled a list of linking heuristics that have been employed
(so far) when connecting the various sources in(to) the LoD cloud?
This might be a useful piece of knowledge for practitioners wishing to link
into the cloud...
Another interesting issue is how could we promote new linkages *from*
existing sources *to* new sources joining in?

Cheers
Daniel



-- 


Daniel Schwabe
Tel:+55-21-3527 1500 r. 4356
Fax: +55-21-3527 1530
http://www.inf.puc-rio.br/~dschwabe

Dept. de Informatica, PUC-Rio
R. M. de S. Vicente, 225
Rio de Janeiro, RJ 22453-900, Brasil

 



AW: New LOD Cloud - Please send us links to missing data sources

2008-09-16 Thread Chris Bizer

Hi Kingsley,

lots of interesting stuff. I especially like to see the first Freebase
wrapper in action.

I'm very open to including live wrappers in the cloud as long as the
generated data is somehow interlinked with other datasets in the cloud.

It would be great if you could tell us which wrappers generate links to
other datasets or provide URIs that are referenced by other datasets in the
cloud.
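
For a quick check, a minimal sketch with rdflib that dereferences one data
document of a source and lists the foreign hosts its object URIs point to
(the example document URI is illustrative, the document must be reachable,
and the host-based test is deliberately crude):

    # Crude interlinking check: parse one data document and collect the hosts
    # of all object URIs that lie outside the source's own namespace.
    from urllib.parse import urlsplit
    from rdflib import Graph, URIRef

    g = Graph()
    g.parse("http://dbpedia.org/data/Berlin.rdf")   # one data document of the source

    external_hosts = set()
    for s, p, o in g:
        if isinstance(o, URIRef):
            host = urlsplit(str(o)).netloc
            if host and not host.endswith("dbpedia.org"):
                external_hosts.add(host)

    print("links into other namespaces:", sorted(external_hosts))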

Thanks a lot for your help.

Cheers

Chris

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf
of Kingsley Idehen
Sent: Tuesday, 16 September 2008 17:12
To: Chris Bizer
Cc: public-lod@w3.org; Anja Jentzsch
Subject: Re: New LOD Cloud - Please send us links to missing data sources


Chris Bizer wrote:
> Hi all,
>
> Anja and I are currently updating the LOD cloud for the ESW wikipage. 
> Draft attached.
>
> Up till now we have added:
>
> 1.CrunchBase
> 2. LinkedMDB
> 3. YAGO
> 4. UMBEL
> 5. the PubGuide
Chris,

Since you include wrappers (e.g. Flickr) what about other wrappers (what 
we call Proxy URIs that produce Linked Data using our Sponger Web Service)?

We have wrappers for about 30 different data sources as per:

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSet
s  
(notice the new table covering our cartridges with example URIs that 
includes: CrunchBase, Freebase, TheyWorkForYou, XBRL, and many others).

Kingsley
>
> It is nice to see that fitting everything into one diagram is getting 
> increasingly difficult as the cloud grows :-)
>
> Did we forget any new data sources or links between data sources?
>
> As discussed before: A data source qualifies for the cloud, if the 
> data is available via dereferencable URIs and if the data source is 
> interlinked with at least one other source (meaning it references URIs 
> within the namespace of the other source).
>
> Any feedback highly welcome.
>
> Cheers
>
> Chris
>
>
>
>
> -- 
> Prof. Dr. Chris Bizer
> Freie Universität Berlin
> Phone: +49 30 838 55509
> Mail: [EMAIL PROTECTED]
> Web: www.bizer.de
>
> 
>
>


-- 


Regards,

Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software Web: http://www.openlinksw.com








Re: List Policy for Job Advertisments

2008-08-29 Thread Chris Bizer


Hi Ian,

I would say: Sure, no problem as long as the jobs are Linked Data 
related.


Cheers

Chris


--
Prof. Dr. Chris Bizer
Freie Universität Berlin
Phone: +49 30 838 55509
Mail: [EMAIL PROTECTED]
Web: www.bizer.de

- Original Message - 
From: "Ian Davis" <[EMAIL PROTECTED]>

To: 
Sent: Friday, August 29, 2008 4:00 PM
Subject: List Policy for Job Advertisments



Hi all,

I was wondering what the list's policy was on posting relevant job
advertisments?

Ian






Re: BSBM With Triples and Mapped Relational Data in Virtuoso

2008-08-07 Thread Chris Bizer


Hi Orri and Ivan,


Consequently, we need to show that mapping can outperform an RDF
warehouse, which is what we'll do here.


Yes. I was already guessing for a while that SPARQL against RDF-mapped 
relational DBs should be faster than SPARQL against triple stores. 
With D2R Server it turned out that some queries are much faster, but
also that D2R Server really performs badly on others (especially Q5).
The bad performance with some queries was no surprise as there is
still lots of room for improvement in D2R Server's SPARQL-to-SQL query
rewriting algorithm.
Another observation was that the distance between native RDF stores 
and RDF-mapped RDBs increases with dataset size.
So it looks like, if you have more than 50M triples and schemata
that somehow fit into an RDB, you should go for the RDB-to-RDF mapping solution.



We also see that the advantage of mapping can be further increased
by more compiler optimizations, so we expect in the end mapping will
lead RDF warehousing by a factor of 4 or so.


Being able to show a factor of 4 on all dataset sizes would be very 
interesting!



Suggestions for BSBM

* Reporting Rules. The benchmark spec should specify a form for
 disclosure of test run data, TPC style. This includes things like
 configuration parameters and exact text of queries. There should
 be accepted variants of query text, as with the TPC.


We have started collecting stuff that should go into the 
full-disclosure report in section 6.2 of the benchmark spec 
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html#reporting 
but did not have the time to define a proper format for this yet (I 
guess we will have some XML format). We will define the format for 
version 2 of the benchmark, which will be released together with 
updated results in about 3-4 weeks.


If you think that there is something missing from this list, please 
let us know.



* Multiuser operation. The test driver should get a stream number as
 parameter, so that each client makes a different query sequence.
 Also, disk performance in this type of benchmark can only be
 reasonably assessed with a naturally parallel multiuser workload.


Yes. This is already on our todo list and will also be part of the 
next release.



* Add business intelligence. SPARQL has aggregates now, at least
 with Jena and Virtuoso, so let's use these. The BSBM business
 intelligence metric should be a separate metric off the same data.
 Adding synthetic sales figures would make more interesting queries
 possible. For example, producing recommendations like "customers
 who bought this also bought xxx."


Hmm, yes and no. I would love to extend the benchmark with a BI query 
mix, but aggregates are not yet an official part of SPARQL. Our goal 
with the benchmark was to define a tool to compare stores that 
implement the current SPARQL specs but not to fix these specs. Thus, 
we stayed within the boundaries of the current spec and of course ran into
all the known problems of SPARQL (no aggregates, no free-text search,
no proper negation). All these things were discussed at the SPARQL 2 
BOF at WWW2008 and I hope that they are all on Ivan Herman's list for 
the charter of a new SPARQL WG.



* For the SPARQL community, BSBM sends the message that one ought to
 support parameterized queries and stored procedures. This would be
 a SPARQL protocol extension; the SPARUL syntax should also have a
 way of calling a procedure. Something like select proc (??, ??)
 would be enough, where ?? is a parameter marker, like ? in
 ODBC/JDBC.


Also a great idea and maybe something Ivan does not have on his list 
yet.



* Add transactions. Especially if we are contrasting mapping vs.
 storing triples, having an update flow is relevant. In practice,
 this could be done by having the test driver send web service
 requests for order entry and the SUT could implement these as
 updates to the triples or a mapped relational store. This could
 use stored procedures or logic in an app server.


In principle yes, but we also wanted to design a benchmark that some 
current RDF stores are able to run.
If I look at the current data load times of the SUTs, I'm not so sure
that they like update streams ;-)


But I agree that update streams are clearly something that we should 
have in the future.



Comments on Query Mix

The time of most queries is less than linear to the scale factor. Q6
is an exception if it is not implemented using a text index. Without
the text index, Q6 will inevitably come to dominate query time as the
scale is increased, and thus will make the benchmark less relevant at
larger scales.


You are right and it is again a problem of us trying to stay within the
boundaries of the SPARQL spec.
No sane person would use a regex for this kind of free-text search, 
but SPARQL only offers the regex function and nothing else.


Maybe we should be a bit less strict here and allow proprietary
variants of Q6 until SPARQL gets fixed.
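
To illustrate what such a variant could look like, two simplified sketches
(not the exact BSBM Q6; the search word is made up, and bif:contains is
Virtuoso-specific, standing in for whatever full-text predicate a given
store offers):

    # Portable SPARQL 1.0 formulation: regex over labels, which forces a scan.
    PORTABLE_Q6 = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?product ?label WHERE {
            ?product rdfs:label ?label .
            FILTER regex(?label, "waterproof", "i")
        }
    """

    # Proprietary variant using a store's text index (here: Virtuoso's
    # bif:contains); other stores expose their text index differently.
    FULLTEXT_Q6 = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?product ?label WHERE {
            ?product rdfs:label ?label .
            ?label bif:contains "waterproof" .
        }
    """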



Next

We include the sources of o

Re: Linked Movie DataBase

2008-08-01 Thread Chris Bizer


Hi Oktie and Mariano,

really nice work and clearly what Tim Berners-Lee would call "Semantic 
Web done right" :-)


We would love to set links from DBpedia into LinkedMDB.

Would it be possible for you to send us a file with the owl:sameAs links
between LinkedMDB and DBpedia?


Like Yves and Juan, I'm also very interested in hearing more about
ODDLinker.


Cheers

Chris


--
Prof. Dr. Chris Bizer
Freie Universität Berlin
Phone: +49 30 838 55509
Mail: [EMAIL PROTECTED]
Web: www.bizer.de

- Original Message - 
From: "Oktie Hassanzadeh" <[EMAIL PROTECTED]>

To: 
Cc: "Mariano Consens" <[EMAIL PROTECTED]>
Sent: Thursday, July 31, 2008 6:58 PM
Subject: Linked Movie DataBase




Greetings everyone!

We are pleased to announce the release of the preview version of the 
"Linked Movie DataBase" (LinkedMDB): http://www.linkedmdb.org


LinkedMDB aims at publishing the first open linked data dedicated to 
movies. It currently contains over three million RDF triples with 
hundreds of thousands of RDF links to other LOD project data sources 
and movie-related websites. Please check the LinkedMDB website for a 
description of the data source, a brief overview of the interlinking 
methodology used, and detailed statistics.


We welcome any kind of feedback and comment from the LOD community, 
either off-list or on the list (if of interest to everyone).


And, if you liked this project, please don't forget to follow this 
link and vote for us!:

http://triplify.org/Challenge/Nominations

Cheers,
Oktie Hassanzadeh and Mariano Consens
[EMAIL PROTECTED]

PS. To know more about us and our research, please check our 
homepages:

http://www.cs.toronto.edu/~oktie
http://www.cs.toronto.edu/~consens







RFC: Berlin SPARQL Benchmark

2008-07-30 Thread Chris Bizer


Hi all,

The SPARQL query language and the SPARQL protocol are implemented by a 
growing number of storage systems and are used within enterprise and 
open web settings. As SPARQL is taken up by the community there is a 
growing need for benchmarks to compare the performance of storage 
systems that expose SPARQL endpoints via the SPARQL protocol.


We have been working over the last week on such a benchmark called the 
Berlin SPARQL Benchmark (BSBM). The benchmark is built around an 
e-commerce use case in which a set of products is offered by different 
vendors and consumers have posted reviews about products. The 
benchmark query mix illustrates the search and navigation pattern of a 
consumer looking for a product.


We have also run the initial version of the benchmark against Sesame, 
Virtuoso, Jena SDB  and against D2R Server, a relational 
database-to-RDF wrapper. The stores were benchmarked with datasets 
ranging from 50,000 triples to 100,000,000 triples.


Our current benchmark spec as well as the results of our initial 
experiments are found at:


http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/

In order to improve the Berlin SPARQL benchmark, it would be great to 
get feedback from the community on the benchmark specification. So if 
you think we have missed something essential or if you have ideas for 
further improvements, please let us know by replying to this mail or 
by contacting us directly.


Also note that there is ongoing work on a second SPARQL benchmark: The 
SP2B SPARQL Performance Benchmark 
(http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B). 
Compared with the BSBM benchmark, SP2B focuses more on testing 
different storage layouts and RDF data processing approaches while we 
try to be strictly use case driven.


We are looking forward to your feedback :-)

Cheers

Chris and Andreas

--
Chris Bizer
Freie Universität Berlin
Phone: +49 30 838 55509
Mail: [EMAIL PROTECTED]
Web: www.bizer.de 





Re: Southampton Pub data as linked open data

2008-07-30 Thread Chris Bizer


Hi Bijan and Richard,

I think it would be helpful for this discussion to distinguish a bit between 
the different use cases of Semantic Web technologies.


Looking back at the developments over the last years, I think there are two 
general types of use cases:


1. Sophisticated, reasoning-focused applications which use an expressive 
ontology language and which require sound formal semantics and consistent 
ontologies in order to deliver their benefits to the user. In order to keep 
things consistent, these applications usually only work with data from a 
small set of data sources. In order to be able to apply sophisticated 
reasoning mechanisms, these applications usually only work with small 
datasets.


2. The general open Web use case where many information providers use 
Semantic Web technologies to publish and interlink structured data on the 
Web. Within this use case, the benefits for the user mainly come from the 
large amounts of Web-accessible data and the ability to discover related 
information from other data sources by following RDF links.


For each type of the use cases, there is usually a different set of 
technologies applied. OWL and classic heavy-weight reasoning for the first 
use case. HTTP, RDF, RDFS and light-weight smushing techniques for the 
second use case.


In the first use case, people think in terms of "ontologies"; for instance,
ontologies are a basic concept in OWL2. In the second use case, classes and 
properties are mixed from different vocabularies as people see fit and are 
related to each other by RDF links.


The second use case is inspired by the Web 2.0 movement and aims at
extending the web with a data commons into which *many* people publish data.

As it is not very likely that all these people will be logicians and 
understand (or are interested in) the formal semantics behind the things 
they do, people (including me) working on the second use case are often a 
bit critical of overly tight formal semantics and of extended public
discussions about minor details that arise from some specs.


These discussions have been a major obstacle to deploying the Semantic Web
over the last years as they drive people away from using the 
technologies. I think that the normal Web developer will never bother going 
into the details of OWL (DL, 2 or whatever version). Fearing to do something 
wrong and state something that was not intended, common Web developers 
usually prefer not to touch these languages.


I'm personally convinced that we can do very cool things just with HTTP, RDF 
and RDFS for now and I see the current developments around Semantic Web 
browsers like Tabulator or Marbles, Semantic Web search engines like Sindice 
or Falcons and the growing number of people publishing Linked Data on the 
Web as clear indicators for this.


So why not be a bit more specific about the different use cases of the 
technologies and tell data publishers that it is OK to just use RDFS and 
that they do not have to care about the complicated details that arise from 
the different OWL specs?


Another idea along this line would be to rename OWL2 to Ontology 
Interchange Format (OIF). The Web rules language is already called Rules 
Interchange Format (RIF). Looking at the current OWL2 spec, I get the 
feeling that the working group is designing a language for exchanging 
ontologies between knowledge-based systems and that requirements from use 
case 2 do not play a very important role. Thus renaming the language could 
make the use case clearer and could be helpful for marketing the Semantic 
Web to Web developers who have understood the benefits and limitations of 
microformats and now look for a better way to publish structured data on the 
Web.


Cheers

Chris


--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de
- Original Message - 
From: "Jens Lehmann" <[EMAIL PROTECTED]>

To: "Richard Cyganiak" <[EMAIL PROTECTED]>
Cc: "Bijan Parsia" <[EMAIL PROTECTED]>; "John Goodwin" 
<[EMAIL PROTECTED]>; "Chris Wallace" 
<[EMAIL PROTECTED]>; ; <[EMAIL PROTECTED]>

Sent: Wednesday, July 30, 2008 7:47 AM
Subject: Re: Southampton Pub data as linked open data





Hello,

Richard Cyganiak schrieb:


Bijan, Knud, Bernard, thanks for the clarification.

I'm indeed surprised! Subclassing rdfs:label is okay in RDFS, and it is
okay in OWL Full, but it is not allowed in OWL DL.

The RDF consumers I'm working on (RDF browsers and the Sindice engine)
don't care if you're in OWL DL or not, so I'm tempted to argue that it
doesn't matter much for RDF publishing on the Web. (IME, on the open
Web, trust and provenance are much larger issues than inference, and I
don't believe that the open Web will ever be OWL DL, so why bother.)


Apart from the subject of this discussion, I find such general
statements ve

Re: Vapour 2.0, a Linked Data and RDF vocabularies validator

2008-07-17 Thread Chris Bizer


Hi Diego and Sergio,

thanks a lot for publishing this great tool!

I think having this validator is really a big step forward for the Web 
of Data, as we now have a central place to which we can point data 
publishers so they can check their sites, and we don't have to go through 
all the issues on a one-to-one basis as commonly happened in the past.


I have put a news item about Vapour onto the LOD wiki page

http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

and have also added the service to the "Testing and Debugging" section 
of the "How to Publish Linked Data" tutorial.


http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/#testing

Cheers

Chris


--
Prof. Dr. Chris Bizer
Freie Universität Berlin
Phone: +49 30 838 55509
Mail: [EMAIL PROTECTED]
Web: www.bizer.de

- Original Message - 
From: "Diego Berrueta" <[EMAIL PROTECTED]>

To: "Linking Open Data" 
Sent: Thursday, July 17, 2008 3:23 PM
Subject: ANN: Vapour 2.0, a Linked Data and RDF vocabularies validator




We're glad to announce Vapour 2.0, a validator for Linked Data and 
RDF vocabularies. An effort has been made to widen the scope of the 
validation to cover any kind of Linked Data (Vapour 1.0 was 
specifically targeted to RDF vocabularies). This new release 
contains a number of new exciting features, such as checking for 
meaningful triples in the response documents, links to popular 
semantic web browsers, and conclusions on the type of the resources 
per httpRange-14. Moreover, the input form is now simpler, and the 
reports are even more eye-catching than before.


Validate your linked data! Use our online service 
http://validator.linkeddata.org/


Source code and further details are available from:

http://vapour.sourceforge.net/

Many thanks to Tom Heath for allocating a linkeddata.org subdomain 
for Vapour.


Best,

--
Diego Berrueta and Sergio Fernández
R&D Department  -  CTIC Foundation
E-mail: [EMAIL PROTECTED]
Phone: +34 984 29 12 12
Parque Científico Tecnológico Gijón-Asturias-Spain
www.fundacionctic.org








Re: UMBEL Publicly Released

2008-07-16 Thread Chris Bizer


Hi Mike and Fred,

UMBEL (Upper Mapping and Binding Exchange Layer) [1] is a lightweight 
ontology for relating Web content and data to a standard set of 20,000 
subject concepts. Based on OpenCyc [2], these subject concepts have 
defined relationships between them, and can act as semantic binding nodes 
for any data or Web content.


Very nice work and great to see that UMBEL is available as Linked Data and 
already interlinked with various ontologies and datasets.


A further 1.5 million named entities have been extracted from Wikipedia 
and mapped to the UMBEL reference structure with cross-links to YAGO [3] 
and DBpedia [4].


In order to allow people to browse from DBpedia into UMBEL and in order to 
give Semantic Web crawlers more starting points to crawl UMBEL and its 
interlinked datasets, it would be great to set RDF links from DBpedia into 
UMBEL.


Would it be possible for you to send us the 1.5 million RDF links from UMBEL 
to DBpedia that you already have, so that we can serve them together with 
DBpedia via our Linked Data interface and SPARQL endpoint?


Reading the UMBEL documentation, I noticed that you use the umbel:isLike 
property to link to DBpedia and the owl:sameAs property to link to YAGO. For 
instance:


ne:Pfizer umbel:isLike dbpedia:Pfizer
ne:Pfizer owl:sameAs yago:Pfizer

The UMBEL spec defines umbel:isLike as:

"Additionally, the property umbel:isLike can be used to state that two named 
entities "likely" have the same identity."


Why do you think that your RDF links to YAGO are more likely to be correct 
than your links into DBpedia?


As all three datasets are derived from Wikipedia and Wikipedia page 
identifiers should be present in all three datasets, I think we could use 
owl:sameAs between all three.
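
As a minimal sketch of what such links could look like, here is a small 
Python script using rdflib (assumed to be installed). The DBpedia base URI 
is the real one; the ne: and yago: base URIs are placeholders, since the 
actual UMBEL named-entity and YAGO namespaces are not spelled out in this 
mail:

from rdflib import Graph, Namespace
from rdflib.namespace import OWL

# Placeholder namespaces for illustration only; substitute the official
# UMBEL named-entity and YAGO base URIs.
NE      = Namespace("http://example.org/umbel-ne/")
YAGO    = Namespace("http://example.org/yago/")
DBPEDIA = Namespace("http://dbpedia.org/resource/")

g = Graph()
g.bind("owl", OWL)

# Wikipedia page names shared by all three datasets act as the join key.
for name in ["Pfizer"]:
    g.add((NE[name], OWL.sameAs, DBPEDIA[name]))
    g.add((NE[name], OWL.sameAs, YAGO[name]))
    g.add((DBPEDIA[name], OWL.sameAs, YAGO[name]))

print(g.serialize(format="nt"))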


Keep up the great work, and let's hope that UMBEL develops into an important 
interlinking hub for the Web of Data.


Cheers

Chris


The system can easily be extended with additional dictionaries of named 
entities, including ones specific to enterprises or domains.


UMBEL is provided as open source under the Creative Commons 3.0 
Attribution-Share Alike license. The complete ontology with all subject 
concepts, definitions, terms and relationships can be freely downloaded 
[see 5].  All subject concepts and named entities are available as Linked 
Data [see 5].  Five volumes of documentation [5] are also available.


The release is accompanied by about a dozen Web services [6] for using or 
manipulating UMBEL, along with a new introductory slide show [7].


Additional release information may be found on Fred's [8] or my [9] 
separate blog postings.


We welcome anyone with interest or suggestions for improvements to share 
them through the UMBEL discussion forum [10].  We will shortly be putting 
easier services online for such input.


So, enjoy!  We look forward to your commentary, suggestions and putting 
UMBEL under production-grade stress.  We know we will be doing the same!


Regards, Mike


[1]  http://www.umbel.org/
[2]  http://www.opencyc.org
[3]  http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/
[4]  http://dbpedia.org
[5]  http://www.umbel.org/documentation.html
[6]  http://umbel.zitgist.com/
[7]  http://www.slideshare.net/mkbergman/
[8] 
http://fgiasson.com/blog/index.php/2008/07/16/starting-to-play-with-the-umbel-ontology/

[9]  http://www.mkbergman.com/?p=449
[10] http://groups.google.com/group/umbel-ontology/






UK Government moves forward with Data Sharing, APIs and Mashup Contest

2008-07-03 Thread Chris Bizer


Hi all,

very promising developments in the UK. See:

http://blog.programmableweb.com/2008/07/04/uk-government-moves-forward-with-data-sharing-apis-and-mashup-contest/

Maybe someone from this list wants to win this prize with some nice Linked 
Data mashup based on the UK government data?


Cheers

Chris


--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de 





Re: Linked Data re Non-Profits and NGO's. Have data, need vocabulary.

2008-06-26 Thread Chris Bizer
Hi Bob and Kingsley,

a while ago there was a RDF version of Wikicompany online at 
http://dbpedia.openlinksw.com/wikicompany/resource/Wikicompany

Maybe it would also be an idea to reuse terms from the vocabulary of this 
source.

Kingsley: The URI above currently returns a 500 status code. Do you know what 
happened to the site?

As an alternative you could also think about reusing the terms that are 
currently used within DBpedia to describe organizations.

Cheers

Chris

--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de
  - Original Message - 
  From: Bob Wyman 
  To: public-lod@w3.org 
  Sent: Friday, June 27, 2008 1:19 AM
  Subject: Linked Data re Non-Profits and NGO's. Have data, need vocabulary.


  I would like to make available as Linked Data several databases describing 
several million non-profits, NGO's and foundations. The data includes things 
like name of organization, address, budget, source of funds, major programs, 
key personnel, relationships to other organizations, area of expertise, etc.

  What I don't have is an RDF vocabulary with which to describe these things. 
While I could define one myself, I would like to base my work on existing 
"standards" or common practice; however, seemingly endless digging through the 
web indicates that there aren't any obvious standards for describing even 
basic things like an address in RDF. Perhaps I'm looking in the wrong places...

  Ideally, I would find some well formed vocabulary for a "Business 
Description" that I could use or adapt. I would appreciate it if anyone could 
give me pointers to either such a well worked vocabulary or at least to smaller 
vocabularies for things like address that I could use in composing a vocabulary 
with which to publish this data. Can you help?

  bob wyman



7th International Semantic Web Conference (ISWC2008) - Call for Posters and Demonstrations

2008-06-19 Thread Chris Bizer


-

CFP: ISWC 2008 POSTERS AND DEMONSTRATIONS

http://iswc2008.semanticweb.org/calls/call-for-posters-and-demonstrations/

To be held as part of 7th International Semantic Web Conference
26-30 October 2008, Karlsruhe, Germany
-

ISWC 2008 will hold combined poster and demonstration sessions. The
Poster/Demo Session is an opportunity for presenting late-breaking
results, ongoing research projects, and speculative or innovative work
in progress. Posters and demos are intended to provide authors and
participants with the ability to connect with each other and to engage
in discussions about the work. Technical posters, reports on Semantic
Web software systems, descriptions of completed work, and work in
progress are all welcome. Demonstrations are intended to showcase
innovative Semantic Web related implementations and technologies.


Important Dates
--
* July 25, 2008: Deadline for submissions
* September 5, 2008: Notification of acceptance
* September 19, 2008: Camera ready abstracts due

(time: 23:59 GMT-10 (Hawaii))


Submission Information
-
Authors must submit a two-page paper with a short abstract for
evaluation. The abstract must clearly demonstrate relevance to the
Semantic Web. Submissions will be evaluated for acceptability by the
reviewers. Decisions about acceptance will be based on relevance to
the Semantic Web, originality, potential significance, topicality and
clarity. If a poster will be accompanied by a live software
demonstration, authors are requested to submit an additional one-page
explanation of the demo, which will not be included in the poster/demo
notes. Submit the demo explanation as the third page of your two-page
paper.

A detailed list of suggested topics can be found in the calls for
papers both for the Research Track and for the Semantic Web In Use
track. Posters and demos are intended to convey a scientific result or
work in progress and are not intended as advertisements for software
packages.

Authors submitting a full paper to another track in ISWC 2008 may also
submit the same work for consideration in the Demo/Poster track,
either before or after result notification for the full paper. For
example, a demo can be provided for an accepted paper, or a poster can
be used to present work that was insufficiently mature for the
research track.

At least one of the Poster/Demo authors must be a registered
participant at the conference, and attend the Poster/Demo Session to
present the work. The abstracts for all accepted posters and demos
will be given to all conference attendees and published on the
conference web site. They will not be included in the formal
proceedings.

Poster and demo abstracts must be formatted in the style of the
Springer Publications format for Lecture Notes in Computer Science
(LNCS). For complete details, see Springer's Author Instructions.
Abstracts must be submitted in PDF format. Abstracts will not be
accepted in any other format. Abstracts that exceed the page limit
will be rejected without review. Please submit papers at the ISWC 2008
submission page.

Copyright forms will be required for all accepted papers. Details will
be sent with notifications of acceptance.


Further Information
--
For further information and for any questions regarding the event or
submissions, please contact the Posters and Demonstration co-chairs
Chris Bizer and Anupam Joshi.


Posters and Demonstrations co-chairs
------

Chris Bizer, Freie Universität Berlin, Germany
Anupam Joshi, University of Maryland, Baltimore County, USA







Re: More ESWC 2008 Linked Data Playground

2008-06-09 Thread Chris Bizer


Hi all,

some quick comments on this thread:

1. I'm currently swamped with work. Therefore getting the final version of 
the ESWC dataset (including some bug fixes, updated ESWC conf data website, 
...) out won't happen before the end of the week. The same is true for the 
WWW2008 dataset.


2. We developed an EasyChair XML dump to RDF conversion script which can be 
used by the dogfood project for further conferences and will reduce the 
effort of generating the RDF data.


3. It is very easy to point at other people and tell them that they should 
publish more Linked Data, support RDFa and support any other standard that 
comes to mind. Could the people doing the pointing please first check how 
much data they themselves have published on the Semantic Web so far?


4. Concerning attacking the problem from both sides: yes, sure. But this is 
exactly what the LOD project is about. The LOD community is getting more 
data out onto the Semantic Web (see http://richard.cyganiak.de/2007/10/lod/) 
and is working on browsers, search engines and Linked Data mashups which 
will produce the required application pull to make it beneficial for data 
providers to publish data on the Web (see 
http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/SemWebClients 
and 
http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/SemanticWebSearchEngines).


So the community is in the process of solving the problem and I don't see 
any reason to repeat the old chicken-egg discussions over and over.


Cheers

Chris

--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de
- Original Message - 
From: "Aldo Bucchi" <[EMAIL PROTECTED]>

To: "David Huynh" <[EMAIL PROTECTED]>
Cc: 
Sent: Monday, June 09, 2008 11:16 PM
Subject: Re: More ESWC 2008 Linked Data Playground




On Mon, Jun 9, 2008 at 4:33 PM, David Huynh <[EMAIL PROTECTED]> wrote:


Daniel,

Thank you for your detailed reply! I'm glad there is tremendous progress
toward a platform for conference metadata. And I applaud such an effort 
and

everyone involved.

You said, "a task which requires additional resources is less likely to 
be

pursued." May I suggest a different observation: "a task that has no
immediate, personal benefit, or instant gratification, is less likely to 
be

pursued." Whether you add that  makes no observable
difference to anyone, including yourself, who has just a standard web
browser. Humans are known to generally optimize for short-term, personal
gains over long-term prospects for humanity. So, why expend the effort to
add that line of HTML code? Am I making sense to you?

If we look at this problem from a "return on investment" point of view, 
then
your suggestion for "remove as many barriers as possible" is about 
lowering
the investment--which I totally agree. My suggestion is about increasing 
the

return from "no observable difference" to "some observable difference". I
was simply wondering what people on this mailing list have done to 
achieve

"some observable difference," that's all.


Yes, there is a definite lack of incentive to publish structured data.
It all comes down to the fact that generation and consumption are now
totally dissociated.

In the HTML world, the publisher provides the UI, and therefore the
context in which data is consumed. This allows for a straightforward
way to exploit the generated attention at consumption time: ads,
self-promotion, socialization (comments), etc.

This is, from my POV, the most overlooked aspect of the Semweb
(please correct me if I am wrong). Alas, it is not one that we can
design upfront. Social dynamics and economics will define, in the long
term, the semantic data value chains.

Any pointers on this topic?



I think attacking the problem from both ends will help us make progress
faster.

Best regards,

David
P.S. I believe I already got all the answers to my questions here, so for 
me

there's no need to continue this thread, unless you want to.







--
 Aldo Bucchi 
+56 9 7623 8653
skype:aldo.bucchi






BlogPost: How to extract useful OWL from Freebase

2008-06-09 Thread Chris Bizer


Hi,

there seems to be some progress around getting RDF out of Freebase. See:

http://clockwerx.blogspot.com/2008/06/how-to-extract-useful-owl-from-freebase.html

Maybe a good starting point for a Freebase Linked Data wrapper 

Cheers

Chris


--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de 





Re: ESWC 2008 Linked Data Playground

2008-05-29 Thread Chris Bizer


Hi David,

Now, with all that open linked data, how much work does it take to get 
lat/lng of 2 dozen well-known organizations? Should be trivial, right?


Yes, trivial and very useful indeed.

If somebody would go through the effort and put them online somewhere, we 
would be more than happy to set RDF links from the ESWC 2008 data to this data.


Cheers

Chris


--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de
- Original Message - 
From: "David Huynh" <[EMAIL PROTECTED]>

To: <[EMAIL PROTECTED]>
Cc: 
Sent: Thursday, May 29, 2008 4:35 PM
Subject: Re: ESWC 2008 Linked Data Playground




Kingsley Idehen wrote:


David Huynh wrote:


Hi Richard,

If you look at this version


http://people.csail.mit.edu/dfhuynh/projects/eswc2008/eswc2008-rdf.html

you'll see that the RDF/XML file is linked directly. So, there's pretty 
much zero cost in getting RDF into Exhibit (it took me about 5 minutes 
to put that exhibit up). Exhibit automatically routes RDF/XML files 
through Babel for conversion. In the first version, I did that 
conversion and saved the JSON output so that Babel won't get invoked 
every time someone views the exhibit. That's an optimization. Of course, 
Babel isn't perfect in doing the conversion.


Here is an iPhone mockup version for the same exhibit:

http://people.csail.mit.edu/dfhuynh/projects/eswc2008/iphone/iphone-exhibit.html
I only tested it on Firefox and Safari. I think the back button 
functionality doesn't quite work that well, but you get the idea.


David

David,

Even if you don't use RDFa to express what <> is about i.e. it's 
foaf:primarytopic, foaf:topic etc..


In the Exhibit pages you can accompany:

<link href="http://data.semanticweb.org/dumps/conferences/eswc-2008-complete.rdf" 
type="application/rdf+xml" rel="exhibit/data" />

with

<link href="http://data.semanticweb.org/dumps/conferences/eswc-2008-complete.rdf" 
type="application/rdf+xml" />


I think we need to adopt a multi-pronged approach to exposing Linked Data 
(the raw data behind the Web Page):


1. Content Negotiation (where feasible)
2. <link> (for RDF sniffers/crawlers)
3. RDFa
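
A rough sketch of how a client might combine approaches 1 and 2 above (RDFa 
handling omitted), in Python using only the standard library; the exhibit URL 
is the one from earlier in the thread and may of course have moved since:

import urllib.request
from html.parser import HTMLParser

PAGE = "http://people.csail.mit.edu/dfhuynh/projects/eswc2008/eswc2008-rdf.html"

class RDFLinkFinder(HTMLParser):
    """Collects href values of <link> elements that advertise RDF/XML."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("type") == "application/rdf+xml" and "href" in a:
            self.hrefs.append(a["href"])

# 1. Content negotiation: ask for RDF/XML first.
req = urllib.request.Request(PAGE, headers={"Accept": "application/rdf+xml"})
with urllib.request.urlopen(req) as resp:
    content_type = resp.headers.get_content_type()
    body = resp.read().decode("utf-8", errors="replace")

if content_type == "application/rdf+xml":
    rdf_sources = [PAGE]            # the server negotiated RDF directly
else:
    # 2. Fall back to sniffing <link type="application/rdf+xml"> in the HTML.
    finder = RDFLinkFinder()
    finder.feed(body)
    rdf_sources = finder.hrefs

print(rdf_sources)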


Re. point 2, I've just taken a random person "Abhita Chugh 
<http://demo.openlinksw.com/rdfbrowser2/?uri=http%3A%2F%2Fsemanticweb.org%2Fwiki%2FAbhita_Chugh>" 
from <http://data.semanticweb.org> which exposes the RDF based 
Description of "Abhita Chugh" 
<http://demo.openlinksw.com/rdfbrowser2/?uri=http%3A%2F%2Fsemanticweb.org%2Fwiki%2FAbhita_Chugh> 
via our RDF Browser without problems (we use all 3 of the methods above 
to seek "Linked Data" associated with a Web Document). In this case it 
also eliminates the need to translate anything (e.g. routing via Babel) 
since the original data source is actually RDF.


Of course, I could take the exhibit page and slap this in myself, but I 
am hoping you could tweak Exhibit such that it does point 2 and maybe 
point 3 automatically. That would be a major boost re. Exhibit's Linked 
Data credentials :-)

Exhibit can't do #2 because it only acts on the page at runtime, so the 
author of an exhibit must put that in herself. And that I just did for the 
ESWC 2008 exhibits.


BTW, Semtech 2008 has a cool Exhibit-backed event browser:
   http://www.semantic-conference.com/scheduler/
Maybe future *SWC conferences would have use for the same service.

So, it'd be good to get lat/lng coordinates for the affiliations and then 
plot the speakers on a map, like what I did for ISWC 2007 (just for 
kicks):


http://people.csail.mit.edu/dfhuynh/projects/graph-based-exhibit/graph-based-exhibit2.html
Now, with all that open linked data, how much work does it take to get 
lat/lng of 2 dozen well-known organizations? Should be trivial, right?


David







ESWC 2008 Linked Data Playground

2008-05-28 Thread Chris Bizer



Hi all,

Paul, Richard, Knud, Tom, Sean, Denny and I have published some data 
describing papers and authors of the 5th European Semantic Web Conference 
(ESWC2008).


The data is interlinked with DBpedia, Revyu and the Semantic Web Community 
Wiki. So if you add a review about a paper to Revyu or if you add something 
to the wiki, your data will mix nicely with the data that is already 
published about the conference.


See http://data.semanticweb.org/conference/eswc/2008/html
for a description of the dataset and its use cases.

The data is currently also being crawled by several Semantic Web Search 
Engines and we hope to be able to add a section about how to use the data 
within these Search Engines before the weekend.


If you have any ideas for further use cases or know about other Semantic Web 
client applications that could be used to navigate, visualize or search the 
data please let me know so that we can add them to the webpage.


Have fun playing with the data!

Cheers

Chris


--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de 





Open Library provides API to access 13.4 million books

2008-05-16 Thread Chris Bizer


Hi all,

I just saw this on programmable web:

The Open Library has released an API that provides access to metadata about 
13.4 million books, including over 234,000 records with full text. See:


http://blog.programmableweb.com/2008/05/16/open-library-api-cataloging-13-million-books/

Does anybody know if they are also working on a Linked Data interface to 
their catalog?


If not, this API really cries for a Linked Data wrapper :-)

Cheers

Chris


--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de 





Re: [Dbpedia-discussion] linking conceptnet to Dbpedia, Yago

2008-05-11 Thread Chris Bizer


Hi Akshay,

Conceptnet is a commonsense database in form of a large semantic network 
containing

150,000 concepts and 700,000 assertions. http://conceptnet.media.mit.edu/
The concepts are linked to each other by 23 predicates (e.g. Dog isa pet).
Currently I am trying to bring this dataset (Creative Commons 3.0 license)
onto the Semantic Web by linking it with DBpedia.


The description of Conceptnet on the project page sounds like this ontology 
could be very useful within many projects and it would be great to have it 
on the Semantic Web!


However, as I am new to this field, I request your advice on how it can be 
done.
So far I have been successful in linking the good-quality (more than one 
assertion) concepts from ConceptNet with cwcc classes and yago classes
(the concepts are linked with categories, as all members of a category can 
be considered related to the concept).
The data is available from here:

aubhat2.googlepages.com

Before linking the concepts with Wikipedia articles (as done in the Cyc 
dataset), I would like to get your opinions regarding the similarity metrics 
which can be used.


Frederick and Mike (cc'ed) are currently doing similar work with 
interlinking their UMBEL ontology http://www.umbel.org/ with DBpedia and 
Yago.


I guess they would be the right people to talk to about similarity metrics.

For information about publishing your data, you could have a look at 
http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/


Keep up the good work!

Cheers

Chris




--
akshay uday bhat.
department of chemical engineering
university institute of chemical technology
mumbai India


___
Dbpedia-discussion mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion 





Linked Data talks at XTech in Dublin (6th-9th May)

2008-05-02 Thread Chris Bizer


Hi

there are plenty of interesting Linked Data talks at XTech this year. For 
instance:


1. Linked Data Deployment (Daniel Lewis, OpenLink)
   http://2008.xtech.org/public/schedule/detail/561

2. The Programmes Ontology (Tom Scott, BBC and all)
   http://2008.xtech.org/public/schedule/detail/524

   There really seems to be massive stuff happening inside the BBC :-)

3. SemWebbing the London Gazette (Jeni Tennison, The Stationery Office)
   http://2008.xtech.org/public/schedule/detail/528

  A second industry deployment talk.

4. Searching, publishing and remixing a Web of Semantic Data (Giovanni 
Tummarello,  DERI Galway)

   http://2008.xtech.org/public/schedule/detail/583

5. Building a Semantic Web Search Engine: Challenges and Solutions (Aidan 
Hogan, DERI Galway)

   http://2008.xtech.org/public/schedule/detail/477

  So, two Semantic Web search engines being presented.

6. 'That's not what you said yesterday!' - evolving your Web API (Ian Davis, 
Talis)

   http://2008.xtech.org/public/schedule/detail/550


and a complete track on Open Data
http://2008.xtech.org/public/schedule/topic/23

Great program. Too bad that I can not be there :-(

Cheers

Chris


--
Prof. Dr. Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de 





Good tutorial giving an Introduction into the Web of Data held by Metaweb at the Web 2.0 Summit

2008-04-29 Thread Chris Bizer



Hi,

I just scanned through the slides of the "Creating Semantic Mashups: Bridging 
Web 2.0 and the Semantic Web" tutorial that Jamie, Colin and Toby from 
Metaweb held at the Web 2.0 Summit.


See: http://en.oreilly.com/webexsf2008/public/schedule/detail/2961

It gives a great overview of the ideas behind the Web of Data and relates 
them nicely to classic Web applications and Web 2.0 mashups.


Worth reading!

Abstract:

This tutorial will identify how the architecture of participation can be 
extended by combining open data and open source semantic technologies. We 
will use simple, hands-on examples to expose participants to semantic 
techniques that are possible today. The tutorial will culminate by working 
through the development of a semantic widget for movie reviews that makes 
use of the techniques described.


Cheers

Chris


--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de 





Re: Linking non-open data

2008-04-18 Thread Chris Bizer


Hi Peter,

reading your "ramblings", they actually make a lot of sence to me and I 
think I even like them better than my own initial ideas on the problem as 
your approach nicely avoids the owl:sameAs.


Anybody else have further ideas?

Cheers

Chris


--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de
- Original Message - 
From: Peter Coetzee

To: Chris Bizer
Cc: Matthias Samwald ; public-lod@w3.org ; Tassilo Pellegrini ; Andreas 
Blumauer (Semantic Web Company)

Sent: Friday, April 18, 2008 11:06 AM
Subject: Re: Linking non-open data


Hi Chris,

I like the sound of this, as a neat and elegant way to work round the 
problem. The only concern I'd have, is that it lacks any "backwards" links 
from the protected to the public data object. For example, if my agent finds 
the triple



http://mydomain//resource/myResource foaf:interest
http://yourDomain/resource/ProtectedDataAboutObjectX

in some document out there, and *doesn't* have (and cannot get) the 
credentials to access http://yourDomain/resource/ProtectedDataAboutObjectX, 
it has no way of knowing that it might be able to get some data about the 
"real" object (please excuse my loose language!) being discussed from 
http://yourDomain/resource/PublicDataAboutObjectX, or does it? Note, I'm 
assuming here that my agent hasn't encountered the owl:sameAs elsewhere on 
its 'travels'.


I guess there are two obvious solutions to me; either every time we refer to 
ProtectedDataAboutObjectX, we must also include the owl:sameAs to 
PublicDataAboutObjectX, or we must always refer to PublicDataAboutObjectX 
and rely on its linked-ness into ProtectedDataAboutObjectX to get at that 
data if we have credentials. Hmmm - both feel a little bit cumbersome to me, 
what do you think?


On a slightly separate (and less tangible) note, I feel slightly 
uncomfortable with the notion of "refer to that URI about ObjectX because I 
know what data it will serve" - in theory (when the whole world is 
passionate about interlinking their datasets ;) ), shouldn't it be ok to 
refer to any URI for the object, and (perhaps eventually) get to whichever 
data you seek? I recognise that in practise this would be unnecessarily 
inefficient, but stick with me for a minute! As an extension of that 
feeling, it strikes me as odd to mint two different URIs for the same thing, 
solely to get around a mechanical issue like authentication. Perhaps what 
I'm getting at then is something more along the lines of:


1. Use the resource http://yourDomain/resource/ObjectX to refer to the 
resource itself (always)
2. When someone dereferences http://yourDomain/resource/ObjectX, they are 
required to attempt to authenticate
3a. If the client fails to authenticate, they are presented with only the 
public data - perhaps by using a suitable redirect to 
http://yourDomain/resource/PublicDataAboutObjectX (note - no owl:sameAs 
needed, as we're always referring to http://yourDomain/resource/ObjectX)
3b. If the client provides sufficient credentials, they are presented with 
the protected data as well (again, either directly or through a redirect to 
http://yourDomain/resource/ProtectedDataAboutObjectX; whichever is deemed to 
be more "pure")


This mechanism would also permit the server on http://yourDomain/ to serve 
different facets on the data depending on the user who has authenticated 
(e.g. it may be that a "student" user can't see as much data as a 
"supervisor", etc). It also removes (I think?) the risk of agents reaching 
an unnecessary dead-end when they follow a link to 
http://yourDomain/resource/ProtectedDataAboutObjectX.


Apologies for the fairly rambling train of thought - I hope it was vaguely 
coherent!


Any thoughts?

Cheers,
Peter



On Fri, Apr 18, 2008 at 4:21 AM, Chris Bizer <[EMAIL PROTECTED]> wrote:

Hi Peter,

One of the problems this presents though, is how to advertise the data 
that's
available for a user. Perhaps something like the Semantic Web Sitemap 
Extension
[1] could be used / extended to say what data is available behind this 
authentication,
so that an agent knows whether or not it's interested in trying to find 
credentials for it

(e.g. prompting a user)?


Building on the Sitemap Extension would be one option, but I think 
advertising could also work much more simply, just by setting RDF links to 
the access-protected resources.


So you could have something like this:

1. Use http://yourDomain/resource/PublicDataAboutObjectX to identify your 
resource and the public data about it.


2. If some client dereferences this URI, it would get the public data 
containing an RDF link like this:


http://yourDomain/resource/PublicDataAboutObjectX owl:sameAs 
http://yourDomain/resource/ProtectedDataAboutObjectX


3. If the client would then try to dereference 
http://yourDomain/resource/Protecte

Re: Linking non-open data

2008-04-17 Thread Chris Bizer
Hi Peter,

> One of the problems this presents though, is how to advertise the data that's 
> available for a user. Perhaps something like the Semantic Web Sitemap 
> Extension
> [1] could be used / extended to say what data is available behind this 
> authentication, 
> so that an agent knows whether or not it's interested in trying to find 
> credentials for it 
> (e.g. prompting a user)?

Building on the Sitemap Extension would be one option, but I think advertising 
could also work much more simply, just by setting RDF links to the 
access-protected resources.

So you could have something like this:

1. Use http://yourDomain/resource/PublicDataAboutObjectX to identify your 
resource and the public data about it.

2. If some client dereferences this URI, it would get the public data containing 
an RDF link like this: 

http://yourDomain/resource/PublicDataAboutObjectX owl:sameAs 
http://yourDomain/resource/ProtectedDataAboutObjectX

3. If the client would then try to dereference 
http://yourDomain/resource/ProtectedDataAboutObjectX it would be asked to 
provide some credentials.

Using this mechanism, external data providers could also link to the protected 
data, for instance:

http://mydomain//resource/myResource foaf:interest
http://yourDomain/resource/ProtectedDataAboutObjectX
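
A minimal client-side sketch of this mechanism in Python (using rdflib and 
the standard library); the URIs are the placeholders from this mail and will 
not actually resolve, and the credentials are made up:

import urllib.error
import urllib.request
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

PUBLIC_URI = "http://yourDomain/resource/PublicDataAboutObjectX"   # placeholder

# 1. Dereference the public URI and collect owl:sameAs links to protected variants.
g = Graph()
g.parse(PUBLIC_URI)                                  # rdflib performs the HTTP GET
protected = list(g.objects(URIRef(PUBLIC_URI), OWL.sameAs))

# 2. Try to dereference each protected URI with HTTP Basic credentials.
passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
for uri in protected:
    passwords.add_password(None, str(uri), "alice", "secret")      # made-up credentials
opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(passwords))

for uri in protected:
    try:
        with opener.open(str(uri)) as resp:
            print(uri, "->", resp.status)            # 200: credentials accepted
    except urllib.error.HTTPError as e:
        print(uri, "->", e.code)                     # 401: credentials missing or rejected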

What do you think?

Cheers

Chris


--
Chris Bizer
Freie Universität Berlin
+49 30 838 54057
[EMAIL PROTECTED]
www.bizer.de
  - Original Message ----- 
  From: Peter Coetzee 
  To: Chris Bizer ; Matthias Samwald 
  Cc: public-lod@w3.org ; Tassilo Pellegrini ; Andreas Blumauer (Semantic Web 
Company) 
  Sent: Thursday, April 17, 2008 2:03 PM
  Subject: Re: Linking non-open data


  Hi all,


  On Thu, Apr 17, 2008 at 12:25 PM, Chris Bizer <[EMAIL PROTECTED]> wrote:


Hi Matthias,




  A question that will surely arise in many places when more people get to 
know about the linked data initiative and the growing infrastructure of linked 
open data is: how can these principles be applied to organizational data that 
might not / only partially be open to the public web?



I think applying the Linked Data principles within a corporate intranet 
does not pose any specific requirements. It is just that the data is not 
accessible from the outside.

  It sounds to me like deploying linked data over an intranet would be towards 
the "trivial" side of solutions - what about when data is out on (dare I say, 
in? ;) ) the web fully, but you need to control access to it (i.e. the 
authentication Matthias describes). I like the idea of using standard HTTP 
authentication for this - it just seems like the "right" mechanism to use. One 
of the problems this presents though, is how to advertise the data that's 
available for a user. Perhaps something like the Semantic Web Sitemap Extension 
[1] could be used / extended to say what data is available behind this 
authentication, so that an agent knows whether or not it's interested in trying 
to find credentials for it (e.g. prompting a user)?





  People will soon try to develop practices for selectively protecting 
parts of their linked data with fine-grained access rights. Could simple HTTP 
authentication be useful for linked data?



As Linked Data heavily relies on HTTP anyway, I think HTTP authentication 
should be the first choice, and people with these requirements should check if 
they can go with HTTP auth.



  How does authentication work for SPARQL endpoints containing several 
named graphs?



Of course you can always make things as difficult as you like. But I guess 
for many use cases an all-or-nothing approach is good enough, which would allow 
HTTP authentication to be used again.

  If you wanted slightly more fine-grained control, I don't see any reason you 
can't still use HTTP auth - if you pass the authenticated user details through 
to whatever framework you're using on the backend to handle SPARQL, and then 
check "does this user have permissions" for each of the named graphs mentioned 
in the query.
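
A toy illustration of that per-named-graph check (the ACL, graph IRIs and the 
crude regex-based IRI extraction are all invented for illustration; a real 
implementation would inspect the parsed query instead of using regular 
expressions):

import re

# Made-up access control list: user -> named graphs the user may read.
ACL = {"alice": {"http://example.org/graphs/public",
                 "http://example.org/graphs/team"}}

GRAPH_REF = re.compile(r'(?:FROM\s+NAMED|FROM|GRAPH)\s+<([^>]+)>', re.IGNORECASE)

def graphs_allowed(user, query):
    """Crude check: every graph IRI mentioned in the query must be in the user's ACL."""
    mentioned = set(GRAPH_REF.findall(query))
    return mentioned <= ACL.get(user, set())

query = """
SELECT ?s FROM NAMED <http://example.org/graphs/team>
WHERE { GRAPH <http://example.org/graphs/team> { ?s ?p ?o } }
"""
print(graphs_allowed("alice", query))   # True: both graph references are permitted
print(graphs_allowed("bob", query))     # False: bob has no ACL entry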
   



  Can we use RDF vocabularies to represent access rights? Should such 
vocabularies be standardized?



Sure, but I think all work in this area should be based on clearly 
motivated real-world use cases and collecting these use cases should be the 
first step before starting to define vocabularies.



  Is there any ongoing work on defining such practices (or even 'best 
practices')?



There is lots of work on using RDF, OWL and different rule languages to 
represent access control policies. See for instance Rei, KAoS and Protune, the 
SemWeb policy workshop at http://www.l3s.de/~olmedilla/events/2006/SWPW06/ 
and, for older work, also http://www4.wiwiss.fu-berlin.de/bizer/SWTSGuide/

But I guess a lot of this will be a bit over-the-top for the common linked 
data use cases.

Cheers

Chris





  Che

  1   2   >