Sent: Tuesday, May 08, 2018 at 2:29 PM
From: "Sebastian Hellmann" <hellm...@informatik.uni-leipzig.de>
To: "Discussion list for the Wikidata project" <wikidata@lists.wikimedia.org>, "Laura
Morales" <laure...@mail.com>
Subject: Re: [Wikidata] DBpedia Databus (alpha version)
Hi Laura,
> I don't understand, is this just another project built on DBPedia, or a project to replace DBPedia entirely?
A valid question. DBpedia is quite decentralised and hard to understand in its entirety. Some parts are being improved, and others will eventually be replaced (also an improvement, hopefully).
The main improvement is that we no longer have large monolithic releases that take forever. The language chapters and the professional community in particular can work better with the "platform" in terms of turnaround, effective contribution and incentives for contributing. Another hoped-for improvement is that we can maintain contributions and add-ons more sustainably; these were formerly lost between releases. Overall, the structure and processes will be clearer.
The DBpedia at the "main endpoint" will still be there, in the same way that nl.dbpedia.org/sparql and wikidata.dbpedia.org/sparql are there. The new hosted service will be more of a knowledge graph of knowledge graphs, where you can either get all information in a fused way or quickly jump to the sources, compare them and make improvements there. Projects and organisations can also upload their data to query it themselves, share it with others and persist it. Companies can sell or advertise their data. The core consists of the Wikipedia/Wikidata data, and we hope to improve it and to send contributors and contributions back to the Wikiverse.
> Are you a DBPedia maintainer?
Yes. Over the last year I took it as my task to talk to everybody in the community, and to draft and aggregate the new strategy and drive innovation.
All the best,
Sebastian
On 08.05.2018 13:42, Laura Morales wrote:
I don't understand, is this just another project built on DBPedia, or a project
to replace DBPedia entirely? Are you a DBPedia maintainer?
Sent: Tuesday, May 08, 2018 at 1:29 PM
From: "Sebastian Hellmann"
<hellm...@informatik.uni-leipzig.de>
To: "Discussion list for the Wikidata project."
<wikidata@lists.wikimedia.org>
Subject: [Wikidata] DBpedia Databus (alpha version)
DBpedia Databus (alpha version)
The DBpedia Databus is a platform that allows multiple stakeholders to exchange, curate and access data. Any data entering the bus is versioned, cleaned, mapped and linked, and its licenses and provenance are tracked. Hosting in multiple formats will be provided, so the data can be accessed either as a dump download or via an API. Data governance stays with the data contributors.
Vision
Working with data is hard and repetitive. We envision a hub where everybody
can upload data and where useful operations like versioning, cleaning,
transformation, mapping, linking, merging and hosting are done automagically on a
central communication system (the bus) and then dispersed again through a
decentralised network to consumers and applications.
On the Databus, data flows from producers through the platform to consumers
(left to right), while errors and feedback flow in the opposite direction back
to the data source, providing a continuous-integration service that improves
the data at its source.
Open Data vs. Closed (paid) Data
We have studied the data network for 10 years now, and we conclude that
organisations with open data struggle to work together properly: they could and
should, but are hindered by technical and organisational barriers, and they
duplicate work on the same data. On the other hand, companies selling data
cannot do so in a scalable way. The loser is the consumer, left with the choice
of inferior open data or buying from a jungle-like market.
Publishing data on the databus
If you are grinding your teeth over how to publish data on the web, you can
simply use the Databus to do so. Data loaded on the bus will be highly visible,
available and queryable. You should think of it as a service:
- Visibility guarantees that your citations and reputation go up.
- Besides a web download, we can also provide a Linked Data interface, SPARQL
endpoint, Lookup (autocomplete) or many other means of availability (like AWS
or Docker images).
- Any distribution we do will funnel feedback and collaboration opportunities
your way, helping you improve your dataset and your internal data quality.
- You will receive an enriched dataset, connected to and complemented with any
other available data (see the same folder names in the data and fusion folders).
Data Sellers
If you are selling data, the Databus provides numerous opportunities for you.
You can link your offering to the open entities on the Databus, which lets
consumers discover your services more easily, since they are shown with each
request.
Data Consumers
Open data on the Databus will be a commodity. We are greatly lowering the cost
of understanding the data, and of retrieving and reformatting it. We are
constantly extending the ways the data can be used, and we are willing to
implement any formats and APIs you need.
If you lack a certain kind of data, we can also scout for it and load it onto
the bus.
How the Databus works at the moment
We are still in an initial state, but we already load 10 datasets (6 from
DBpedia, 4 external) onto the bus using these phases:
- Acquisition: data is downloaded from the source and logged.
- Conversion: data is converted to N-Triples and cleaned (syntax parsing,
datatype validation and SHACL).
- Mapping: the vocabulary is mapped onto the DBpedia Ontology and converted (we
have been doing this for Wikipedia's infoboxes and Wikidata, but now we do it
for other datasets as well).
- Linking: links are mainly collected from the sources, cleaned and enriched.
- IDying: every entity found is given a new Databus ID for tracking.
- Clustering: IDs are merged into clusters, using one of the Databus IDs as the
cluster representative.
- Data comparison: each dataset is compared with all the other datasets. We
have an algorithm that decides on the best value, but the main goal here is
transparency, i.e. to see which data value was chosen and how it compares to
the other sources.
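To make the last three phases concrete, here is a toy sketch of IDying, clustering and value fusion. All function names, the ID scheme and the "pick the most frequent value" rule are hypothetical illustrations, not the actual Databus code:

```python
from collections import defaultdict

def assign_databus_ids(entities):
    """IDying: give every source-local entity a new Databus ID (hypothetical scheme)."""
    return {e: f"https://id.dbpedia.org/global/{i}" for i, e in enumerate(sorted(entities))}

def cluster(links, ids):
    """Clustering: merge linked entities (union-find); the lexicographically
    smallest member's Databus ID represents the cluster."""
    parent = {e: e for e in ids}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    for a, b in links:
        union(a, b)
    return {e: ids[find(e)] for e in ids}

def fuse(values_by_source):
    """Data comparison/fusion: pick the most frequent value, but keep every
    source's value so the choice stays transparent."""
    counts = defaultdict(list)
    for src, val in values_by_source.items():
        counts[val].append(src)
    best = max(counts, key=lambda v: len(counts[v]))
    return best, dict(counts)

# toy example: two sources agree on the architect, one differs
best, provenance = fuse({
    "dbpedia.org": "Gustave Eiffel",
    "nl.dbpedia.org": "Gustave Eiffel",
    "other.org": "G. Eifel",
})
```

The returned `provenance` map is what makes the fusion transparent: for each candidate value you can see exactly which sources supplied it.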
The output is a main knowledge graph fused from all the sources, i.e. a
transparent aggregate. For each source, we also produce a local fused version
called the "Databus Complement". This is a major feedback mechanism for all
data providers: they can see what data they are missing, which data differs in
other sources, and what links are available for their IDs.
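A minimal sketch of what such a complement could compute, assuming triples are modelled as (subject, predicate, object) tuples; the real Databus format and mechanism may differ:

```python
def complement(fused, source):
    """Compare one source against the fused graph: which fused facts the
    source is missing entirely, and where its values differ (hypothetical)."""
    # index each graph by (subject, predicate)
    f = {(s, p): o for s, p, o in fused}
    g = {(s, p): o for s, p, o in source}
    missing = {k: f[k] for k in f.keys() - g.keys()}  # facts the source lacks
    differing = {k: (g[k], f[k]) for k in f.keys() & g.keys() if f[k] != g[k]}
    return missing, differing

# toy example: the source lacks the height and spells the architect differently
fused = [("EiffelTower", "architect", "Gustave Eiffel"),
         ("EiffelTower", "height", "300m")]
source = [("EiffelTower", "architect", "G. Eifel")]
missing, differing = complement(fused, source)
```

A provider receiving this output sees both the facts to add and the values to double-check against the other sources.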
You can compare all the data via a web service (early prototype; it only works for the Eiffel Tower so far):
http://88.99.242.78:9000/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F12HpzV&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Farchitect&src=general
We aim for a real-time system, but at the moment we are doing a monthly cycle.
Is it free?
Maintaining the Databus is a lot of work, and the servers incur a high cost. As
a rule of thumb, we provide for free everything that we can afford to provide
for free. DBpedia provided everything for free in the past, but this is not a
healthy model, as we could neither maintain quality properly nor grow.
On the Databus, everything is provided "as is", without any guarantees or
warranty. Improvements can be made by the volunteer community. The DBpedia
Association will provide a business interface to allow guarantees, major
improvements, stable maintenance and hosting.
License
The final databases are licensed under ODC-By. This covers our work on the
recomposition of the data. Each fact is licensed individually: Wikipedia
abstracts, for example, are CC-BY-SA, some facts are CC-BY-NC, and some are
copyrighted. This means the data is available for research, informational and
educational purposes. We recommend contacting us for any professional use of
the data (clearing), so we can guarantee that legal matters are handled
correctly; otherwise professional use is at your own risk.
Download
The Databus data is available at http://downloads.dbpedia.org/databus/ ,
ordered into three main folders:
- Data: the data that is currently loaded on the bus
- Global: provenance data and the mappings to the new IDs
- Fusion: the output of the Databus
Most notably you can find:
- Provenance mapping of the new IDs:
http://downloads.dbpedia.org/databus/global/persistence-core/cluster-iri-provenance-ntriples/
and
http://downloads.dbpedia.org/databus/global/persistence-core/global-ids-ntriples/
- The final fused version for the core:
http://downloads.dbpedia.org/databus/fusion/core/fused/
- A detailed JSON-LD file for data comparison:
http://downloads.dbpedia.org/databus/fusion/core/json/
- Complements, e.g. the enriched Dutch DBpedia version:
http://downloads.dbpedia.org/databus/fusion/core/nl.dbpedia.org/
(Note that the file and folder structure are still subject to change)
Sources

Glue (linksets between datasets):

Source                  Target               Amount
de.dbpedia.org          www.viaf.org            387,106
diffbot.com             www.wikidata.org        516,493
d-nb.info               viaf.org              5,382,783
d-nb.info               dbpedia.org              80,497
d-nb.info               sws.geonames.org         50,966
fr.dbpedia.org          www.viaf.org                266
sws.geonames.org        dbpedia.org             545,815
kb.nl                   viaf.org              2,607,255
kb.nl                   www.wikidata.org        121,012
kb.nl                   dbpedia.org              37,676
www.wikidata.org        permid.org                5,133
wikidata.dbpedia.org    www.wikidata.org     45,344,233
wikidata.dbpedia.org    sws.geonames.org      3,495,358
wikidata.dbpedia.org    viaf.org              1,179,550
wikidata.dbpedia.org    d-nb.info               601,665
Plan for the next releases
- Include more existing data from DBpedia.
- Renew all DBpedia releases in a separate fashion:
  - DBpedia Wikidata is already running: http://78.46.100.7/wikidata/
  - Basic extractors like infobox properties and mappings will be active soon.
  - Text extraction will take a while.
- Load all the data into the comparison tool:
http://88.99.242.78:9000/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F12HpzV&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Farchitect&src=general
- Load all the data into a SPARQL endpoint.
- Create a simple open-source tool that lets everybody push data onto the bus
in an automated way.
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
All the best,
Sebastian Hellmann
Director of the Knowledge Integration and Linked Data Technologies (KILT)
Competence Center at the Institute for Applied Informatics (InfAI) at Leipzig
University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org,
https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org