Markus,

Thanks for the thorough reply!

> you can use SPARQL 1.1 transitive closure in queries (using "*" after
> properties), so you can find "all subclasses" there too. (You could also
> try this in Protege ...)


I had a feeling I was missing something basic.  (I'm also new to SPARQL.)
Using "*" after the property got me what I was looking for in Protege out
of the box, no reasoner needed.  That is,

SELECT ?subject
WHERE
{
   ?subject rdfs:subClassOf* <http://www.wikidata.org/entity/Q82586> .
}

-- with an asterisk after rdfs:subClassOf -- got me the transitive closure
and returned all subclasses of Q82586 / "lepton".
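
As I understand the SPARQL 1.1 property-path semantics, "*" also matches
the zero-length path, so Q82586 itself shows up in the results.  For only
the proper subclasses, "+" (one or more steps) should work:

SELECT ?subject
WHERE
{
   ?subject rdfs:subClassOf+ <http://www.wikidata.org/entity/Q82586> .
}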

> Should we maybe create an English label file for the classes? Descriptions
> too or just labels?
>

A file with English labels and descriptions for classes would be great and,
I think, address this use case.  Per your note, I suppose one would simply
concatenate that English terms file and wikidata-taxonomy.nt into a new .nt
file, then import that into Protege to explore the class hierarchy.
(Having every line in the ontology be self-contained in N-Triples is very
convenient!)
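
As a sketch of what I have in mind (the file names here are just my
guesses, since the English terms file doesn't exist yet):

# hypothetical file names: wikidata-terms-en.nt, taxonomy-with-labels.nt
gunzip -c wikidata-taxonomy.nt.gz > taxonomy-with-labels.nt
cat wikidata-terms-en.nt >> taxonomy-with-labels.nt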

Regarding the pruned subset, I think the command-line approach in your
examples is enough for me to get started making my own.
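
For instance, I imagine a first slice along the lines of your zgrep
example could pull out every triple that mentions a single item (the
output file name is just mine):

# all lines/triples mentioning lepton (Q82586) -- a sketch, not tested
zgrep "http://www.wikidata.org/entity/Q82586" wikidata-statements.nt.gz > lepton-slice.nt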

I won't have time to experiment with these things for a few weeks, but I
will return to this then and let you know any interesting findings.

Cheers,
Eric


On Sat, Jun 14, 2014 at 4:41 AM, Markus Krötzsch
<mar...@semantic-mediawiki.org> wrote:

> Eric,
>
> Two general remarks first:
>
> (1) Protege is for small and medium ontologies, but not really for such
> large datasets. To get SPARQL support for the whole data, you could
> install Virtuoso. It also comes with a simple Web query UI. Virtuoso does
> not do much reasoning, but you can use SPARQL 1.1 transitive closure in
> queries (using "*" after properties), so you can find "all subclasses"
> there too. (You could also try this in Protege ...)
>
> (2) If you want to explore the class hierarchy, you can also try our new
> class browser:
>
> http://tools.wmflabs.org/wikidata-exports/miga/?classes
>
> It has the whole class hierarchy, but without the "leaves" (= instances of
> classes and subclasses that have no subclasses or instances of their own).
> For example, it tells you that "lepton" has 5 direct subclasses, but shows
> only one:
>
> http://tools.wmflabs.org/wikidata-exports/miga/?classes#_item=3338
>
> On the other hand, it includes relationships of classes and properties
> that are not part of the RDF (we extract these from the data by considering
> co-occurrence). Example:
>
> "Classes that have no superclasses but at least 10 instances, and which
> are often used with the property 'sex or gender'":
>
> http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Direct%20superclasses=__null/Number%20of%20direct%20instances=10%20-%2020000/Related%20properties=sex%20or%20gender
>
> I have already added superclasses for some of those in Wikidata -- the data
> in the browser is updated with some delay, based on dump files.
>
>
> More answers below:
>
>
> On 14/06/14 05:52, emw wrote:
>
>> Markus,
>>
>> Thank you very much for this.  Translating Wikidata into the language of
>> the Semantic Web is important.  Being able to explore the Wikidata
>> taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive
>> queries) is really neat, e.g.
>>
>> SELECT ?subject
>> WHERE
>> {
>>     ?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> .
>> }
>>
>> This is probably just my ignorance of Protege, but I notice that
>> the above query returns only the direct subclasses of Q82586.  The full
>> set of subclasses for Q82586 ("lepton") is visible at
>> http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&lang=en
>> -- a few of the 2nd-level subclasses (muon neutrino, tau neutrino,
>> electron neutrino) are shown there but not returned by that SPARQL
>> query.  It seems rdfs:subClassOf isn't being treated as a transitive
>> property in Protege.  Any ideas?
>>
>
> You need a reasoner to compute this properly. For a plain class hierarchy
> as in our case, ELK should be a good choice [1]. You can install the ELK
> Protege plugin and use it to classify the ontology [2]. Protege will then
> show the computed class hierarchy in the browser; I am not sure what
> happens to the SPARQL queries (it's quite possible that they don't use the
> reasoner).
>
> [1] https://code.google.com/p/elk-reasoner/
> [2] https://code.google.com/p/elk-reasoner/wiki/ElkProtege
>
>
>
>> Do you know when the taxonomy data in OWL will have labels available?
>>
>
> We had not thought of this as a use case. A challenge is that the label
> data is quite big because of the many languages. Should we maybe create an
> English label file for the classes? Descriptions too or just labels?
>
>
>
>> Also, regarding the complete dumps, would it be possible to export a
>> smaller subset of the faithful data?  The files under "Complete Data
>> Dumps" in
>> http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too
>> big to load into Protege on most personal computers, and would likely
>> require adjusting JVM memory settings even on higher-end machines.  If it's
>> feasible to somehow prune those files -- and maybe even combine them
>> into one file that could be easily loaded into Protege -- that would be
>> especially nice.
>>
>
> What kind of "pruning" do you have in mind? You can of course take a
> subset of the data, but then some of the data will be missing.
>
> A general remark on mixing and matching RDF files. We use the N-Triples
> format, where every line in the ontology is self-contained (no multi-line constructs, no
> header, no namespaces). Therefore, any subset of the lines of any of our
> files is still a valid file. So if you want to have only a slice of the
> data (maybe to experiment with), then you could simply do something like:
>
> gunzip -c wikidata-statements.nt.gz | head -10000 > partial-data.nt
>
> "head" simply selects the first 10000 lines here. You could also use grep
> to select specific triples instead, such as:
>
> zgrep "http://www.w3.org/2000/01/rdf-schema#label" wikidata-terms.nt.gz |
> grep "@en ." > en-labels.nt
>
> This selects all English labels. I am using zgrep here for a change; you
> can also use gunzip as above. Similar methods can also be used to count
> things in the ontology (use grep -c to count lines = triples).
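>
> For example, to count the English labels with the same pattern as above:
>
> zgrep "http://www.w3.org/2000/01/rdf-schema#label" wikidata-terms.nt.gz |
> grep -c "@en ."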
>
> Finally, you can combine multiple files into one by simply concatenating
> them in any order:
>
> cat partial-data-1.nt > mydata.nt
> cat partial-data-2.nt >> mydata.nt
> ...
>
> Maybe you can experiment a bit and let us know if there is any export that
> would be particularly meaningful for you.
>
> Cheers,
>
> Markus
>
>
>> Thanks,
>> Eric
>> https://www.wikidata.org/wiki/User:Emw
>>
>> 1. http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/wikidata-taxonomy.nt.gz
>> 2. http://protege.stanford.edu/
>>
>>
>>
>>
>>
>> On Tue, Jun 10, 2014 at 4:43 AM, Markus Kroetzsch
>> <markus.kroetz...@tu-dresden.de> wrote:
>>
>>     Hi all,
>>
>>     We are now offering regular RDF dumps for the content of Wikidata:
>>
>>     http://tools.wmflabs.org/wikidata-exports/rdf/
>>
>>     RDF is the Resource Description Framework of the W3C that can be
>>     used to exchange data on the Web. The Wikidata RDF exports consist
>>     of several files that contain different parts and views of the data,
>>     and which can be used independently. Details on the available
>>     exports and the RDF encoding used in each can be found in the paper
>>     "Introducing Wikidata to the Linked Data Web" [1].
>>
>>     The available RDF exports can be found in the directory
>>     http://tools.wmflabs.org/wikidata-exports/rdf/exports/. New
>>     exports are generated regularly from current data dumps of Wikidata
>>     and will appear in this directory shortly afterwards.
>>
>>     All dump files have been generated using Wikidata Toolkit [2]. There
>>     are some important differences in comparison to earlier dumps:
>>
>>     * Data is split into several dump files for convenience. Pick
>>     whatever you are most interested in.
>>     * All dumps are generated using the OpenRDF library for Java (better
>>     quality than ad hoc serialization; much slower too ;-)
>>     * All dumps are in the N-Triples format, the simplest RDF
>>     serialization format there is.
>>     * In addition to the faithful dumps, some simplified dumps are also
>>     available (one statement = one triple; no qualifiers and references).
>>     * Links to external data sets are added to the data for Wikidata
>>     properties that point to datasets with RDF exports. That's the
>>     "Linked" in "Linked Open Data".
>>
>>     Suggestions for improvements and contributions on GitHub are welcome.
>>
>>     Cheers,
>>
>>     Markus
>>
>>     [1]
>>     http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web
>>     [2] https://www.mediawiki.org/wiki/Wikidata_Toolkit
>>
>>     --
>>     Markus Kroetzsch
>>     Faculty of Computer Science
>>     Technische Universität Dresden
>>     +49 351 463 38486
>>     http://korrekt.org/
>>