Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata
Is there any RDF dump available of OpenCorporates data? Or even any dump at all? Their licensing terms are ambiguous... They say it's released under ODbL, but if I want to use the data I have to ask permission and they will decide if I can use it for free or if I have to pay a fee :/ Sent: Wednesday, October 25, 2017 at 9:44 AM From: "Jakob Voß" To: wikidata@lists.wikimedia.org Subject: Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata Hi Luigi, I favour cooperation with OpenCorporates instead of independently adding lots of company records to Wikidata. Sure, there are parallel strategies, but any effort should also include OpenCorporates to some degree. OpenCorporates is licensed under ODbL (just added this referenced statement to Q7095760) and we have property P1320 to link Wikidata and OpenCorporates. A first step would be to align https://opencorporates.com/registers with https://en.wikipedia.org/wiki/List_of_company_registers Right now we have 18 instances of company register (Q1394657) and its subclasses explicitly classified as such in Wikidata. These items should be linked with the registers listed at OpenCorporates, e.g. UK Companies House (Q257303) = https://opencorporates.com/registers/270 I've also noticed that OpenCorporates has a field for "Identifiers" where Wikidata QIDs may be included to have two-way links between the two datasets. Anyway, better contact https://opencorporates.com/info/contributing at least to let them know about your plans.
Cheers, Jakob -- Jakob Voß Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de/ ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata
OK, just asked. Their reply was that they "reserves the right under paragraph 3.3 of ODbL to release the database under different terms", which is to say their data is NOT free because they want to control how and where the data is used. Are we starting to see "free vs open" all over again, this time with data instead of software? Sent: Wednesday, October 25, 2017 at 5:06 PM From: "Thad Guidry" To: "Discussion list for the Wikidata project." Subject: Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata Laura, Talk to OpenCorporates and ask those questions yourself. Get involved ! :) -Thad +ThadGuidry On Wed, Oct 25, 2017 at 3:22 AM Laura Morales <laure...@mail.com> wrote: Is there any RDF dump available of OpenCorporates data? Or even any dump at all? Their licensing terms are ambiguous... They say it's released under ODbL, but if I want to use the data I have to ask permission and they will decide if I can use it for free or if I have to pay a fee :/
[Wikidata] Wikidata HDT dump
Hello everyone, I'd like to ask if Wikidata could please offer an HDT [1] dump along with the already available Turtle dump [2]. HDT is a binary format for storing RDF data, which is pretty useful because it can be queried from the command line, it can be used as a Jena/Fuseki source, and it also uses orders of magnitude less space to store the same data. The problem is that it's very impractical to generate an HDT file, because the current implementation requires a lot of RAM to convert a file. For Wikidata it will probably require a machine with 100-200GB of RAM. This is unfeasible for me because I don't have such a machine, but if you guys have one to share, I can help set up the rdf2hdt software required to convert the Wikidata Turtle to HDT. Thank you. [1] http://www.rdfhdt.org/ [2] https://dumps.wikimedia.org/wikidatawiki/entities/
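For reference, the conversion step being requested looks roughly like this (a sketch assuming the hdt-cpp tools are built and enough RAM is available; flag names follow the rdfhdt/hdt-cpp tools, and the `-i` index flag is mentioned later in this thread — paths are examples):

```shell
# Decompress the weekly Turtle dump, keeping the original archive.
gunzip -k latest-all.ttl.gz

# Convert Turtle to HDT; -f names the input serialization,
# -i additionally writes the .hdt.index side file needed for querying.
rdf2hdt -f turtle -i latest-all.ttl wikidata.hdt
```

The resulting `wikidata.hdt` plus `wikidata.hdt.index` pair is what the thread later argues should both be published.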
Re: [Wikidata] Wikidata HDT dump
> Would it be an idea if HDT remains unfeasible to place the journal file of > blazegraph online? > Yes, people need to use blazegraph if they want to access the files and query > it but it could be an extra next to turtle dump? How would a blazegraph journal file be better than a Turtle dump? Maybe it's smaller in size? Simpler to use?
Re: [Wikidata] Wikidata HDT dump
> Dear Laura, others, > > If somebody points me to the RDF datadump of Wikidata I can deliver an > HDT version for it, no problem. (Given the current cost of memory I > do not believe that the memory consumption for HDT creation is a > blocker.) This would be awesome! Thanks Wouter. To the best of my knowledge, the most up to date dump is this one [1]. Let me know if you need any help with anything. Thank you again! [1] https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz --- Cheers, Wouter Beek. Email: wou...@triply.cc WWW: http://triply.cc Tel: +31647674624 On Fri, Oct 27, 2017 at 5:08 PM, Laura Morales wrote: > Hello everyone, > > I'd like to ask if Wikidata could please offer a HDT [1] dump along with the > already available Turtle dump [2]. HDT is a binary format to store RDF data, > which is pretty useful because it can be queried from command line, it can be > used as a Jena/Fuseki source, and it also uses orders-of-magnitude less space > to store the same data. The problem is that it's very impractical to generate > a HDT, because the current implementation requires a lot of RAM processing to > convert a file. For Wikidata it will probably require a machine with > 100-200GB of RAM. This is unfeasible for me because I don't have such a > machine, but if you guys have one to share, I can help setup the rdf2hdt > software required to convert Wikidata Turtle to HDT. > > Thank you. 
> > [1] http://www.rdfhdt.org/ > [2] https://dumps.wikimedia.org/wikidatawiki/entities/
Re: [Wikidata] Wikidata HDT dump
> You can mount the jnl file directly to blazegraph so loading and indexing is > not needed anymore. How much larger would this be compared to the Turtle file?
Re: [Wikidata] Wikidata HDT dump
> is it possible to store a weighted adjacency matrix as an HDT instead of an > RDF? > > Something like a list of entities for each entity, or even better a list of > tuples for each entity. > So that a tuple could be generalised with properties. Sorry, I don't know this, you would have to ask the devs. As far as I understand, it's a triplestore and that should be it...
Re: [Wikidata] Wikidata HDT dump
> Javier D. Fernández of the HDT team was very quick to fix the link :-) Their dump is almost 1 year old though.
Re: [Wikidata] Wikidata HDT dump
> The first part of the Turtle data stream seems to contain syntax errors for > some of the XSD decimal literals. The first one appears on line 13,291: > > Notice that scientific notation is not allowed in the lexical form of > decimals according to XML > Schema Part 2: Datatypes (https://www.w3.org/TR/xmlschema11-2/#decimal). (It is allowed in > floats and doubles.) Is this a known issue or should I report this somewhere? I wouldn't call these "syntax" errors, just "logical/type" errors. It would be great if these could be fixed by changing the type from decimal to float/double. On the other hand, I've never seen any medium or large dataset without this kind of error. So I would personally treat these as warnings at worst. @Wouter when you build the HDT file, could you please also generate the .hdt.index file? With rdf2hdt, this should be activated with the -i flag. Thank you again!
Re: [Wikidata] Wikidata HDT dump
> @Wouter: As Stas said, you might report that error. I don't agree with Laura > who tried to underestimate that "syntax error". It's also about quality ;) Don't get me wrong, I am all in favor of data quality! :) So if this can be fixed, it's better! The thing is, I've seen so many datasets with these kinds of type errors that by now I pretty much live with them and I'm OK with these warnings (the triple is not broken after all, it's just not following the standards). > @Laura: Do you have a different rdf2hdt program or the one in the GitHub of > HDT project ? I just use https://github.com/rdfhdt/hdt-cpp compiled from the master branch. To verify data instead, I use riot (a command-line tool from the Apache Jena package) like this: `riot --validate file.nt`.
Re: [Wikidata] Wikidata HDT dump
> Also, for avoiding your users to re-create the models, you can pre-load > "models" from LOV catalog. The LOV RDF dump is broken instead. Or at least it still was the last time I checked. And I don't mean broken in the sense of Wikidata, that is with some wrong types; I mean broken as in it doesn't validate at all (some triples are broken).
Re: [Wikidata] Wikidata HDT dump
> Thanks for reporting that. I remember one issue that I added here > https://github.com/pyvandenbussche/lov/issues/66 Yup, still broken! I've tried just now.
Re: [Wikidata] Wikidata HDT dump
> No, the idea is that each organization will have its own KNS, so users can > add the KNS that they want. How would this compare with a traditional SPARQL endpoint + "federated queries", or with "linked fragments"?
Re: [Wikidata] Wikidata HDT dump
> @Laura : you mean this list http://lov.okfn.org/lov.nq.gz ? > I can download it !! > > Which one ? Please send me the URL and I can fix it !! Yes you can download it, but the nq file is broken. It doesn't validate because some URIs contain white spaces, and some triples have an empty subject (i.e. <>).
Re: [Wikidata] Wikidata HDT dump
> KBox is an alternative to other existing architectures for publishing KBs such > as SPARQL endpoints (e.g. LDFragments, Virtuoso), and dump files. > I should add that you can do federated queries with KBox as easily as you > can do with SPARQL endpoints. OK, but I still fail to see the value of this. What's the reason why I'd want to use it rather than just start a Fuseki endpoint, or use linked fragments?
[Wikidata] How to get direct link to image
- wikidata entry: https://www.wikidata.org/wiki/Q161234 - "logo image" property pointing to: https://commons.wikimedia.org/wiki/File:0_A.D._logo.png However... that's an HTML page... How do I get a reference to the .png file? In this case https://upload.wikimedia.org/wikipedia/commons/1/1c/0_A.D._logo.png Thanks.
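For what it's worth, those `/1/1c/` path components are not arbitrary: Commons derives the upload path from the MD5 hash of the file name (first hex digit, then first two). A minimal sketch, assuming the standard hashed-upload-directory scheme (note that file names containing characters beyond spaces may additionally need percent-encoding):

```python
import hashlib

def commons_direct_url(filename: str) -> str:
    """Build the direct upload.wikimedia.org URL for a Commons file.

    Commons stores files under <h[0]>/<h[0:2]>/ where h is the MD5
    hex digest of the file name (spaces replaced by underscores).
    """
    name = filename.replace(" ", "_")
    h = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"https://upload.wikimedia.org/wikipedia/commons/{h[0]}/{h[:2]}/{name}"

print(commons_direct_url("0_A.D._logo.png"))
# → https://upload.wikimedia.org/wikipedia/commons/1/1c/0_A.D._logo.png
```

This matches the example URL above without any API round-trip.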
Re: [Wikidata] How to get direct link to image
> You can also use the Wikimedia Commons API made by Magnus: https://tools.wmflabs.org/magnus-toolserver/commonsapi.php > It will also give you metadata about the image (so you'll be able to cite > the author of the image when you reuse it). Is the same metadata also available in the Turtle/HDT dump?
Re: [Wikidata] Wikidata HDT dump
@Wouter > Thanks for the pointer! I'm downloading from > https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz now. Any luck so far?
Re: [Wikidata] Wikidata HDT dump
> @Laura: I suspect Wouter wants to know if he "ignores" the previous errors > and proposes a rather incomplete dump (just for you) or waits for Stas' > feedback. OK. I wonder though, if it would be possible to set up a regular HDT dump alongside the already regular dumps. Looking at the dumps page, https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a new dump is generated about once a week. So if an HDT dump could be added to the schedule, it should show up with the next dump and then so forth with the future dumps. Right now even the Turtle dump contains the bad triples, so adding an HDT file now would not introduce more inconsistencies. The problem will be fixed automatically with the future dumps once the Turtle is fixed (because the HDT is generated from the .ttl file anyway). > Btw why don't you use the oldest version in HDT website? 1. I have downloaded it and I'm trying to use it, but the HDT tools (e.g. query) require building an index before I can use the HDT file. I've tried to create the index, but I ran out of memory again (even though the index is smaller than the .hdt file itself). So any Wikidata dump should contain both the .hdt file and the .hdt.index file, unless there is another way to generate the index on commodity hardware. 2. Because it's 1 year old :)
Re: [Wikidata] Wikidata HDT dump
I feel like you are misrepresenting my request, and possibly trying to offend me as well. My "UC" as you call it, is simply that I would like to have a local copy of wikidata and query it using SPARQL. Everything that I've tried so far doesn't seem to work on commodity hardware since the database is so large. But HDT could work. So I asked if an HDT dump could, please, be added to the other dumps that are periodically generated by wikidata. I also told you already that *I AM* trying to use the 1 year old dump, but in order to use the HDT tools I'm told that I *MUST* generate some other index first, which unfortunately I can't generate for the same reasons that I can't convert the Turtle to HDT. So what I was trying to say is that if wikidata were to add any HDT dump, this dump should contain both the .hdt file and .hdt.index in order to be useful. That's about it, and it's not just about me. Anybody who wants to have a local copy of wikidata could benefit from this, since setting up a .hdt file seems much easier than a Turtle dump. And I don't understand why you're trying to blame me for this. If you are part of the wikidata dev team, I'd greatly appreciate a "can/can't" or "don't care" response rather than the passive-aggressive game that you displayed in your last email. > Let me try to understand ... > You are a "data consumer" with the following needs: > - Latest version of the data > - Quick access to the data > - You don't want to use the current ways to access the data by the > publisher (endpoint, ttl dumps, LDFragments) > However, you ask for a binary format (HDT), but you don't have enough memory > to set up your own environment/endpoint due to lack of memory. > For that reason, you are asking the publisher to support both .hdt and > .hdt.index files. > > Do you think there are many users with your current UC?
Re: [Wikidata] Wikidata HDT dump
> I've just loaded the provided hdt file on a big machine (32 GiB wasn't enough to build the index but ten times this is more than enough) Could you please share a bit about your setup? Do you have a machine with 320GB of RAM? Could you please also try to convert wikidata.ttl to hdt using "rdf2hdt"? I'd be interested to read your results on this too. Thank you! > I'll try to run a few queries to see how it behaves. I don't think there is a command-line tool to parse SPARQL queries, so you probably have to set up a Fuseki endpoint which uses HDT as a data source.
Re: [Wikidata] Wikidata HDT dump
> It's a machine with 378 GiB of RAM and 64 threads running Scientific > Linux 7.2, that we use mainly for benchmarks. > > Building the index was really all about memory because the CPUs have > actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared > to those of my regular workstation, which was unable to build it. If your regular workstation was using more CPU, I guess it was because of swapping. Thanks for the statistics; it means a "commodity" CPU could handle this fine, the bottleneck is RAM. I wonder how expensive it is to buy a machine like yours... it sounds like in the $30K-$50K range? > You're right. The limited query language of hdtSearch is closer to > grep than to SPARQL. > > Thank you for pointing out Fuseki, I'll have a look at it. I think a SPARQL command-line tool could exist, but AFAICT it doesn't exist (yet?). Anyway, I have already successfully set up Fuseki with an HDT backend, although my HDT files are all small. Feel free to drop me an email if you need any help setting up Fuseki.
Re: [Wikidata] Wikidata HDT dump
> I am currently downloading the latest ttl file. On a 250gig ram machine. I > will see if that is sufficient to run the conversion. Otherwise we have > another busy one with around 310 gig. Thank you! > For querying I use the Jena query engine. I have created a module called > HDTQuery located at http://download.systemsbiology.nl/sapp/ which is a simple > program under development that should be able to use the full power of > SPARQL and be more advanced than grep… ;) Does this tool allow querying HDT files from the command line, with SPARQL, and without the need to set up a Fuseki endpoint? > If this all works out I will see with our department if we can set up, if it > is still needed, a weekly cron job to convert the TTL file. But as it is > growing rapidly we might run into memory issues later? Thank you!
Re: [Wikidata] Wikidata HDT dump
> Please take me out from these conversations. Sorry for the long thread, this is probably a small inconvenience with mailing lists. However the "Subject" is always the same, so you can delete messages right away without having to read them.
Re: [Wikidata] Wikidata HDT dump
> There is also a command line tool called hdtsparql in the hdt-java distribution that allows exactly this. It used to support only SELECT queries, but I've enhanced it to support CONSTRUCT, DESCRIBE and ASK queries too. There are some limitations, for example only CSV output is supported for SELECT and N-Triples for CONSTRUCT and DESCRIBE. Thank you for sharing. > The tool is in the hdt-jena package (not hdt-java-cli where the other command line tools reside), since it uses parts of Jena (e.g. ARQ). > There is a wrapper script called hdtsparql.sh for executing it with the proper Java environment. Does this tool work nicely with large HDT files such as wikidata? Or does it need to load the whole graph+index into memory?
Re: [Wikidata] Wikidata HDT dump
Hello list, a very kind person from this list has generated the .hdt.index file for me, using the 1-year old wikidata HDT file available at the rdfhdt website. So I was finally able to set up a working local endpoint using HDT+Fuseki. Setup was easy, launch time (for Fuseki) was also quick (a few seconds); the only change I made was to replace -Xmx1024m with -Xmx4g in the Fuseki startup script (btw I'm not very proficient in Java, so I hope this is the correct way). I've run some queries too. Simple select or traversal queries seem fast to me (I haven't measured them but the response is almost immediate); other queries such as "select distinct ?class where { [] a ?class }" take several seconds or a few minutes to complete, which kinda tells me the HDT indexes don't work well on all queries. But otherwise, for simple queries it works perfectly! At least I'm able to query the dataset! In conclusion, I think this more or less gives some positive feedback for using HDT on a "commodity computer", which means it can be very useful for people like me who want to use the dataset locally but who can't set up a full-blown server. If others want to try as well, they can offer more (hopefully positive) feedback. For all of this, I wholeheartedly plead with any wikidata dev to please consider scheduling an HDT dump (.hdt + .hdt.index) along with the other regular dumps that it creates weekly. Thank you!!
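For anyone wanting to reproduce this HDT+Fuseki setup, the wiring is done through a Fuseki assembler file. The sketch below is based on the hdt-jena integration; the hdt: vocabulary, service name, and file paths are assumptions to adapt to your installation, not verbatim config:

```turtle
# Fuseki assembler sketch: serve an HDT file read-only over SPARQL.
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix hdt:    <http://www.rdfhdt.org/fuseki#> .

<#service> a fuseki:Service ;
    fuseki:name         "wikidata" ;   # endpoint: /wikidata/sparql
    fuseki:serviceQuery "sparql" ;
    fuseki:dataset      <#dataset> .

<#dataset> a ja:RDFDataset ;
    ja:defaultGraph <#hdtGraph> .

<#hdtGraph> a hdt:HDTGraph ;
    hdt:fileName "/data/wikidata.hdt" .  # expects wikidata.hdt.index alongside
```

The graph is backed directly by the .hdt file on disk, which is why only a modest -Xmx heap is needed on top of it.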
Re: [Wikidata] Wikidata HDT dump
> Thank you for this feedback, Laura. > Is the hdt index you got available somewhere on the cloud? Unfortunately it's not. It was a private link that was temporarily shared with me by email. I guess I could re-upload the file somewhere else myself, but my uplink is really slow (1Mbps).
Re: [Wikidata] Wikidata HDT dump
> I’ve created a Phabricator task (https://phabricator.wikimedia.org/T179681) > for providing a HDT dump, let’s see if someone else (ideally from the ops > team) responds to it. (I’m not familiar with the systems we currently use for > the dumps, so I can’t say if they have enough resources for this.) Thank you Lucas!
Re: [Wikidata] Wikidata HDT dump
How many triples does wikidata have? The old dump from rdfhdt seems to have about 2 billion, which means wikidata doubled the number of triples in less than a year? Sent: Tuesday, November 07, 2017 at 3:24 PM From: "Jérémie Roquet" To: "Discussion list for the Wikidata project." Subject: Re: [Wikidata] Wikidata HDT dump Hi everyone, I'm afraid the current implementation of HDT is not ready to handle more than 4 billion triples as it is limited to 32-bit indexes. I've opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135 Until this is addressed, don't waste your time trying to convert the entire Wikidata to HDT: it can't work. -- Jérémie
Re: [Wikidata] Wikidata HDT dump
> drops `a wikibase:Item` and `a wikibase:Statement` types Off topic but... why drop `a wikibase:Item`? Without this it seems impossible to retrieve a list of items.
[Wikidata] Wikipedia page from wikidata ID
How can I get the Wikipedia URL of a wikibase:Item ID? Searching online I could only find how to do this using the MediaWiki API, but I was wondering if I can extract/generate URLs from the wikidata graph itself. Thanks.
Re: [Wikidata] Wikipedia page from wikidata ID
> schema:about connects Wikidata items with Wikipedias, e.g., > > Wikidata Query Service: "SELECT * WHERE { ?page schema:about wd:Q80 }" > > The triple is also available directly from the MediaWiki entity: > > https://www.wikidata.org/entity/Q80.nt Thank you! I was looking for "outgoing" links from a wikidata item to their corresponding page, but if I understand correctly the links point the other way around (from a schema:Article to a wikibase:Item). I think I've got this. Thanks.
Re: [Wikidata] Wikipedia page from wikidata ID
> I am not sure where you are trying to do this and how but > https://www.wikidata.org/wiki/Special:GoToLinkedPage > might be useful. You can call it with an item ID and a wiki code in the URL > and it will redirect you to the article on that wiki. Thanks Lydia. I was trying to retrieve the wikipedia page from the RDF dump. "schema:about" seems to be the right property.
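Putting the answers in this sub-thread together, the usual pattern over the dump or the Query Service is to combine schema:about with schema:isPartOf to pick out one wiki (Q42 here is just an example item):

```sparql
# English Wikipedia article for Douglas Adams (wd:Q42);
# schema:isPartOf narrows the sitelinks to a single wiki.
PREFIX wd:     <http://www.wikidata.org/entity/>
PREFIX schema: <http://schema.org/>

SELECT ?article WHERE {
  ?article schema:about wd:Q42 ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
```

Without the schema:isPartOf triple, the query returns one sitelink per language edition.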
[Wikidata] RDF: All vs Truthy
Can somebody please explain (in simple terms) the difference between the "all" and "truthy" RDF dumps? I've read the explanation available on the wiki [1] but I still don't get it. If I'm just a user of the data, because I want to retrieve information about a particular item and link items with other graphs... what am I missing/leaving out by using "truthy" instead of "all"? A practical example would be appreciated since it would clarify things, I suppose. [1] https://www.wikidata.org/wiki/Wikidata:Database_download#RDF_dumps
Re: [Wikidata] RDF: All vs Truthy
> If you want to know when, why, where, etc, you have to > check the qualified "full" statements. All these qualifiers are encoded as additional triples in "all", correct?
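For a concrete picture of the two dump flavours: "truthy" contains one direct wdt: triple per best-rank statement, while "all" additionally materializes each statement as a node carrying its qualifiers and references. A sketch in the Wikidata RDF vocabulary (the statement id and the end-time qualifier value here are illustrative, not taken from the real dump):

```turtle
# Truthy dump: a single direct triple.
wd:Q42 wdt:P69 wd:Q691283 .            # Douglas Adams, educated at: St John's College

# "All" dump: the same claim as a statement node with a qualifier.
wd:Q42 p:P69 wds:Q42-abc123 .          # statement id is illustrative
wds:Q42-abc123 ps:P69 wd:Q691283 ;
               pq:P582 "1974-01-01T00:00:00Z"^^xsd:dateTime .  # end time qualifier
```

So the wdt: triple answers "where was he educated", while only the p:/ps:/pq: triples can answer "until when".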
Re: [Wikidata] Wikidata HDT dump
* T H A N K Y O U * > On 7 Nov I created an HDT file based on the then current download link > from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz Thank you very very much Wouter!! This is great! Out of curiosity, could you please share some info about the machine that you've used to generate these files? In particular I mean hardware info, such as the model names of mobo/cpu/ram/disks. Also "how long" it took to generate these files would be interesting information. > PS: If this resource turns out to be useful to the community we can > offer an updated HDT file at a to be determined interval. This would be fantastic! Wikidata dumps about once a week, so I think even a new HDT file every 1-2 months would be awesome. Related to this however... why not use the Laundromat for this? There are several datasets that are very large, and rdf2hdt is really expensive to run. Maybe you could schedule regular jobs for several graphs (wikidata, dbpedia, wordnet, linkedgeodata, government data, ...) and make them available at the Laundromat? * T H A N K Y O U *
Re: [Wikidata] DBpedia Databus (alpha version)
I don't understand, is this just another project built on DBPedia, or a project to replace DBPedia entirely? Are you a DBPedia maintainer? Sent: Tuesday, May 08, 2018 at 1:29 PM From: "Sebastian Hellmann" To: "Discussion list for the Wikidata project." Subject: [Wikidata] DBpedia Databus (alpha version) DBpedia Databus (alpha version) The DBpedia Databus is a platform that allows multiple stakeholders to exchange, curate and access data. Any data entering the bus will be versioned, cleaned, mapped, linked and its licenses and provenance tracked. Hosting in multiple formats will be provided to access the data either as a dump download or as an API. Data governance stays with the data contributors. Vision Working with data is hard and repetitive. We envision a hub, where everybody can upload data and then useful operations like versioning, cleaning, transformation, mapping, linking, merging, hosting are done automagically on a central communication system (the bus) and then dispersed again in a decentral network to the consumers and applications. On the databus, data flows from data producers through the platform to the consumers (left to right); any errors or feedback flows in the opposite direction and reaches the data source to provide a continuous integration service and improve the data at the source. Open Data vs. Closed (paid) Data We have studied the data network for 10 years now and we conclude that organisations with open data are struggling to work together properly, although they could and should, but are hindered by technical and organisational barriers. They duplicate work on the same data. On the other hand, companies selling data cannot do so in a scalable way. The loser is the consumer, with the choice of inferior open data or buying from a jungle-like market. Publishing data on the databus If you are grinding your teeth about how to publish data on the web, you can just use the databus to do so.
Data loaded on the bus will be highly visible, available and queryable. You should think of it as a service: Visibility guarantees that your citations and reputation go up. Besides a web download, we can also provide a Linked Data interface, SPARQL endpoint, Lookup (autocomplete) or many other means of availability (like AWS or Docker images). Any distribution we are doing will funnel feedback and collaboration opportunities your way to improve your dataset and your internal data quality. You will receive an enriched dataset, which is connected and complemented with any other available data (see the same folder names in data and fusion folders). Data Sellers If you are selling data, the databus provides numerous opportunities for you. You can link your offering to the open entities in the databus. This allows consumers to discover your services better by showing them with each request. Data Consumers Open data on the databus will be a commodity. We are greatly lowering the cost of understanding the data, retrieving and reformatting it. We are constantly extending ways of using the data and are willing to implement any formats and APIs you need. If you are lacking a certain kind of data, we can also scout for it and load it onto the databus.
How the Databus works at the moment We are still in an initial state, but we already load 10 datasets (6 from DBpedia, 4 external) onto the bus using these phases: Acquisition: data is downloaded from the source and logged in. Conversion: data is converted to N-Triples and cleaned (syntax parsing, datatype validation and SHACL). Mapping: the vocabulary is mapped onto the DBpedia Ontology and converted (we have been doing this for Wikipedia's Infoboxes and Wikidata, but now we do it for other datasets as well). Linking: links are mainly collected from the sources, cleaned and enriched. IDying: all entities found are given a new Databus ID for tracking. Clustering: IDs are merged into clusters using one of the Databus IDs as cluster representative. Data Comparison: each dataset is compared with all other datasets. We have an algorithm that decides on the best value, but the main goal here is transparency, i.e. to see which data value was chosen and how it compares to the other sources. A main knowledge graph fused from all the sources, i.e. a transparent aggregate. For each source, we are producing a local fused version called the "Databus Complement". This is a major feedback mechanism for all data providers, where they can see what data they are missing, what data differs in other sources and what links are available for their IDs. You can compare all data via a webservice (early prototype, just works for Eiffel Tower): http://88.99.242.78:9000/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F12HpzV&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Farchitect&src=general
Re: [Wikidata] DBpedia Databus (alpha version)
So, in short, DBPedia is turning into a business with a "community edition + enterprise edition" kind of model?

Sent: Tuesday, May 08, 2018 at 2:29 PM From: "Sebastian Hellmann" To: "Discussion list for the Wikidata project" , "Laura Morales" Subject: Re: [Wikidata] DBpedia Databus (alpha version)

Hi Laura,

> I don't understand, is this just another project built on DBPedia, or a project to replace DBPedia entirely?

A valid question. DBpedia is quite decentralised and hard to understand in its entirety. So actually some parts are improved and others will be replaced eventually (also an improvement, hopefully). The main improvement here is that we no longer have large monolithic releases that take forever. Especially the language chapters, and also the professional community, can work better with the "platform" in terms of turnaround, effective contribution and incentives for contribution. Another thing that will hopefully improve is that we can more sustainably maintain contributions and add-ons, which were formerly lost between releases. So the structure and processes will be clearer.

The DBpedia in the "main endpoint" will still be there, in the same way that nl.dbpedia.org/sparql or wikidata.dbpedia.org/sparql are there. The new hosted service will be more a knowledge graph of knowledge graphs, where you can either get all information in a fused way or quickly jump to the sources, compare them and make improvements there. Projects and organisations can also upload their data to query it there themselves, or share it with others and persist it. Companies can sell or advertise their data. The core consists of the Wikipedia/Wikidata data, and we hope to be able to improve it and also send contributors and contributions back to the Wikiverse.

> Are you a DBPedia maintainer?

Yes, I took it as my task to talk to everybody in the community over the last year and to draft/aggregate the new strategy and innovate.
All the best, Sebastian

On 08.05.2018 13:42, Laura Morales wrote: I don't understand, is this just another project built on DBPedia, or a project to replace DBPedia entirely? Are you a DBPedia maintainer?

Sent: Tuesday, May 08, 2018 at 1:29 PM From: "Sebastian Hellmann" To: "Discussion list for the Wikidata project." Subject: [Wikidata] DBpedia Databus (alpha version)

DBpedia Databus (alpha version)

The DBpedia Databus is a platform that allows multiple stakeholders to exchange, curate and access data. Any data entering the bus will be versioned, cleaned, mapped, linked, and its licenses and provenance tracked. Hosting in multiple formats will be provided to access the data either as a dump download or as an API. Data governance stays with the data contributors.

Vision

Working with data is hard and repetitive. We envision a hub where everybody can upload data, and where useful operations like versioning, cleaning, transformation, mapping, linking, merging and hosting are done automagically on a central communication system (the bus), and then dispersed again in a decentralised network to the consumers and applications. On the databus, data flows from data producers through the platform to the consumers (left to right), while any errors or feedback flow in the opposite direction, reaching the data source to provide a continuous-integration service and improve the data at the source.

Open Data vs. Closed (paid) Data

We have studied the data network for 10 years now, and we conclude that organisations with open data are struggling to work together properly: they could and should, but are hindered by technical and organisational barriers. They duplicate work on the same data. On the other hand, companies selling data cannot do so in a scalable way. The loser is the consumer, with the choice of inferior open data or buying from a jungle-like market.
Publishing data on the databus

If you are grinding your teeth about how to publish data on the web, you can just use the databus to do so.
Re: [Wikidata] DBpedia Databus (alpha version)
Is this a question for Sebastian, or are you talking on behalf of the project?

Sent: Tuesday, May 08, 2018 at 5:10 PM From: "Thad Guidry" To: "Discussion list for the Wikidata project" Cc: "Laura Morales" Subject: Re: [Wikidata] DBpedia Databus (alpha version)

So basically... where you get "compute" heavy (querying SPARQL), you are going to charge fees for providing that compute-heavy query service; where you are not "compute" heavy (providing download bandwidth to get files), you are not going to charge fees. -Thad

___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] DBpedia Databus (alpha version)
> I was more expecting technical questions here, but it seems there is interest in how the economics work. However, this part is not easy to write for me.

I'd personally like to test a demo of the Databus. I'd also like to see a complete list of all the graphs that are available.
Re: [Wikidata] DBpedia Databus (alpha version)
You need my data to show me a demo? I don't understand... it doesn't make sense... Don't you think that people would rather not bother with your demo at all, instead of giving their data to you? You should have a public demo with a demo foaf as well, but anyway if you need my foaf file then you can use this:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/LM> a foaf:Person ;
    foaf:name "Laura" ;
    foaf:mbox <mailto:la...@example.org> ;
    foaf:homepage <http://example.org/LM> ;
    foaf:nick "Laura" .

Sent: Friday, May 18, 2018 at 12:04 AM From: "Sebastian Hellmann" To: "Discussion list for the Wikidata project" , "Laura Morales" Subject: Re: [Wikidata] DBpedia Databus (alpha version)

Hi Laura,

to see a small demo, we would need your data, either your foaf profile or other data, ideally publicly downloadable. Automatic upload is currently being implemented, but I can load it manually, or you can wait. At the moment you can see:

http://88.99.242.78:9009/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F4o4XK&p=http%3A%2F%2Fdbpedia.org%2Fontology%2FdeathDate&src=dnb.de
(a data entry where the English Wikipedia and Wikidata have more granular data than the Dutch and German national libraries)

http://88.99.242.78:9009/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2Fe6R5&p=http%3A%2F%2Fdbpedia.org%2Fontology%2FdeathDate&src=dnb.de
(a data entry where the German national library has the best value; the DNB value could actually be imported, although I am not sure if there is a difference between a source and a reference, i.e. DNB has this statement, but they don't have a reference themselves)
We also made an infobox mockup for the Eiffel Tower for our grant proposal, with a sync button next to the infobox property: https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSync#Prototype_with_more_focus

All the best, Sebastian

On 15.05.2018 06:35, Laura Morales wrote: I was more expecting technical questions here, but it seems there is interest in how the economics work. However, this part is not easy to write for me. I'd personally like to test a demo of the Databus. I'd also like to see a complete list of all the graphs that are available.

-- All the best, Sebastian Hellmann Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center at the Institute for Applied Informatics (InfAI) at Leipzig University Executive Director of the DBpedia Association Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, https://www.w3.org/community/ld4lt Homepage: http://aksw.org/SebastianHellmann Research Group: http://aksw.org
Re: [Wikidata] Wikidata HDT dump
> a new dump of Wikidata in HDT (with index) is available at http://www.rdfhdt.org/datasets/

Thank you very much! Keep it up! Out of curiosity, what computer did you use for this? IIRC it required >512GB of RAM to function.

> You will see how Wikidata has become huge compared to other datasets. It contains about twice the limit of 4B triples discussed above.

There is a 64-bit version of HDT that doesn't have this limitation of 4B triples.

> In this regard, what is in 2018 the most user-friendly way to use this format?

Speaking for myself at least, Fuseki with an HDT store. But I know there are also some CLI tools from the HDT folks.
Re: [Wikidata] Wikidata HDT dump
> 100 GB "with an optimized code" could be enough to produce an HDT like that.

The current software definitely cannot handle Wikidata with 100GB. It was tried before and it failed. I'm glad to see that new code will be released to handle large files. After skimming that paper, it looks like they split the RDF source into multiple files and "cat" them into a single HDT file. 100GB is still a pretty large footprint, but I'm glad that they're working on this. A 128GB server is *way* more affordable than one with 512GB or 1TB! I can't wait to try the new code myself.
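The "cat" approach works because HDT dictionaries are sorted, so two of them can be merged in a streaming fashion rather than rebuilt in one giant in-memory structure. A toy sketch of that merging idea (a hypothetical helper, not the actual HDTCat code):

```python
import heapq

def merge_dictionaries(dict_a, dict_b):
    """Merge two already-sorted term lists into one sorted, de-duplicated
    list. heapq.merge streams both inputs, so peak memory stays small
    even when the inputs are huge (e.g. read lazily from disk)."""
    merged = []
    for term in heapq.merge(dict_a, dict_b):
        if not merged or merged[-1] != term:  # drop duplicate terms
            merged.append(term)
    return merged

result = merge_dictionaries(["a", "c"], ["b", "c", "d"])
# result is ["a", "b", "c", "d"]: "c" appears once, order preserved
```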
Re: [Wikidata] Wikidata HDT dump
> You shouldn't have to keep anything in RAM to HDT-ize something, as you could make the dictionary by sorting on disk and also do the joins to look up everything against the dictionary by sorting.

Yes, but somebody has to write the code for it :) My understanding is that they keep everything in memory because it was simpler to develop. The problem is that graphs can become really huge, so this approach clearly doesn't scale well.
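The sort-based dictionary idea can be illustrated in miniature: collect all terms, sort them, and use each term's rank as its ID. A real implementation would run the sort externally on disk (merge sort over chunks) instead of in memory, but the principle is the same. This is a hypothetical sketch, not code from any HDT library:

```python
def build_dictionary(triples):
    """Map every distinct term to an integer ID via sorting (rank = ID),
    then re-encode the triples as integer tuples -- the core of an
    HDT-style dictionary. Sorting is the key: it needs no big hash table
    and can be pushed to disk when the data outgrows RAM."""
    terms = sorted({term for triple in triples for term in triple})
    ids = {term: i for i, term in enumerate(terms)}
    encoded = [tuple(ids[term] for term in triple) for triple in triples]
    return ids, encoded

triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:alice"),
]
ids, encoded = build_dictionary(triples)
# "ex:alice" sorts first, so it gets ID 0
```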