Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata
Is there any RDF dump available of OpenCorporates data? Or even any dump at all? Their licensing terms are ambiguous... They say it's released under ODbL, but if I want to use the data I have to ask permission and they will decide if I can use it for free or if I have to pay a fee :/ Sent: Wednesday, October 25, 2017 at 9:44 AM From: "Jakob Voß" To: wikidata@lists.wikimedia.org Subject: Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata Hi Luigi, I favour cooperation with OpenCorporates instead of independently adding lots of company records to Wikidata. Sure, there are parallel strategies, but any effort should also include OpenCorporates to some degree. OpenCorporates is licensed under ODbL (just added this referenced statement to Q7095760) and we have property P1320 to link Wikidata and OpenCorporates. A first step would be to align https://opencorporates.com/registers with https://en.wikipedia.org/wiki/List_of_company_registers Right now we have 18 instances of company register (Q1394657) and its subclasses explicitly classified as such in Wikidata. These items should be linked with the registers listed at OpenCorporates, e.g. UK Companies House (Q257303) = https://opencorporates.com/registers/270 I've also noticed that OpenCorporates has a field for "Identifiers" where Wikidata QIDs may be included to have two-way links between the two datasets. Anyway, better contact https://opencorporates.com/info/contributing at least to let them know about your plans.
Cheers, Jakob -- Jakob Voß Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de/ ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata
OK, just asked. Their reply was that they "reserves the right under paragraph 3.3 of ODbL to release the database under different terms", which is to say their data is NOT free because they want to control how and where the data is used. Are we starting to see "free vs open" all over again, this time with data instead of software? Sent: Wednesday, October 25, 2017 at 5:06 PM From: "Thad Guidry" To: "Discussion list for the Wikidata project." Subject: Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata Laura, Talk to OpenCorporates and ask those questions yourself. Get involved ! :) -Thad +ThadGuidry On Wed, Oct 25, 2017 at 3:22 AM Laura Morales <laure...@mail.com> wrote: Is there any RDF dump available of OpenCorporates data? Or even any dump at all? Their licensing terms are ambiguous... They say it's released under ODbL, but if I want to use the data I have to ask permission and they will decide if I can use it for free or if I have to pay a fee :/
[Wikidata] Wikidata HDT dump
Hello everyone, I'd like to ask if Wikidata could please offer an HDT [1] dump along with the already available Turtle dump [2]. HDT is a binary format for storing RDF data, which is pretty useful because it can be queried from the command line, it can be used as a Jena/Fuseki source, and it also uses orders of magnitude less space to store the same data. The problem is that it's very impractical to generate an HDT file, because the current implementation requires a lot of RAM to convert a file. For Wikidata it will probably require a machine with 100-200GB of RAM. This is unfeasible for me because I don't have such a machine, but if you guys have one to share, I can help set up the rdf2hdt software required to convert the Wikidata Turtle to HDT. Thank you. [1] http://www.rdfhdt.org/ [2] https://dumps.wikimedia.org/wikidatawiki/entities/
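For reference, the conversion step being requested looks roughly like this (a sketch assuming the hdt-cpp tools are built and enough RAM is available; flag names follow the rdfhdt/hdt-cpp tools, and the `-i` index flag is mentioned later in this thread — paths are examples):

```shell
# Decompress the weekly Turtle dump, keeping the original archive.
gunzip -k latest-all.ttl.gz

# Convert Turtle to HDT; -f names the input serialization,
# -i additionally writes the .hdt.index side file needed for querying.
rdf2hdt -f turtle -i latest-all.ttl wikidata.hdt
```

The resulting `wikidata.hdt` plus `wikidata.hdt.index` pair is what the thread later argues should both be published.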
Re: [Wikidata] Wikidata HDT dump
> Would it be an idea if HDT remains unfeasible to place the journal file of > blazegraph online? > Yes, people need to use blazegraph if they want to access the files and query > it but it could be an extra next to turtle dump? How would a blazegraph journal file be better than a Turtle dump? Maybe it's smaller in size? Simpler to use?
Re: [Wikidata] Wikidata HDT dump
> Dear Laura, others, > > If somebody points me to the RDF datadump of Wikidata I can deliver an > HDT version for it, no problem. (Given the current cost of memory I > do not believe that the memory consumption for HDT creation is a > blocker.) This would be awesome! Thanks Wouter. To the best of my knowledge, the most up to date dump is this one [1]. Let me know if you need any help with anything. Thank you again! [1] https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz --- Cheers, Wouter Beek. Email: wou...@triply.cc WWW: http://triply.cc Tel: +31647674624 On Fri, Oct 27, 2017 at 5:08 PM, Laura Morales wrote: > Hello everyone, > > I'd like to ask if Wikidata could please offer a HDT [1] dump along with the > already available Turtle dump [2]. HDT is a binary format to store RDF data, > which is pretty useful because it can be queried from command line, it can be > used as a Jena/Fuseki source, and it also uses orders-of-magnitude less space > to store the same data. The problem is that it's very impractical to generate > a HDT, because the current implementation requires a lot of RAM processing to > convert a file. For Wikidata it will probably require a machine with > 100-200GB of RAM. This is unfeasible for me because I don't have such a > machine, but if you guys have one to share, I can help setup the rdf2hdt > software required to convert Wikidata Turtle to HDT. > > Thank you. 
> > [1] http://www.rdfhdt.org/ > [2] https://dumps.wikimedia.org/wikidatawiki/entities/
Re: [Wikidata] Wikidata HDT dump
> You can mount the jnl file directly to blazegraph so loading and indexing is > not needed anymore. How much larger would this be compared to the Turtle file?
Re: [Wikidata] Wikidata HDT dump
> is it possible to store a weighted adjacency matrix as an HDT instead of an > RDF? > > Something like a list of entities for each entity, or even better a list of > tuples for each entity. > So that a tuple could be generalised with properties. Sorry, I don't know this, you would have to ask the devs. As far as I understand, it's a triplestore and that should be it...
Re: [Wikidata] Wikidata HDT dump
> Javier D. Fernández of the HDT team was very quick to fix the link :-) Their dump is almost 1 year old though.
Re: [Wikidata] Wikidata HDT dump
> The first part of the Turtle data stream seems to contain syntax errors for > some of the XSD decimal literals. The first one appears on line 13,291: > > Notice that scientific notation is not allowed in the lexical form of > decimals according to XML > Schema Part 2: Datatypes (https://www.w3.org/TR/xmlschema11-2/#decimal). (It is allowed in > floats and doubles.) Is this a known issue or should I report this somewhere? I wouldn't call these "syntax" errors, just "logical/type" errors. It would be great if these could be fixed by changing the type from decimal to float/double. On the other hand, I've never seen any medium or large dataset without this kind of error. So I would personally treat these as warnings at worst. @Wouter when you build the HDT file, could you please also generate the .hdt.index file? With rdf2hdt, this should be activated with the -i flag. Thank you again!
Re: [Wikidata] Wikidata HDT dump
> @Wouter: As Stas said, you might report that error. I don't agree with Laura > who tried to underestimate that "syntax error". It's also about quality ;) Don't get me wrong, I am all in favor of data quality! :) So if this can be fixed, it's better! The thing is, I've seen so many datasets with these kinds of type errors that by now I pretty much live with them and I'm OK with these warnings (the triple is not broken after all, it's just not following the standards). > @Laura: Do you have a different rdf2hdt program or the one in the GitHub of > HDT project ? I just use https://github.com/rdfhdt/hdt-cpp compiled from the master branch. To verify data instead, I use riot (a command-line tool from the Apache Jena package) like this: `riot --validate file.nt`.
Re: [Wikidata] Wikidata HDT dump
> Also, for avoiding your users to re-create the models, you can pre-load > "models" from LOV catalog. The LOV RDF dump is broken instead. Or at least it still was the last time I checked. And I don't mean broken in the sense of Wikidata, that is with some wrong types; I mean broken as in it doesn't validate at all (some triples are broken).
Re: [Wikidata] Wikidata HDT dump
> Thanks for reporting that. I remember one issue that I added here > https://github.com/pyvandenbussche/lov/issues/66 Yup, still broken! I've tried just now.
Re: [Wikidata] Wikidata HDT dump
> No, the idea is that each organization will have its own KNS, so users can > add the KNS that they want. How would this compare with a traditional SPARQL endpoint + "federated queries", or with "linked fragments"?
Re: [Wikidata] Wikidata HDT dump
> @Laura : you mean this list http://lov.okfn.org/lov.nq.gz ? > I can download it !! > > Which one ? Please send me the URL and I can fix it !! Yes you can download it, but the nq file is broken. It doesn't validate because some URIs contain white spaces, and some triples have an empty subject (i.e. <>).
Re: [Wikidata] Wikidata HDT dump
> KBox is an alternative to other existing architectures for publishing KBs such > as SPARQL endpoints (e.g. LDFragments, Virtuoso), and dump files. > I should add that you can do federated queries with KBox as easily as you > can do with SPARQL endpoints. OK, but I still fail to see the value of this. What's the reason why I'd want to use it rather than just start a Fuseki endpoint, or use linked fragments?
[Wikidata] How to get direct link to image
- wikidata entry: https://www.wikidata.org/wiki/Q161234 - "logo image" property pointing to: https://commons.wikimedia.org/wiki/File:0_A.D._logo.png However... that's an HTML page... How do I get a reference to the .png file? In this case https://upload.wikimedia.org/wikipedia/commons/1/1c/0_A.D._logo.png Thanks.
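For what it's worth, those `/1/1c/` path components are not arbitrary: Commons derives the upload path from the MD5 hash of the file name (first hex digit, then first two). A minimal sketch, assuming the standard hashed-upload-directory scheme (note that file names containing characters beyond spaces may additionally need percent-encoding):

```python
import hashlib

def commons_direct_url(filename: str) -> str:
    """Build the direct upload.wikimedia.org URL for a Commons file.

    Commons stores files under <h[0]>/<h[0:2]>/ where h is the MD5
    hex digest of the file name (spaces replaced by underscores).
    """
    name = filename.replace(" ", "_")
    h = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"https://upload.wikimedia.org/wikipedia/commons/{h[0]}/{h[:2]}/{name}"

print(commons_direct_url("0_A.D._logo.png"))
# → https://upload.wikimedia.org/wikipedia/commons/1/1c/0_A.D._logo.png
```

This matches the example URL above without any API round-trip.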
Re: [Wikidata] How to get direct link to image
> You can also use the Wikimedia Commons API made by Magnus: https://tools.wmflabs.org/magnus-toolserver/commonsapi.php > It will also give you metadata about the image (so you'll be able to cite > the author of the image when you reuse it). Is the same metadata also available in the Turtle/HDT dump?
Re: [Wikidata] Wikidata HDT dump
@Wouter > Thanks for the pointer! I'm downloading from > https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz now. Any luck so far?
Re: [Wikidata] Wikidata HDT dump
> @Laura: I suspect Wouter wants to know if he "ignores" the previous errors > and proposes a rather incomplete dump (just for you) or waits for Stas' > feedback. OK. I wonder though, if it would be possible to set up a regular HDT dump alongside the already regular dumps. Looking at the dumps page, https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a new dump is generated about once a week. So if an HDT dump could be added to the schedule, it should show up with the next dump and then so forth with the future dumps. Right now even the Turtle dump contains the bad triples, so adding an HDT file now would not introduce more inconsistencies. The problem will be fixed automatically with the future dumps once the Turtle is fixed (because the HDT is generated from the .ttl file anyway). > Btw why don't you use the oldest version in HDT website? 1. I have downloaded it and I'm trying to use it, but the HDT tools (e.g. query) require building an index before I can use the HDT file. I've tried to create the index, but I ran out of memory again (even though the index is smaller than the .hdt file itself). So any Wikidata dump should contain both the .hdt file and the .hdt.index file, unless there is another way to generate the index on commodity hardware. 2. Because it's 1 year old :)
Re: [Wikidata] Wikidata HDT dump
I feel like you are misrepresenting my request, and possibly trying to offend me as well. My "UC" as you call it, is simply that I would like to have a local copy of wikidata and query it using SPARQL. Everything that I've tried so far doesn't seem to work on commodity hardware since the database is so large. But HDT could work. So I asked if an HDT dump could, please, be added to the other dumps that are periodically generated by wikidata. I also told you already that *I AM* trying to use the 1 year old dump, but in order to use the HDT tools I'm told that I *MUST* generate some other index first, which unfortunately I can't generate for the same reasons that I can't convert the Turtle to HDT. So what I was trying to say is that if wikidata were to add any HDT dump, this dump should contain both the .hdt file and .hdt.index in order to be useful. That's about it, and it's not just about me. Anybody who wants to have a local copy of wikidata could benefit from this, since setting up a .hdt file seems much easier than a Turtle dump. And I don't understand why you're trying to blame me for this. If you are part of the wikidata dev team, I'd greatly appreciate a "can/can't" or "don't care" response rather than the passive-aggressive game that you displayed in your last email. > Let me try to understand ... > You are a "data consumer" with the following needs: > - Latest version of the data > - Quick access to the data > - You don't want to use the current ways to access the data by the > publisher (endpoint, ttl dumps, LDFragments) > However, you ask for a binary format (HDT), but you don't have enough memory > to set up your own environment/endpoint due to lack of memory. > For that reason, you are asking the publisher to support both .hdt and > .hdt.index files. > > Do you think there are many users with your current UC?
Re: [Wikidata] Wikidata HDT dump
> I've just loaded the provided hdt file on a big machine (32 GiB wasn't enough to build the index but ten times this is more than enough) Could you please share a bit about your setup? Do you have a machine with 320GB of RAM? Could you please also try to convert wikidata.ttl to hdt using "rdf2hdt"? I'd be interested to read your results on this too. Thank you! > I'll try to run a few queries to see how it behaves. I don't think there is a command-line tool to parse SPARQL queries, so you probably have to set up a Fuseki endpoint which uses HDT as a data source.
Re: [Wikidata] Wikidata HDT dump
> It's a machine with 378 GiB of RAM and 64 threads running Scientific > Linux 7.2, that we use mainly for benchmarks. > > Building the index was really all about memory because the CPUs have > actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared > to those of my regular workstation, which was unable to build it. If your regular workstation was using more CPU, I guess it was because of swapping. Thanks for the statistics; it means a "commodity" CPU could handle this fine, the bottleneck is RAM. I wonder how expensive it is to buy a machine like yours... it sounds like in the $30K-$50K range? > You're right. The limited query language of hdtSearch is closer to > grep than to SPARQL. > > Thank you for pointing out Fuseki, I'll have a look at it. I think a SPARQL command-line tool could exist, but AFAICT it doesn't exist (yet?). Anyway, I have already successfully set up Fuseki with an HDT backend, although my HDT files are all small. Feel free to drop me an email if you need any help setting up Fuseki.
Re: [Wikidata] Wikidata HDT dump
> I am currently downloading the latest ttl file. On a 250gig ram machine. I > will see if that is sufficient to run the conversion. Otherwise we have > another busy one with around 310 gig. Thank you! > For querying I use the Jena query engine. I have created a module called > HDTQuery located at http://download.systemsbiology.nl/sapp/ which is a simple > program under development that should be able to use the full power of > SPARQL and be more advanced than grep… ;) Does this tool allow querying HDT files from the command line, with SPARQL, and without the need to set up a Fuseki endpoint? > If this all works out I will see with our department if we can set up, if it > is still needed, a weekly cron job to convert the TTL file. But as it is > growing rapidly we might run into memory issues later? Thank you!
Re: [Wikidata] Wikidata HDT dump
> Please take me out from these conversations. Sorry for the long thread, this is probably a small inconvenience with mailing lists. However the "Subject" is always the same, so you can delete messages right away without having to read them.
Re: [Wikidata] Wikidata HDT dump
> There is also a command line tool called hdtsparql in the hdt-java distribution that allows exactly this. It used to support only SELECT queries, but I've enhanced it to support CONSTRUCT, DESCRIBE and ASK queries too. There are some limitations, for example only CSV output is supported for SELECT and N-Triples for CONSTRUCT and DESCRIBE. Thank you for sharing. > The tool is in the hdt-jena package (not hdt-java-cli where the other command line tools reside), since it uses parts of Jena (e.g. ARQ). > There is a wrapper script called hdtsparql.sh for executing it with the proper Java environment. Does this tool work nicely with large HDT files such as wikidata? Or does it need to load the whole graph+index into memory?
Re: [Wikidata] Wikidata HDT dump
Hello list, a very kind person from this list has generated the .hdt.index file for me, using the 1-year old wikidata HDT file available at the rdfhdt website. So I was finally able to set up a working local endpoint using HDT+Fuseki. Setup was easy, launch time (for Fuseki) was also quick (a few seconds); the only change I made was to replace -Xmx1024m with -Xmx4g in the Fuseki startup script (btw I'm not very proficient in Java, so I hope this is the correct way). I've run some queries too. Simple select or traversal queries seem fast to me (I haven't measured them but the response is almost immediate); other queries such as "select distinct ?class where { [] a ?class }" take several seconds or a few minutes to complete, which kinda tells me the HDT indexes don't work well on all queries. But otherwise, for simple queries it works perfectly! At least I'm able to query the dataset! In conclusion, I think this more or less gives some positive feedback for using HDT on a "commodity computer", which means it can be very useful for people like me who want to use the dataset locally but who can't set up a full-blown server. If others want to try as well, they can offer more (hopefully positive) feedback. For all of this, I wholeheartedly plead with any wikidata dev to please consider scheduling an HDT dump (.hdt + .hdt.index) along with the other regular dumps that it creates weekly. Thank you!!
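For anyone wanting to reproduce this HDT+Fuseki setup, the wiring is done through a Fuseki assembler file. The sketch below is based on the hdt-jena integration; the hdt: vocabulary, service name, and file paths are assumptions to adapt to your installation, not verbatim config:

```turtle
# Fuseki assembler sketch: serve an HDT file read-only over SPARQL.
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix hdt:    <http://www.rdfhdt.org/fuseki#> .

<#service> a fuseki:Service ;
    fuseki:name         "wikidata" ;   # endpoint: /wikidata/sparql
    fuseki:serviceQuery "sparql" ;
    fuseki:dataset      <#dataset> .

<#dataset> a ja:RDFDataset ;
    ja:defaultGraph <#hdtGraph> .

<#hdtGraph> a hdt:HDTGraph ;
    hdt:fileName "/data/wikidata.hdt" .  # expects wikidata.hdt.index alongside
```

The graph is backed directly by the .hdt file on disk, which is why only a modest -Xmx heap is needed on top of it.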
Re: [Wikidata] Wikidata HDT dump
> Thank you for this feedback, Laura. > Is the hdt index you got available somewhere on the cloud? Unfortunately it's not. It was a private link that was temporarily shared with me by email. I guess I could re-upload the file somewhere else myself, but my uplink is really slow (1Mbps).
Re: [Wikidata] Wikidata HDT dump
> I’ve created a Phabricator task (https://phabricator.wikimedia.org/T179681) > for providing a HDT dump, let’s see if someone else (ideally from the ops > team) responds to it. (I’m not familiar with the systems we currently use for > the dumps, so I can’t say if they have enough resources for this.) Thank you Lucas!
Re: [Wikidata] Wikidata HDT dump
How many triples does wikidata have? The old dump from rdfhdt seems to have about 2 billion, which means wikidata doubled the number of triples in less than a year? Sent: Tuesday, November 07, 2017 at 3:24 PM From: "Jérémie Roquet" To: "Discussion list for the Wikidata project." Subject: Re: [Wikidata] Wikidata HDT dump Hi everyone, I'm afraid the current implementation of HDT is not ready to handle more than 4 billion triples as it is limited to 32-bit indexes. I've opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135 Until this is addressed, don't waste your time trying to convert the entire Wikidata to HDT: it can't work. -- Jérémie
Re: [Wikidata] Wikidata HDT dump
> drops `a wikibase:Item` and `a wikibase:Statement` types Off topic but... why drop `a wikibase:Item`? Without this it seems impossible to retrieve a list of items.
[Wikidata] Wikipedia page from wikidata ID
How can I get the Wikipedia URL of a wikibase:Item ID? Searching online I could only find how to do this using the MediaWiki API, but I was wondering if I can extract/generate URLs from the wikidata graph itself. Thanks.
Re: [Wikidata] Wikipedia page from wikidata ID
> schema:about connects Wikidata items with Wikipedias, e.g., > > Wikidata Query Service: "SELECT * WHERE { ?page schema:about wd:Q80 }" > > The triple is also available directly from the MediaWiki entity: > > https://www.wikidata.org/entity/Q80.nt Thank you! I was looking for "outgoing" links from a wikidata item to their corresponding page, but if I understand correctly the links point the other way around (from a schema:Article to a wikibase:Item). I think I've got this. Thanks.
Re: [Wikidata] Wikipedia page from wikidata ID
> I am not sure where you are trying to do this and how but > https://www.wikidata.org/wiki/Special:GoToLinkedPage > might be useful. You can call it with an item ID and a wiki code in the URL > and it will redirect you to the article on that wiki. Thanks Lydia. I was trying to retrieve the wikipedia page from the RDF dump. "schema:about" seems to be the right property.
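Putting the answers in this sub-thread together, the usual pattern over the dump or the Query Service is to combine schema:about with schema:isPartOf to pick out one wiki (Q42 here is just an example item):

```sparql
# English Wikipedia article for Douglas Adams (wd:Q42);
# schema:isPartOf narrows the sitelinks to a single wiki.
PREFIX wd:     <http://www.wikidata.org/entity/>
PREFIX schema: <http://schema.org/>

SELECT ?article WHERE {
  ?article schema:about wd:Q42 ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
```

Without the schema:isPartOf triple, the query returns one sitelink per language edition.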
[Wikidata] RDF: All vs Truthy
Can somebody please explain (in simple terms) the difference between the "all" and "truthy" RDF dumps? I've read the explanation available on the wiki [1] but I still don't get it. If I'm just a user of the data, because I want to retrieve information about a particular item and link items with other graphs... what am I missing/leaving out by using "truthy" instead of "all"? A practical example would be appreciated since it would clarify things, I suppose. [1] https://www.wikidata.org/wiki/Wikidata:Database_download#RDF_dumps
Re: [Wikidata] RDF: All vs Truthy
> If you want to know when, why, where, etc, you have to > check the qualified "full" statements. All these qualifiers are encoded as additional triples in "all", correct?
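For a concrete picture of the two dump flavours: "truthy" contains one direct wdt: triple per best-rank statement, while "all" additionally materializes each statement as a node carrying its qualifiers and references. A sketch in the Wikidata RDF vocabulary (the statement id and the end-time qualifier value here are illustrative, not taken from the real dump):

```turtle
# Truthy dump: a single direct triple.
wd:Q42 wdt:P69 wd:Q691283 .            # Douglas Adams, educated at: St John's College

# "All" dump: the same claim as a statement node with a qualifier.
wd:Q42 p:P69 wds:Q42-abc123 .          # statement id is illustrative
wds:Q42-abc123 ps:P69 wd:Q691283 ;
               pq:P582 "1974-01-01T00:00:00Z"^^xsd:dateTime .  # end time qualifier
```

So the wdt: triple answers "where was he educated", while only the p:/ps:/pq: triples can answer "until when".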
Re: [Wikidata] Wikidata HDT dump
* T H A N K Y O U * > On 7 Nov I created an HDT file based on the then current download link > from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz Thank you very very much Wouter!! This is great! Out of curiosity, could you please share some info about the machine that you've used to generate these files? In particular I mean hardware info, such as the model names of mobo/cpu/ram/disks. Also "how long" it took to generate these files would be interesting information. > PS: If this resource turns out to be useful to the community we can > offer an updated HDT file at a to be determined interval. This would be fantastic! Wikidata dumps about once a week, so I think even a new HDT file every 1-2 months would be awesome. Related to this however... why not use the Laundromat for this? There are several datasets that are very large, and rdf2hdt is really expensive to run. Maybe you could schedule regular jobs for several graphs (wikidata, dbpedia, wordnet, linkedgeodata, government data, ...) and make them available at the Laundromat? * T H A N K Y O U *
Re: [Wikidata] DBpedia Databus (alpha version)
I don't understand, is this just another project built on DBPedia, or a project to replace DBPedia entirely? Are you a DBPedia maintainer? Sent: Tuesday, May 08, 2018 at 1:29 PM From: "Sebastian Hellmann" To: "Discussion list for the Wikidata project." Subject: [Wikidata] DBpedia Databus (alpha version) DBpedia Databus (alpha version) The DBpedia Databus is a platform that allows multiple stakeholders to exchange, curate and access data. Any data entering the bus will be versioned, cleaned, mapped, linked and its licenses and provenance tracked. Hosting in multiple formats will be provided to access the data either as a dump download or as an API. Data governance stays with the data contributors. Vision Working with data is hard and repetitive. We envision a hub, where everybody can upload data and then useful operations like versioning, cleaning, transformation, mapping, linking, merging, hosting are done automagically on a central communication system (the bus) and then dispersed again in a decentral network to the consumers and applications. On the databus, data flows from data producers through the platform to the consumers (left to right); any errors or feedback flows in the opposite direction and reaches the data source to provide a continuous integration service and improve the data at the source. Open Data vs. Closed (paid) Data We have studied the data network for 10 years now and we conclude that organisations with open data are struggling to work together properly, although they could and should, but are hindered by technical and organisational barriers. They duplicate work on the same data. On the other hand, companies selling data cannot do so in a scalable way. The loser is the consumer, with the choice of inferior open data or buying from a jungle-like market. Publishing data on the databus If you are grinding your teeth about how to publish data on the web, you can just use the databus to do so.
Data loaded on the bus will be highly visible, available and queryable. You should think of it as a service: Visibility guarantees that your citations and reputation go up. Besides a web download, we can also provide a Linked Data interface, SPARQL endpoint, Lookup (autocomplete) or many other means of availability (like AWS or Docker images). Any distribution we are doing will funnel feedback and collaboration opportunities your way to improve your dataset and your internal data quality. You will receive an enriched dataset, which is connected and complemented with any other available data (see the same folder names in data and fusion folders). Data Sellers If you are selling data, the databus provides numerous opportunities for you. You can link your offering to the open entities in the databus. This allows consumers to discover your services better by showing them with each request. Data Consumers Open data on the databus will be a commodity. We are greatly lowering the cost of understanding the data, retrieving and reformatting it. We are constantly extending ways of using the data and are willing to implement any formats and APIs you need. If you are lacking a certain kind of data, we can also scout for it and load it onto the databus.
How the Databus works at the moment We are still in an initial state, but we already load 10 datasets (6 from DBpedia, 4 external) onto the bus using these phases: Acquisition: data is downloaded from the source and logged in. Conversion: data is converted to N-Triples and cleaned (syntax parsing, datatype validation and SHACL). Mapping: the vocabulary is mapped onto the DBpedia Ontology and converted (we have been doing this for Wikipedia's Infoboxes and Wikidata, but now we do it for other datasets as well). Linking: links are mainly collected from the sources, cleaned and enriched. IDying: all entities found are given a new Databus ID for tracking. Clustering: IDs are merged into clusters using one of the Databus IDs as cluster representative. Data Comparison: each dataset is compared with all other datasets. We have an algorithm that decides on the best value, but the main goal here is transparency, i.e. to see which data value was chosen and how it compares to the other sources. A main knowledge graph fused from all the sources, i.e. a transparent aggregate. For each source, we are producing a local fused version called the "Databus Complement". This is a major feedback mechanism for all data providers, where they can see what data they are missing, what data differs in other sources and what links are available for their IDs. You can compare all data via a webservice (early prototype, just works for Eiffel Tower): http://88.99.242.78:9000/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F12HpzV&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Farchitect&src=general
Re: [Wikidata] DBpedia Databus (alpha version)
So, in short, DBPedia is turning into a business with a "community edition + enterprise edition" kind of model?

Sent: Tuesday, May 08, 2018 at 2:29 PM From: "Sebastian Hellmann" To: "Discussion list for the Wikidata project" , "Laura Morales" Subject: Re: [Wikidata] DBpedia Databus (alpha version)

Hi Laura,

> I don't understand, is this just another project built on DBPedia, or a project to replace DBPedia entirely?

A valid question. DBpedia is quite decentralised and hard to understand in its entirety. So actually some parts are improved and others will be replaced eventually (also an improvement, hopefully). The main improvement here is that we no longer have large monolithic releases that take forever. Especially the language chapters, and also the professional community, can work better with the "platform" in terms of turnaround, effective contribution and incentives for contribution. Another thing that will hopefully improve is that we can more sustainably maintain contributions and add-ons, which were formerly lost between releases. So the structure and processes will be clearer.

The DBpedia in the "main endpoint" will still be there, in the same way that nl.dbpedia.org/sparql or wikidata.dbpedia.org/sparql are there. The new hosted service will be more a knowledge graph of knowledge graphs, where you can either get all information in a fused way or quickly jump to the sources, compare them and make improvements there. Projects and organisations can also upload their data to query it there themselves, or share it with others and persist it. Companies can sell or advertise their data. The core consists of the Wikipedia/Wikidata data, and we hope to be able to improve it and also send contributors and contributions back to the Wikiverse.

> Are you a DBPedia maintainer?

Yes, I took it as my task to talk to everybody in the community over the last year and to draft/aggregate the new strategy and innovate.
All the best, Sebastian

On 08.05.2018 13:42, Laura Morales wrote: I don't understand, is this just another project built on DBPedia, or a project to replace DBPedia entirely? Are you a DBPedia maintainer?

Sent: Tuesday, May 08, 2018 at 1:29 PM From: "Sebastian Hellmann" To: "Discussion list for the Wikidata project." Subject: [Wikidata] DBpedia Databus (alpha version)

DBpedia Databus (alpha version)

The DBpedia Databus is a platform that allows multiple stakeholders to exchange, curate and access data. Any data entering the bus will be versioned, cleaned, mapped, linked, and its licenses and provenance tracked. Hosting in multiple formats will be provided to access the data either as a dump download or as an API. Data governance stays with the data contributors.

Vision

Working with data is hard and repetitive. We envision a hub where everybody can upload data, and where useful operations like versioning, cleaning, transformation, mapping, linking, merging and hosting are done automagically on a central communication system (the bus), and then dispersed again in a decentralised network to the consumers and applications. On the databus, data flows from data producers through the platform to the consumers (left to right), while any errors or feedback flow in the opposite direction, reaching the data source to provide a continuous-integration service and improve the data at the source.

Open Data vs. Closed (paid) Data

We have studied the data network for 10 years now, and we conclude that organisations with open data are struggling to work together properly: they could and should, but are hindered by technical and organisational barriers. They duplicate work on the same data. On the other hand, companies selling data cannot do so in a scalable way. The loser is the consumer, with the choice of inferior open data or buying from a jungle-like market.
Publishing data on the databus

If you are grinding your teeth about how to publish data on the web, you can just use the databus to do so.
Re: [Wikidata] DBpedia Databus (alpha version)
Is this a question for Sebastian, or are you talking on behalf of the project?

Sent: Tuesday, May 08, 2018 at 5:10 PM From: "Thad Guidry" To: "Discussion list for the Wikidata project" Cc: "Laura Morales" Subject: Re: [Wikidata] DBpedia Databus (alpha version)

So basically... where you get "compute" heavy (querying SPARQL), you are going to charge fees for providing that compute-heavy query service; where you are not "compute" heavy (providing download bandwidth to get files), you are not going to charge fees. -Thad

___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] DBpedia Databus (alpha version)
> I was more expecting technical questions here, but it seems there is interest in how the economics work. However, this part is not easy to write for me.

I'd personally like to test a demo of the Databus. I'd also like to see a complete list of all the graphs that are available.
Re: [Wikidata] DBpedia Databus (alpha version)
You need my data to show me a demo? I don't understand... it doesn't make sense... Don't you think that people would rather not bother with your demo at all, instead of giving their data to you? You should have a public demo with a demo foaf as well, but anyway if you need my foaf file then you can use this:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/LM> a foaf:Person ;
    foaf:name "Laura" ;
    foaf:mbox <mailto:la...@example.org> ;
    foaf:homepage <http://example.org/LM> ;
    foaf:nick "Laura" .

Sent: Friday, May 18, 2018 at 12:04 AM From: "Sebastian Hellmann" To: "Discussion list for the Wikidata project" , "Laura Morales" Subject: Re: [Wikidata] DBpedia Databus (alpha version)

Hi Laura,

to see a small demo, we would need your data, either your foaf profile or other data, ideally publicly downloadable. Automatic upload is currently being implemented, but I can load it manually, or you can wait. At the moment you can see:

http://88.99.242.78:9009/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F4o4XK&p=http%3A%2F%2Fdbpedia.org%2Fontology%2FdeathDate&src=dnb.de
(a data entry where the English Wikipedia and Wikidata have more granular data than the Dutch and German national libraries)

http://88.99.242.78:9009/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2Fe6R5&p=http%3A%2F%2Fdbpedia.org%2Fontology%2FdeathDate&src=dnb.de
(a data entry where the German national library has the best value; the DNB value could actually be imported, although I am not sure if there is a difference between a source and a reference, i.e. DNB has this statement, but they don't have a reference themselves)
We also made an infobox mockup for the Eiffel Tower for our grant proposal, with a sync button next to the infobox property: https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSync#Prototype_with_more_focus

All the best, Sebastian

On 15.05.2018 06:35, Laura Morales wrote: I was more expecting technical questions here, but it seems there is interest in how the economics work. However, this part is not easy to write for me. I'd personally like to test a demo of the Databus. I'd also like to see a complete list of all the graphs that are available.

-- All the best, Sebastian Hellmann Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center at the Institute for Applied Informatics (InfAI) at Leipzig University Executive Director of the DBpedia Association Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, https://www.w3.org/community/ld4lt Homepage: http://aksw.org/SebastianHellmann Research Group: http://aksw.org
Re: [Wikidata] Wikidata HDT dump
> a new dump of Wikidata in HDT (with index) is available at http://www.rdfhdt.org/datasets/

Thank you very much! Keep it up! Out of curiosity, what computer did you use for this? IIRC it required >512GB of RAM to function.

> You will see how Wikidata has become huge compared to other datasets. It contains about twice the limit of 4B triples discussed above.

There is a 64-bit version of HDT that doesn't have this limitation of 4B triples.

> In this regard, what is in 2018 the most user-friendly way to use this format?

Speaking for myself at least, Fuseki with an HDT store. But I know there are also some CLI tools from the HDT folks.
Re: [Wikidata] Wikidata HDT dump
> 100 GB "with an optimized code" could be enough to produce an HDT like that.

The current software definitely cannot handle Wikidata with 100GB. It was tried before and it failed. I'm glad to see that new code will be released to handle large files. After skimming that paper, it looks like they split the RDF source into multiple files and "cat" them into a single HDT file. 100GB is still a pretty large footprint, but I'm glad that they're working on this. A 128GB server is *way* more affordable than one with 512GB or 1TB! I can't wait to try the new code myself.
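The "cat" approach works because HDT dictionaries are sorted, so two of them can be merged in a streaming fashion rather than rebuilt in one giant in-memory structure. A toy sketch of that merging idea (a hypothetical helper, not the actual HDTCat code):

```python
import heapq

def merge_dictionaries(dict_a, dict_b):
    """Merge two already-sorted term lists into one sorted, de-duplicated
    list. heapq.merge streams both inputs, so peak memory stays small
    even when the inputs are huge (e.g. read lazily from disk)."""
    merged = []
    for term in heapq.merge(dict_a, dict_b):
        if not merged or merged[-1] != term:  # drop duplicate terms
            merged.append(term)
    return merged

result = merge_dictionaries(["a", "c"], ["b", "c", "d"])
# result is ["a", "b", "c", "d"]: "c" appears once, order preserved
```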
Re: [Wikidata] Wikidata HDT dump
> You shouldn't have to keep anything in RAM to HDT-ize something, as you could make the dictionary by sorting on disk and also do the joins to look up everything against the dictionary by sorting.

Yes, but somebody has to write the code for it :) My understanding is that they keep everything in memory because it was simpler to develop. The problem is that graphs can become really huge, so this approach clearly doesn't scale well.
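The sort-based dictionary idea can be illustrated in miniature: collect all terms, sort them, and use each term's rank as its ID. A real implementation would run the sort externally on disk (merge sort over chunks) instead of in memory, but the principle is the same. This is a hypothetical sketch, not code from any HDT library:

```python
def build_dictionary(triples):
    """Map every distinct term to an integer ID via sorting (rank = ID),
    then re-encode the triples as integer tuples -- the core of an
    HDT-style dictionary. Sorting is the key: it needs no big hash table
    and can be pushed to disk when the data outgrows RAM."""
    terms = sorted({term for triple in triples for term in triple})
    ids = {term: i for i, term in enumerate(terms)}
    encoded = [tuple(ids[term] for term in triple) for triple in triples]
    return ids, encoded

triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:alice"),
]
ids, encoded = build_dictionary(triples)
# "ex:alice" sorts first, so it gets ID 0
```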