[Wikidata-bugs] [Maniphest] [Commented On] T179681: Add HDT dump of Wikidata

2017-11-07 Thread Arkanosis
Arkanosis added a comment.
I'm afraid the current implementation of HDT is not ready to handle more than 4 billion triples, as it is limited to 32-bit indexes. I've opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135

Until this is addressed, don't waste your time trying to convert the entire Wikidata dataset to HDT: it can't work.
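
For reference, a 32-bit index tops out just below 4.3 billion entries, which is right around where the full Wikidata dump already sits:

$ echo $((2**32))
4294967296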


[Wikidata-bugs] [Maniphest] [Commented On] T179681: Add HDT dump of Wikidata

2017-11-06 Thread Arkanosis
Arkanosis added a comment.

In T179681#3736916, @Lucas_Werkmeister_WMDE wrote:
I ran the conversion directly from the ttl.gz file

Interesting, I couldn’t get that to work and had to pipe gunzip output into the program.


Interesting, indeed… Could it be that you added the -f ttl flag afterwards? I couldn't get it to accept a gzip file as input without this flag (I assume it does file format detection based on the file extension).
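
Concretely, this is the shape of invocation that accepted the gzip input for me (file names are only examples):

$ rdf2hdt -f ttl wikidata-20171101-all.ttl.gz wikidata-20171101-all.hdt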

Also, I had to install zlib-devel to get rdfhdt to compile on a CentOS 6 container; there may be some zlib-related packaging difference between Debian and RedHat here.
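
For anyone reproducing this, the CentOS 6 side was roughly the following; a sketch only, since the exact autotools steps may vary with the checkout:

$ sudo yum install -y zlib-devel
$ ./autogen.sh && ./configure && make && sudo make install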

I also tried converting the latest dump, and since I don’t have access to any system with that much RAM, I thought I could perhaps trade some execution time for swap space. Bad idea :) the process got through 20% of the input file and then slowed to a crawl, at data rates of single-digit kilobytes per second. It would’ve taken half a year to finish at that rate.

Thanks for testing! That would have required a hell of a lot of swap space anyway. Easy to set up for whoever does this on a regular basis, but for casual needs, I've never seen a machine with 200+ GiB of swap space.
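
For reference, adding a big swap file on Linux is only a few commands (the size here is purely illustrative):

$ sudo fallocate -l 200G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile

The hard part isn't setting it up; it's the throughput once the working set spills into it, as your test shows.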

But FWIW, here’s the command I used, with a healthy dose of systemd sandboxing since it’s a completely unknown program I’m running:
 

Thanks for sharing the sandboxing bits! :-)


[Wikidata-bugs] [Maniphest] [Commented On] T179681: Add HDT dump of Wikidata

2017-11-06 Thread Addshore
Addshore added a comment.

In T179681#3736044, @Addshore wrote:
@Smalyshev we discussed dumping the JNL files used by Blazegraph directly at points during WikidataCon.
 I'm aware that isn't an HDT dump, but I'm wondering if this would help in any way.


Can we reliably get a consistent snapshot of those files when BlazeGraph is constantly writing updates to them?

It would probably be easy enough to pause updating on a host, turn off Blazegraph, rsync the file to somewhere, and turn Blazegraph and the updater back on.
The real question is whether this would be useful for people.
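
Roughly what I have in mind, where the service names and paths are assumptions about how the WDQS hosts are set up:

sudo systemctl stop wdqs-updater        # stop applying new edits
sudo systemctl stop wdqs-blazegraph     # make sure the journal is quiescent
rsync -av /srv/wdqs/wikidata.jnl dumps-host:/srv/dumps/wikidata-$(date +%Y%m%d).jnl
sudo systemctl start wdqs-blazegraph
sudo systemctl start wdqs-updater       # catches back up on the missed edits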

With the Docker images that I have created, it would mean that on a Docker host (or in a Docker container) with enough resources, people could spin up a matching version of Blazegraph and query the data with no timeout.
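
Something along these lines, where the image name, tag and mount point are placeholders rather than anything published:

docker run -d --name wdqs-local \
  -p 9999:9999 \
  -v /srv/dumps/wikidata-20171101.jnl:/wdqs/wikidata.jnl \
  some-wdqs-image:matching-version

and then point SPARQL queries at localhost:9999 with whatever timeout (or none) suits the hardware.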


[Wikidata-bugs] [Maniphest] [Commented On] T179681: Add HDT dump of Wikidata

2017-11-06 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment.
I ran the conversion directly from the ttl.gz file

Interesting, I couldn’t get that to work and had to pipe gunzip output into the program.

I also tried converting the latest dump, and since I don’t have access to any system with that much RAM, I thought I could perhaps trade some execution time for swap space. Bad idea :) the process got through 20% of the input file and then slowed to a crawl, at data rates of single-digit kilobytes per second. It would’ve taken half a year to finish at that rate.

But FWIW, here’s the command I used, with a healthy dose of systemd sandboxing since it’s a completely unknown program I’m running:

time pv latest-all.ttl.gz |
gunzip |
sudo systemd-run --wait --pipe --unit rdf2hdt \
-p CapabilityBoundingSet=CAP_DAC_OVERRIDE \
-p ProtectSystem=strict -p PrivateNetwork=yes -p ProtectHome=yes -p PrivateDevices=yes \
-p ProtectKernelTunables=yes -p ProtectControlGroups=yes \
-p NoNewPrivileges=yes -p RestrictNamespaces=yes \
-p MemoryAccounting=yes -p CPUAccounting=yes -p BlockIOAccounting=yes -p IOAccounting=yes -p TasksAccounting=yes \
/usr/local/bin/rdf2hdt -i -f ttl -B 'http://wikiba.se/ontology-beta#Dump' /dev/stdin /dev/stdout \
>| wikidata-2017-11-01.hdt

I had to make install the program because the libtoolized dev build doesn’t really support being run like that. (See systemd/systemd#7254 for the CapabilityBoundingSet part – knowing what I know now, -p $USER would’ve been the better choice.)


In T179681#3736044, @Addshore wrote:
@Smalyshev we discussed dumping the JNL files used by Blazegraph directly at points during WikidataCon.
 I'm aware that isn't an HDT dump, but I'm wondering if this would help in any way.


Can we reliably get a consistent snapshot of those files when BlazeGraph is constantly writing updates to them?


[Wikidata-bugs] [Maniphest] [Commented On] T179681: Add HDT dump of Wikidata

2017-11-05 Thread Addshore
Addshore added a comment.
@Smalyshev we discussed dumping the JNL files used by Blazegraph directly at points during WikidataCon.
I'm aware that isn't an HDT dump, but I'm wondering if this would help in any way.


[Wikidata-bugs] [Maniphest] [Commented On] T179681: Add HDT dump of Wikidata

2017-11-05 Thread Arkanosis
Arkanosis added a comment.
FWIW, I've just tried to convert the ttl dump of the 1st of November 2017 on a machine with 378 GiB of RAM and 0 GiB of swap and… well… it failed with std::bad_alloc after more than 21 hours of runtime. Granted, there was another process eating ~100 GiB of memory, but I thought it would be okay; I was proved wrong.

Being optimistic, I ran the conversion directly from the ttl.gz file, which may have prevented some memory-mapping optimization, and I also added the -i flag to generate the index at the same time. I'll re-run the conversion without these in the hope of finally getting the hdt file.
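
Concretely, the re-run should look something like this (not started yet; same flags as before, minus -i, and reading the uncompressed file):

$ gunzip -k wikidata-20171101-all.ttl.gz
$ /usr/bin/time -v rdf2hdt -f ttl -p wikidata-20171101-all.ttl wikidata-20171101-all.hdt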

So, here are the statistics I got:

$ /usr/bin/time -v rdf2hdt -f ttl -i -p wikidata-20171101-all.ttl.gz  wikidata-20171101-all.hdt
Catch exception load: std::bad_alloc
ERROR: std::bad_alloc
Command exited with non-zero status 1
Command being timed: "rdf2hdt -f ttl -i -p wikidata-20171101-all.ttl.gz wikidata-20171101-all.hdt"
User time (seconds): 64999.77
System time (seconds): 10906.79
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 21:13:25
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 200475524
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 703
Minor (reclaiming a frame) page faults: 8821385485
Voluntary context switches: 36774
Involuntary context switches: 4514261
Swaps: 0
File system inputs: 81915000
File system outputs: 2767696
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 1
/usr/bin/time -v rdf2hdt -f ttl -i -p wikidata-20171101-all.ttl.gz   64999,77s user 10906,80s system 99% cpu 21:13:25,50 total

NB: the exceptionally long runtime is the result of the conversion being single-threaded, while the machine has a lot of threads but relatively low per-thread performance (2.3 GHz). The process wasn't under memory pressure until it crashed (no swap anyway) and wasn't waiting much for I/O, so it was all CPU-bound.
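
For context, the peak RSS reported by time above works out to roughly 191 GiB:

$ echo $((200475524 / 1024 / 1024))   # kbytes -> GiB
191

i.e. the process had already consumed about half of the machine's RAM when the allocation failed.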


[Wikidata-bugs] [Maniphest] [Commented On] T179681: Add HDT dump of Wikidata

2017-11-03 Thread Smalyshev
Smalyshev added a comment.
it doesn’t even open the output file until it’s done converting

That might be a problem when we have 4bn triples... I think "load the whole thing in memory" is a doomed approach - even if we find a way to get past the memory limits for the current dump, what will happen when it doubles in size?

The idea that you need to keep everything in memory to compress/optimize is of course not true - you can still do pretty fine with disk-based storage; that's what Blazegraph does, for example, and probably nearly every other graph DB. Yes, it would be a bit slower and require some careful programming, but it's not something that should be impossible. Unfortunately, https://github.com/rdfhdt/hdt-cpp/issues/119 sounds like the people behind HDT are not interested in doing this work. Without it, the idea of converting the Wikidata data set is a no-go, unfortunately - I do not see how the Wikidata data set can be served with a "load up everything in memory" paradigm. If we find somebody who wants to / can do the work that allows HDT to process large datasets, then I think it is a good idea to have it in the dumps, but not before that.
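
Just as an illustration of the general point (an analogy, not how HDT is actually structured internally): classic external algorithms deal with data much larger than RAM by spilling to temporary files, e.g. deduplicating a multi-hundred-GiB triple stream with a bounded sort buffer (file names are placeholders):

zcat triples.nt.gz | LC_ALL=C sort -S 8G -T /mnt/scratch -u > triples-sorted.nt

The dictionary and index construction could in principle be done the same way; it just has to be written with disk in mind from the start.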


[Wikidata-bugs] [Maniphest] [Commented On] T179681: Add HDT dump of Wikidata

2017-11-03 Thread Ladsgroup
Ladsgroup added a comment.
https://gist.github.com/lucaswerkmeister/351ad0ffac1191658e39975063b8c19a