Lucas_Werkmeister_WMDE updated the task description.

CHANGES TO TASK DESCRIPTION
...
There is an `rdf2hdt` tool ([link](https://github.com/rdfhdt/hdt-cpp/tree/develop/libhdt); LGPLv2.1+) that can convert TTL dumps to HDT files. Unfortunately, it doesn’t run in a streaming fashion (it doesn’t even open the output file until it’s done converting) and seems to require almost as much memory as the uncompressed TTL dump to run. I tried to run it on the latest Wikidata dump, but the program was OOM-killed after it had consumed 2.32 GiB of the gzipped input dump (according to `pv`), which corresponds to 15.63 GiB of uncompressed input data; the last `VmSize` before it was killed was 13.04 GiB. Since memory usage roughly tracked the amount of uncompressed input read so far (about 13 GiB used for roughly 16 GiB read), and the full uncompressed TTL dump is 187 GiB (201 GB), it looks like we would need a machine with at least ~200 GB of memory to do the conversion. ~~(Perhaps we could get away with using lots of swap space instead of actual RAM – I have no idea what kind of memory access patterns the tool has.)~~
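
For reference, here is a back-of-envelope sketch of the extrapolation behind that estimate. It assumes `rdf2hdt`’s memory usage keeps growing roughly linearly with the amount of uncompressed input read, which is only a guess based on this single aborted run:

```python
# Numbers from the aborted run above.
uncompressed_read_gib = 15.63  # uncompressed Turtle read before the OOM kill
vmsize_gib = 13.04             # last observed VmSize of rdf2hdt
full_dump_gib = 187            # full uncompressed TTL dump

# Assume memory grows linearly with the uncompressed input read (unverified).
mem_per_gib = vmsize_gib / uncompressed_read_gib   # ≈ 0.83 GiB per GiB of input
estimate_gib = mem_per_gib * full_dump_gib         # ≈ 156 GiB
print(f"≈ {estimate_gib:.0f} GiB ≈ {estimate_gib * 2**30 / 1e9:.0f} GB")
```

Rounding that up to the full uncompressed dump size (for some headroom, and because the per-triple memory cost might not stay constant) gives the ~200 GB figure above.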

As for the processing time, on my system 9% of the dump was processed in 23 minutes, so the full conversion would probably take a few hours, but not days. The CPU time as reported by Bash’s `time` builtin was actually less than the wall-clock time, so it doesn’t look like the tool is multi-threaded. But of course it’s possible that there is some additional phase of processing after the tool is done reading the file, and I have no idea how long that could take.
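
The corresponding rough time extrapolation (again just a sketch: it assumes progress stays linear and ignores any post-processing phase after the input is read):

```python
# Numbers from the partial run above.
fraction_done = 0.09     # 9 % of the dump processed...
minutes_elapsed = 23     # ...in 23 minutes

# Linear extrapolation to the full dump (ignores any later processing phase).
total_minutes = minutes_elapsed / fraction_done              # ≈ 256 min
print(f"≈ {total_minutes:.0f} min ≈ {total_minutes / 60:.1f} hours")
```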
...

TASK DETAIL
https://phabricator.wikimedia.org/T179681

To: Lucas_Werkmeister_WMDE
Cc: Addshore, Smalyshev, Ladsgroup, Arkanosis, Tarrow, Lucas_Werkmeister_WMDE, Aklapper, Lahi, GoranSMilovanovic, QZanden, Wikidata-bugs, aude, Mbch331