Smalyshev added a comment.

> it doesn’t even open the output file until it’s done converting

That might be a problem when we have 4bn triples... I think the "load the whole thing into memory" approach is doomed - even if we find a way to get past memory limits for the current dump, what happens when it doubles in size?

The idea that you need to keep everything in memory to compress/optimize is of course not true - you can still do pretty fine with disk-based storage; that's what Blazegraph does, for example, and probably nearly every other graph DB. Yes, it would be a bit slower and would require some careful programming, but it's not something that should be impossible. Unfortunately, https://github.com/rdfhdt/hdt-cpp/issues/119 sounds like the people behind HDT are not interested in doing this work. Without it, the idea of converting the Wikidata data set is a no-go, unfortunately - I do not see how the Wikidata data set can be served with the "load up everything in memory" paradigm. If we find somebody who wants to/can do the work that allows HDT to process large datasets, then I think it is a good idea to have it in the dumps, but not before that.
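To make the point concrete, here is a minimal sketch (Python, not hdt-cpp) of the classic external-merge-sort technique that keeps memory bounded while building a sorted, deduplicated term dictionary for a huge dump - the kind of work the HDT converter currently does entirely in RAM. RUN_SIZE, write_run and build_dictionary are illustrative names of my own, not anything from the actual codebase:

```
# Sketch only: build a sorted, deduplicated term dictionary with bounded
# memory by spilling sorted runs to disk and merging them in a stream.
import heapq
import os
import tempfile

RUN_SIZE = 5_000_000  # terms per in-memory run; tune to available RAM

def write_run(terms, run_dir):
    """Sort one bounded batch of terms in memory and spill it to disk."""
    path = os.path.join(run_dir, "run-%d.txt" % len(os.listdir(run_dir)))
    with open(path, "w", encoding="utf-8") as f:
        for t in sorted(set(terms)):
            f.write(t + "\n")
    return path

def build_dictionary(term_iter, out_path):
    """Stream terms, spill sorted runs, then k-way merge the runs into one
    deduplicated dictionary file. Peak memory stays around RUN_SIZE terms
    regardless of how many triples the dump contains."""
    run_dir = tempfile.mkdtemp(prefix="dict-runs-")
    batch, runs = [], []
    for term in term_iter:
        batch.append(term)
        if len(batch) >= RUN_SIZE:
            runs.append(write_run(batch, run_dir))
            batch = []
    if batch:
        runs.append(write_run(batch, run_dir))

    files = [open(r, encoding="utf-8") for r in runs]
    with open(out_path, "w", encoding="utf-8") as out:
        prev = None
        for line in heapq.merge(*files):  # streaming k-way merge of sorted runs
            if line != prev:              # drop duplicates across runs
                out.write(line)
                prev = line
    for f in files:
        f.close()
```

It trades RAM for sequential disk I/O, which is exactly the trade-off disk-based stores like Blazegraph make; doing the same for the other HDT build phases is more work, but nothing about the format itself forbids it.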


TASK DETAIL
https://phabricator.wikimedia.org/T179681
