2017-10-31 14:56 GMT+01:00 Laura Morales <laure...@mail.com>:
> 1. I have downloaded it and I'm trying to use it, but the HDT tools (e.g.
> query) require building an index before I can use the HDT file. I've tried to
> create the index, but I ran out of memory again (even though the index is
> smaller than the .hdt file itself). So any Wikidata dump should contain both
> the .hdt file and the .hdt.index file, unless there is another way to generate
> the index on commodity hardware.

I've just loaded the provided HDT file on a big machine (32 GiB wasn't
enough to build the index, but ten times that is more than enough), so
here are a few interesting metrics:
 - the index alone is ~14 GiB uncompressed, ~9 GiB gzipped and
~6.5 GiB xz-compressed;
 - once loaded in hdtSearch, Wikidata uses ~36 GiB of virtual memory;
 - right after index generation, it includes ~16 GiB of anonymous
memory (with no memory pressure, that's ~26 GiB resident)…
 - …but after a reload, the index is memory-mapped as well, so it only
includes ~400 MiB of anonymous memory (and a mere ~1.2 GiB resident);
see the loading sketch below.
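
In case it helps, here is roughly what loading it looks like from
Python. This is only a sketch: it assumes the pyHDT bindings (the "hdt"
package, which wraps the same libhdt code), and the file name is a
placeholder. As far as I understand, the first open generates the side
index next to the .hdt file, and later opens memory-map both files
instead of reading them into RAM.

    # Sketch only: assumes the pyHDT bindings ("hdt" on PyPI); "wikidata.hdt"
    # is a placeholder path. Opening the document triggers index generation on
    # the first run; afterwards the existing index file is memory-mapped.
    from hdt import HDTDocument

    doc = HDTDocument("wikidata.hdt")

    # A few of the counters exposed by the bindings.
    print("triples:   ", doc.total_triples)
    print("subjects:  ", doc.nb_subjects)
    print("predicates:", doc.nb_predicates)
    print("objects:   ", doc.nb_objects)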

Looks like a good candidate for commodity hardware, indeed. It loads
in less than one second on a 32 GiB machine. I'll try to run a few
queries to see how it behaves.
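
For instance, a single triple-pattern lookup would look something like
this (same caveat: the pyHDT bindings are assumed, and the wdt:P31
predicate IRI is just an arbitrary example):

    # Same assumed pyHDT bindings: enumerate triples matching one pattern.
    # The wdt:P31 IRI is only an example predicate.
    from hdt import HDTDocument

    doc = HDTDocument("wikidata.hdt")
    triples, cardinality = doc.search_triples(
        "", "http://www.wikidata.org/prop/direct/P31", "")

    print("estimated matches:", cardinality)
    for i, (s, p, o) in enumerate(triples):
        print(s, p, o)
        if i >= 9:  # only show the first ten results
            break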

FWIW, my use case is very similar to yours: I'd like to run queries
that are too long for the public SPARQL endpoint, and I can't dedicate a
powerful machine to this full time (Blazegraph runs fine with 32 GiB,
though — it just takes a while to index, and updates don't keep pace
with the changes happening on wikidata.org).

-- 
Jérémie
