Re: Text indexing Wikidata

Andy Seaborne Sat, 19 Feb 2022 04:44:40 -0800



On 19/02/2022 08:00, Lorenz Buehmann wrote:

Hi,
so far you can't do anything else - the whole indexing pipeline is single-threaded as far as I know. It simply iterates all properties declared to be used for fetching the RDF triple values - Lucene indexing itself would be threadsafe, so the easiest thing would be to apply one writer thread per property. This clearly would not help here when you just set rdfs:label as only property. Thus, we would have to also split the dataset somehow for the given property and then would be able to distribute each split to a separate writer thread.
The main loop is here and makes it rather easy to understand where we could introduce parallelism: https://github.com/apache/jena/blob/main/jena-text/src/main/java/org/apache/jena/query/text/cmd/textindexer.java#L125-L143
Multiple read from a dataset is trivial, we just have to get appropriate splits - not sure how easy this is, maybe a cursor/iterator on the subjects with different offsets or something?


Read the dataset on a single thread,
split the triples:
collect blocks of triples (1000? 100000?) and send each block to a separate thread;
the other threads do the Lucene indexing.
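
A minimal sketch of the scheme above, in plain Java with no Jena or Lucene dependency. Everything named here is hypothetical and not part of jena-text: the class BlockedIndexingSketch, the method indexInBlocks, the use of strings as stand-ins for triples, and the indexBlock callback, which in a real indexer would presumably add documents to a shared Lucene writer (thread-safe, as noted earlier in the thread). The block size and worker count are the open tuning parameters.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

public class BlockedIndexingSketch {
    // Sentinel block that tells a worker to stop; compared by identity only.
    private static final List<String> END = new ArrayList<>();

    /**
     * Reads items on the calling thread, groups them into blocks and hands
     * each block to one of the worker threads. Returns the number of items read.
     * "indexBlock" is a hypothetical stand-in for the real indexing step.
     */
    public static long indexInBlocks(Iterator<String> source, int blockSize, int workers,
                                     Consumer<List<String>> indexBlock)
            throws InterruptedException {
        // Bounded queue: the reader blocks when the workers fall behind.
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(workers * 2);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    // Each worker takes blocks until it sees the sentinel.
                    for (List<String> b = queue.take(); b != END; b = queue.take()) {
                        indexBlock.accept(b);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        long count = 0;
        List<String> block = new ArrayList<>(blockSize);
        while (source.hasNext()) {          // single reader thread
            block.add(source.next());
            count++;
            if (block.size() == blockSize) {
                queue.put(block);
                block = new ArrayList<>(blockSize);
            }
        }
        if (!block.isEmpty()) {
            queue.put(block);               // last, partially filled block
        }
        for (int i = 0; i < workers; i++) {
            queue.put(END);                 // one sentinel per worker
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 10_500; i++) {
            labels.add("label " + i);
        }
        AtomicLong indexed = new AtomicLong();
        long read = indexInBlocks(labels.iterator(), 1000, 4,
                b -> indexed.addAndGet(b.size()));
        System.out.println(read + " read, " + indexed.get() + " indexed");
    }
}
```

Error handling is left out: if indexBlock throws, that worker stops and the reader may eventually block on the full queue, so a real version would need to catch and report failures. The bounded queue is a deliberate choice so that the reader cannot run arbitrarily far ahead of the indexing threads.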


@Andy what do you think?


Good idea.

    Andy

On 18.02.22 09:59, Neubert, Joachim wrote:
Text indexing the truthy Wikidata dump took 13:10 h for 1.5b labels (in parts using text:LowerCaseKeywordAnalyzer) on the massive parallel machine.
I observed a CPU usage of 100-250 %, and wonder if I could do something to speed up. My command line simply was
java -cp /opt/fuseki/fuseki-server.jar jena.textindexer --debug --desc=/tmp/temp.ttl
(apache-jena-fuseki-4.5.0-SNAPSHOT)

Cheers, Joachim

--
Joachim Neubert

ZBW - Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-40-42834-462
