On 19/02/2022 08:00, Lorenz Buehmann wrote:
Hi,

So far you can't do anything else: as far as I know, the whole indexing pipeline is single-threaded. It simply iterates over all properties declared for indexing and fetches the RDF triple values for each. Lucene indexing itself is thread-safe, so the easiest change would be to use one writer thread per property. That clearly would not help here, where rdfs:label is the only property. So we would also have to split the dataset for the given property somehow, and could then hand each split to a separate writer thread.

The main loop is here and makes it rather easy to understand where we could introduce parallelism: https://github.com/apache/jena/blob/main/jena-text/src/main/java/org/apache/jena/query/text/cmd/textindexer.java#L125-L143

Reading from a dataset with multiple threads is trivial; we just have to get appropriate splits. I'm not sure how easy that is. Maybe a cursor/iterator over the subjects with different offsets, or something similar?

Read on a single thread,
split the triples,
collect blocks of triples (1000? 100000?) and send each block to a separate thread;
the other threads do the Lucene indexing.
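
The steps above could be sketched roughly as follows in plain Java. This is only an illustration under assumptions, not code from jena-text: the class name BatchedIndexingSketch, the run method and the indexBatch callback are made up, triples are represented as plain strings, and the callback stands in for the Lucene work. In a real version the workers would presumably share a single Lucene IndexWriter, which is documented as safe to use from several threads.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch: one reader thread, several indexing threads.
public class BatchedIndexingSketch {

    // Sentinel block that tells a worker to stop (compared by identity).
    private static final List<String> POISON = new ArrayList<>();

    /**
     * Reads items on the calling thread, groups them into blocks of
     * batchSize, and hands the blocks to `workers` indexing threads.
     * indexBatch is a stand-in (assumption) for the Lucene indexing step.
     */
    public static void run(Iterator<String> source, int batchSize, int workers,
                           Consumer<List<String>> indexBatch) throws Exception {
        // Bounded queue so the reader cannot run far ahead of the indexers.
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(workers * 2);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicReference<Throwable> failure = new AtomicReference<>();
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            futures.add(pool.submit(() -> {
                while (true) {
                    List<String> batch = queue.take();
                    if (batch == POISON) return null;
                    try {
                        indexBatch.accept(batch);
                    } catch (RuntimeException e) {
                        // Remember the first error but keep draining the
                        // queue so the reader never blocks forever.
                        failure.compareAndSet(null, e);
                    }
                }
            }));
        }
        try {
            List<String> batch = new ArrayList<>(batchSize);
            while (source.hasNext()) {
                batch.add(source.next());
                if (batch.size() == batchSize) {
                    queue.put(batch);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) queue.put(batch);           // last partial block
            for (int i = 0; i < workers; i++) queue.put(POISON); // one stop signal per worker
            for (Future<?> f : futures) f.get();               // wait for all workers
        } finally {
            pool.shutdownNow(); // also interrupts workers if the reader failed
        }
        if (failure.get() != null) {
            throw new RuntimeException("indexing failed", failure.get());
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> triples = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            triples.add("<s" + i + "> rdfs:label \"label " + i + "\"");
        }
        AtomicLong indexed = new AtomicLong();
        // The callback only counts; real code would build and add documents.
        run(triples.iterator(), 1000, 4, b -> indexed.addAndGet(b.size()));
        System.out.println("indexed " + indexed.get()); // prints "indexed 2500"
    }
}
```

The block size and the number of workers are the two knobs to experiment with. The bounded queue keeps memory use flat when the reader is faster than the indexers.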


@Andy what do you think?

Good idea.

    Andy


On 18.02.22 09:59, Neubert, Joachim wrote:
Text indexing the truthy Wikidata dump took 13:10 h for 1.5 billion labels (in parts using text:LowerCaseKeywordAnalyzer) on the massively parallel machine.

I observed CPU usage of 100-250 % and wonder whether I could do something to speed it up. My command line was simply

java -cp /opt/fuseki/fuseki-server.jar jena.textindexer --debug --desc=/tmp/temp.ttl

(apache-jena-fuseki-4.5.0-SNAPSHOT)

Cheers, Joachim

--
Joachim Neubert

ZBW - Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-40-42834-462

