On 19/02/2022 08:00, Lorenz Buehmann wrote:
Hi,
so far you can't do anything else - as far as I know, the whole indexing
pipeline is single-threaded. It simply iterates over all properties
declared for indexing and fetches the RDF triple values for each one.
Lucene indexing itself is thread-safe, so the easiest change would be to
use one writer thread per property. That clearly would not help here,
since you set rdfs:label as the only property. So we would also have to
split the dataset somehow for the given property; we could then
distribute each split to a separate writer thread.
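
To make the "split per property" idea concrete, here is a minimal sketch of cutting a stream of N label positions into contiguous ranges and giving each range to its own writer thread. Everything in it is an assumption for illustration: SplitSketch and split are hypothetical names, the in-memory list stands in for the dataset, and the loop body stands in for the actual Lucene call. It is not how textindexer works today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class SplitSketch {

    /** Cuts [0, total) into `parts` contiguous half-open ranges {start, end}. */
    static List<long[]> split(long total, int parts) {
        if (parts <= 0) throw new IllegalArgumentException("parts must be > 0");
        List<long[]> ranges = new ArrayList<>();
        long base = total / parts;
        long extra = total % parts;   // the first `extra` ranges get one more item
        long start = 0;
        for (int i = 0; i < parts; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges.add(new long[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for the labels of one property.
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 10; i++) labels.add("label-" + i);

        AtomicInteger indexed = new AtomicInteger();
        List<Thread> writers = new ArrayList<>();
        for (long[] r : split(labels.size(), 3)) {
            Thread t = new Thread(() -> {
                for (long i = r[0]; i < r[1]; i++) {
                    labels.get((int) i);        // stand-in for indexing one label
                    indexed.incrementAndGet();
                }
            });
            writers.add(t);
            t.start();
        }
        for (Thread t : writers) t.join();
        System.out.println("indexed " + indexed.get() + " labels");
    }
}
```

Whether a real dataset can hand out such ranges cheaply is exactly the open question; the sketch only shows the bookkeeping.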
The main loop is here and makes it rather easy to understand where we
could introduce parallelism:
https://github.com/apache/jena/blob/main/jena-text/src/main/java/org/apache/jena/query/text/cmd/textindexer.java#L125-L143
Reading from a dataset with multiple threads is trivial; we just have to
get appropriate splits. I am not sure how easy that is - maybe a
cursor/iterator over the subjects with different offsets or something?
Read on a single thread,
split the triples,
collect blocks of triples (1000? 100000?) and send each block to a separate thread;
the other threads do the Lucene indexing.
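
The steps above could be sketched roughly as follows: one reader thread fills blocks and a fixed pool of workers indexes them. This is only an illustration under assumptions, not the textindexer code. BlockIndexingSketch, indexInBlocks and the indexBlock callback are hypothetical names, the callback stands in for the Lucene calls, and the limit of two in-flight blocks per worker is an arbitrary choice to keep the reader from running far ahead of the indexers.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class BlockIndexingSketch {

    /** Reads on the calling thread, batches into blocks, indexes blocks on a pool. */
    static <T> long indexInBlocks(Iterator<T> source, int blockSize, int workers,
                                  Consumer<List<T>> indexBlock) throws InterruptedException {
        int maxInFlight = workers * 2;               // bounds memory held in queued blocks
        Semaphore slots = new Semaphore(maxInFlight);
        AtomicReference<Throwable> failure = new AtomicReference<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        long count = 0;
        try {
            List<T> block = new ArrayList<>(blockSize);
            while (source.hasNext() && failure.get() == null) {
                block.add(source.next());
                count++;
                if (block.size() == blockSize) {
                    submit(pool, slots, failure, block, indexBlock);
                    block = new ArrayList<>(blockSize);
                }
            }
            if (!block.isEmpty()) submit(pool, slots, failure, block, indexBlock);
            slots.acquire(maxInFlight);              // wait until every block has finished
        } finally {
            pool.shutdown();
        }
        if (failure.get() != null) {
            throw new IllegalStateException("indexing a block failed", failure.get());
        }
        return count;                                // number of items read
    }

    private static <T> void submit(ExecutorService pool, Semaphore slots,
                                   AtomicReference<Throwable> failure, List<T> block,
                                   Consumer<List<T>> indexBlock) throws InterruptedException {
        slots.acquire();                             // blocks the reader when workers lag
        pool.execute(() -> {
            try {
                indexBlock.accept(block);
            } catch (Throwable t) {
                failure.compareAndSet(null, t);      // remember the first failure only
            } finally {
                slots.release();
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 10_500; i++) labels.add("label-" + i);
        AtomicLong indexed = new AtomicLong();
        long read = indexInBlocks(labels.iterator(), 1000, 4,
                b -> indexed.addAndGet(b.size()));   // stand-in for the Lucene indexing
        System.out.println("read " + read + ", indexed " + indexed.get());
    }
}
```

The block size is the tuning knob asked about above: small blocks mean more hand-off overhead, large blocks mean more memory held per worker.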
@Andy what do you think?
Good idea.
Andy
On 18.02.22 09:59, Neubert, Joachim wrote:
Text indexing the truthy Wikidata dump took 13:10 h for 1.5b labels
(partly using text:LowerCaseKeywordAnalyzer) on the massively parallel
machine.
I observed a CPU usage of 100-250 % and wonder if I could do
something to speed it up. My command line was simply
java -cp /opt/fuseki/fuseki-server.jar jena.textindexer --debug
--desc=/tmp/temp.ttl
(apache-jena-fuseki-4.5.0-SNAPSHOT)
Cheers, Joachim
--
Joachim Neubert
ZBW - Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-40-42834-462