Looking at the source (and stating the obvious), creating a new StringTokenStream from here only happens under certain conditions:

* the field is indexed
* the field is not tokenized (e.g. not using an analyzer: an ID field or similar)
* the incoming "reuse" parameter is not a StringTokenStream

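To make those three conditions concrete, here is a minimal, self-contained sketch. It is a simplified model and not Lucene's actual code: ReuseSketch, StreamModel, StringStreamModel, tokenStream(...) and simulate(...) are all hypothetical names invented for illustration, and a "slot" stands for whatever object holds the reused stream between documents.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReuseSketch {
    /** Counts constructions of the stand-in stream (hypothetical, for illustration only). */
    static final AtomicInteger CREATED = new AtomicInteger();

    /** Hypothetical stand-in for a token stream. */
    static class StreamModel {}

    /** Hypothetical stand-in for the string token stream; constructing it is the costly part. */
    static final class StringStreamModel extends StreamModel {
        String value;
        StringStreamModel() { CREATED.incrementAndGet(); }
    }

    /** Models the three conditions: indexed, not tokenized, reuse is not already the right type. */
    static StreamModel tokenStream(boolean indexed, boolean tokenized,
                                   String value, StreamModel reuse) {
        if (!indexed) {
            return null;            // not indexed: nothing to create
        }
        if (tokenized) {
            return reuse;           // analyzer path, not modelled in this sketch
        }
        if (!(reuse instanceof StringStreamModel)) {
            reuse = new StringStreamModel();   // only a fresh or mismatched slot pays this cost
        }
        ((StringStreamModel) reuse).value = value;
        return reuse;
    }

    /** Each slot starts empty (null reuse) and then sees docsPerSlot documents. */
    static int simulate(int slots, int docsPerSlot) {
        CREATED.set(0);
        for (int s = 0; s < slots; s++) {
            StreamModel reuse = null;
            for (int d = 0; d < docsPerSlot; d++) {
                reuse = tokenStream(true, false, "id-" + d, reuse);
            }
        }
        return CREATED.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate(1, 1000)); // prints 1
        System.out.println(simulate(3, 1000)); // prints 3
    }
}
```

In this model the number of constructions equals the number of fresh slots and does not depend on how many documents each slot sees, so the cost only becomes visible if slots are recreated often.
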
What is puzzling to me is that it only seems to hit the 4KB documents. If there is an issue here, I'd expect it to have an even higher impact on indexing of the 1KB documents.

But also, the internal reuse of IndexingChain.PerField (which houses the reused tokenstream) isn't just per-thread, it is per-thread-per-segment, right? So if Mike is indexing with 100 threads and flushes 200 times, I'd expect 100 x 200 = 20k of these things to be made.

There's a lot going on in the benchmark code for nightly, and it is tricky for me to navigate the various cases (1KB, 1KB-with-vectors, 4KB, "deterministic indexing", etc.).

On Thu, Oct 21, 2021 at 3:40 AM Adrien Grand <[email protected]> wrote:
>
> Hello,
>
> I've been looking a bit more carefully at nightly benchmarks recently and I'm
> puzzled by the fact that indexing spends almost 5% of the time on
> AttributeSource#addAttribute. Here is the link.
>
> 4.37%    14731    org.apache.lucene.util.AttributeSource#addAttribute()
>     at org.apache.lucene.document.Field$StringTokenStream#()
>     at org.apache.lucene.document.Field#tokenStream()
>     at org.apache.lucene.index.IndexingChain$PerField#invert()
>     at org.apache.lucene.index.IndexingChain#processField()
>     at org.apache.lucene.index.IndexingChain#processDocument()
>     at org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments()
>     at org.apache.lucene.index.DocumentsWriter#updateDocuments()
>     at org.apache.lucene.index.IndexWriter#updateDocuments()
>     at org.apache.lucene.index.IndexWriter#updateDocument()
>     at org.apache.lucene.index.IndexWriter#addDocument()
>     at perf.IndexThreads$IndexThread#run()
>
> Given that nightly benchmarks reuse Field instances across documents, this
> should only happen once per thread, so why does it show up as a bottleneck in
> our nightly benchmarks? I tried to reproduce locally, but I'm not seeing
> AttributeSource among top CPU consumers.
>
> --
> Adrien
