Thanks Robert and Mike for helping dig. I opened https://issues.apache.org/jira/browse/LUCENE-10203.
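
For reference, a rough sketch of the arithmetic discussed in the quoted thread: how many reusable per-field objects get created if reuse is lost on every flush, versus cached once per thread. This is back-of-the-envelope code only. It does not call Lucene, the class and method names are hypothetical, and the field count of 6 is an assumed value chosen just to show how ~555 flushes could reach the ~3K range.

```java
// Back-of-the-envelope arithmetic only; nothing here calls Lucene.
// Class and method names are hypothetical, for illustration.
public class ReuseEstimate {

    /** Objects created when reuse is per thread AND per flushed segment (lost on every flush). */
    public static long perSegmentInstances(long threads, long flushes, long fields) {
        return threads * flushes * fields;
    }

    /** Objects created when each thread caches one object per field across flushes. */
    public static long perThreadInstances(long threads, long fields) {
        return threads * fields;
    }

    public static void main(String[] args) {
        // Robert's hypothetical: 100 threads, 200 flushes, one field.
        System.out.println(perSegmentInstances(100, 200, 1)); // 20000

        // Same workload if the object survived flushes.
        System.out.println(perThreadInstances(100, 1)); // 100

        // Single-threaded run with ~555 flushes; 6 fields is an ASSUMED count.
        System.out.println(perSegmentInstances(1, 555, 6)); // 3330
    }
}
```

The gap between the first two numbers is the cost of tying reuse to the segment rather than to the thread.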

On Thu, Oct 21, 2021 at 3:22 PM Michael McCandless <[email protected]> wrote:

> LOL don't cross the tokenstreams!
>
> Yeah should be 555 or 556 flushes I think. Probably times the number of
> indexed fields, gets us to the 3K count?
>
> +1 to improve IW's internal re-use in the non-analyzed StringField case.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Oct 21, 2021 at 9:14 AM Robert Muir <[email protected]> wrote:
>
>> So ~ 555 flushes?
>>
>> I see over 3k samples from Adrien's link in
>>
>> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter#()
>> I still think the issue is, tokenstreams from analyzers reuse "better"
>> than ones from StringField, because they have a threadlocal? Whereas
>> the StringField relies upon the reuse of IndexingChain.PerField.
>>
>> Maybe it can be better inside IndexWriter, so that it isn't lost on
>> flush? Just don't cross the tokenstreams. It would be bad :)
>>
>> On Thu, Oct 21, 2021 at 9:03 AM Michael McCandless
>> <[email protected]> wrote:
>> >
>> > Ahh we are indeed doing that. The maxBufferedDocs is
>> > total-doc-count / 555, to provoke precisely a "5 big segments +
>> > 5 medium segments + 5 baby segments" consistent segment geometry
>> > in the end.
>> >
>> > But that works out to:
>> >
>> > maxBufferedDocs=49774
>> >
>> > Which is not too tiny?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Thu, Oct 21, 2021 at 8:52 AM Robert Muir <[email protected]> wrote:
>> >>
>> >> Yeah, I'm pretty lost in all the ways we index here. But if we are
>> >> passing maxBufferedDocs <low number> for this deterministic indexing,
>> >> I think it would cause the issue? I have no idea what the IW config
>> >> here is...
>> >>
>> >> On Thu, Oct 21, 2021 at 8:48 AM Robert Muir <[email protected]> wrote:
>> >> >
>> >> > On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <[email protected]> wrote:
>> >> > >
>> >> > > But also the internal reuse of IndexingChain.PerField (which houses
>> >> > > the reused tokenstream) isn't just per-thread, it is
>> >> > > per-thread-per-segment, right? So if Mike is indexing with 100
>> >> > > threads, and flushes 200 times, I'd expect 20k of these things to be
>> >> > > made. There's a lot going on in the benchmark code for nightly and it
>> >> > > is tricky for me to try to navigate the various cases (1KB,
>> >> > > 1KB-with-vectors, 4KB, "deterministic indexing", etc)
>> >> >
>> >> > I think this might be the case with your link. If you look at the URL
>> >> > of your actual link, you see it ends with #profiler_4kb_indexing_1_cpu
>> >> > ?
>> >> > This makes me think i'm looking at the profiler output of the
>> >> > "deterministic indexing".
>> >> > For this one, LogDocMergePolicy is used.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]

--
Adrien
