LOL, don't cross the tokenstreams! Yeah, it should be 555 or 556 flushes, I think. That, times the number of indexed fields, probably gets us to the 3K count?
+1 to improve IW's internal re-use in the non-analyzed StringField case.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Oct 21, 2021 at 9:14 AM Robert Muir <[email protected]> wrote:

> So ~ 555 flushes?
>
> I see over 3k samples from Adrien's link in
> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter#()
>
> I still think the issue is, tokenstreams from analyzers reuse "better"
> than ones from StringField, because they have a threadlocal? Whereas
> the StringField relies upon the reuse of IndexingChain.PerField.
>
> Maybe it can be better inside IndexWriter, so that it isn't lost on
> flush? Just don't cross the tokenstreams. It would be bad :)
>
> On Thu, Oct 21, 2021 at 9:03 AM Michael McCandless
> <[email protected]> wrote:
> >
> > Ahh we are indeed doing that. The maxBufferedDocs is total-doc-count /
> > 555, to provoke precisely a "5 big segments + 5 medium segments + 5 baby
> > segments" consistent segment geometry in the end.
> >
> > But that works out to:
> >
> > maxBufferedDocs=49774
> >
> > Which is not too tiny?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Thu, Oct 21, 2021 at 8:52 AM Robert Muir <[email protected]> wrote:
> >>
> >> Yeah, I'm pretty lost in all the ways we index here. But if we are
> >> passing maxBufferedDocs <low number> for this deterministic indexing,
> >> I think it would cause the issue? I have no idea what the IW config
> >> here is...
> >>
> >> On Thu, Oct 21, 2021 at 8:48 AM Robert Muir <[email protected]> wrote:
> >> >
> >> > On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <[email protected]> wrote:
> >> > >
> >> > > But also the internal reuse of IndexingChain.PerField (which houses
> >> > > the reused tokenstream) isn't just per-thread, it is
> >> > > per-thread-per-segment, right? So if Mike is indexing with 100
> >> > > threads, and flushes 200 times, I'd expect 20k of these things to be
> >> > > made. There's a lot going on in the benchmark code for nightly and
> >> > > it is tricky for me to try to navigate the various cases (1KB,
> >> > > 1KB-with-vectors, 4KB, "deterministic indexing", etc)
> >> >
> >> > I think this might be the case with your link. If you look at the URL
> >> > of your actual link, you see it ends with #profiler_4kb_indexing_1_cpu
> >> > ?
> >> > This makes me think i'm looking at the profiler output of the
> >> > "deterministic indexing".
> >> > For this one, LogDocMergePolicy is used.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
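
The counts discussed in this thread can be sanity-checked with a little arithmetic. The sketch below is a hypothetical helper, not Lucene code: the class and method names are made up for illustration. It assumes one flush per maxBufferedDocs documents plus one final flush for any leftover docs, and it takes at face value the hypothesis above that the reused per-field state is rebuilt once per thread per flushed segment. The field count of 6 is purely an illustrative assumption; the thread does not state how many indexed fields the benchmark uses.

```java
public class FlushEstimate {

    /** Number of flushes if a segment is flushed every maxBufferedDocs docs (ceiling division). */
    static long flushCount(long totalDocs, long maxBufferedDocs) {
        return (totalDocs + maxBufferedDocs - 1) / maxBufferedDocs;
    }

    /** Instances created if the reusable object is rebuilt per thread for every flushed segment. */
    static long perThreadPerSegment(long threads, long flushes) {
        return threads * flushes;
    }

    /** Instances created if the reusable object lives in a thread-local and survives flushes. */
    static long perThread(long threads) {
        return threads;
    }

    public static void main(String[] args) {
        long maxBufferedDocs = 49774;                 // value quoted in the thread
        long evenlyDivisible = 555 * maxBufferedDocs; // hypothetical total with no remainder
        long withRemainder = evenlyDivisible + 1;     // hypothetical total with one leftover doc

        // Integer division in "total-doc-count / 555" can leave a remainder,
        // which is why the answer is "555 or 556".
        System.out.println("flushes (no remainder): " + flushCount(evenlyDivisible, maxBufferedDocs));
        System.out.println("flushes (remainder):    " + flushCount(withRemainder, maxBufferedDocs));

        // Single-threaded deterministic indexing; the field count is an assumption.
        long assumedIndexedFields = 6;
        long flushes = flushCount(evenlyDivisible, maxBufferedDocs);
        System.out.println("per-field writers:      " + flushes * assumedIndexedFields);

        // The 100-thread, 200-flush example from the quoted message.
        System.out.println("per-thread-per-segment: " + perThreadPerSegment(100, 200));
        System.out.println("thread-local:           " + perThread(100));
    }
}
```

With these assumptions, per-thread-per-segment reuse grows with the flush count while thread-local reuse does not, which is the gap the proposal to improve reuse inside IndexWriter would close.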
