Thanks Robert and Mike for helping dig. I opened
https://issues.apache.org/jira/browse/LUCENE-10203.
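
For anyone skimming, here is a rough back-of-envelope sketch of the numbers from the quoted thread below. It is only arithmetic, not Lucene code. The function names are made up for illustration, and the "one PerField per thread, per flushed segment, per indexed field" model is the assumption discussed in the thread, not something verified against IndexWriter.

```python
# Back-of-envelope arithmetic for the figures quoted in this thread.
# All helper names are hypothetical; nothing here calls Lucene.

MAX_BUFFERED_DOCS = 49774  # value quoted in the thread
TARGET_FLUSHES = 555       # divisor quoted in the thread


def approx_total_docs(max_buffered_docs: int, flushes: int) -> int:
    """Invert 'maxBufferedDocs = total-doc-count / 555' (approximate,
    since the original division may have been rounded)."""
    return max_buffered_docs * flushes


def per_field_instances(threads: int, flushes: int, fields: int = 1) -> int:
    """Assumed model: one IndexingChain.PerField (and its reused
    tokenstream) per thread, per flushed segment, per indexed field."""
    return threads * flushes * fields


def implied_fields(samples: int, flushes: int) -> float:
    """How many indexed fields would explain `samples` if each flush
    contributed one sample per field."""
    return samples / flushes


if __name__ == "__main__":
    print(approx_total_docs(MAX_BUFFERED_DOCS, TARGET_FLUSHES))
    print(per_field_instances(threads=100, flushes=200))
    print(round(implied_fields(3000, TARGET_FLUSHES), 1))
```

Under those assumptions, roughly 3K samples over ~555 flushes would correspond to about five or six indexed fields, which seems in line with the guess in the quoted thread.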

On Thu, Oct 21, 2021 at 3:22 PM Michael McCandless <[email protected]> wrote:

> LOL don't cross the tokenstreams!
>
> Yeah should be 555 or 556 flushes I think.  Probably times the number of
> indexed fields, gets us to the 3K count?
>
> +1 to improve IW's internal re-use in the non-analyzed StringField case.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Oct 21, 2021 at 9:14 AM Robert Muir <[email protected]> wrote:
>
>> So ~ 555 flushes?
>>
>> I see over 3k samples from Adrien's link in
>>
>> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter#()
>> I still think the issue is, tokenstreams from analyzers reuse "better"
>> than ones from StringField, because they have a threadlocal? Whereas
>> the StringField relies upon the reuse of IndexingChain.PerField.
>>
>> Maybe it can be better inside IndexWriter, so that it isn't lost on
>> flush? Just don't cross the tokenstreams. It would be bad :)
>>
>> On Thu, Oct 21, 2021 at 9:03 AM Michael McCandless
>> <[email protected]> wrote:
>> >
>> > Ahh we are indeed doing that.  The maxBufferedDocs is total-doc-count /
>> > 555, to provoke precisely a "5 big segments + 5 medium segments + 5 baby
>> > segments" consistent segment geometry in the end.
>> >
>> > But that works out to:
>> >
>> >     maxBufferedDocs=49774
>> >
>> > Which is not too tiny?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Thu, Oct 21, 2021 at 8:52 AM Robert Muir <[email protected]> wrote:
>> >>
>> >> Yeah, I'm pretty lost in all the ways we index here. But if we are
>> >> passing maxBufferedDocs <low number> for this deterministic indexing,
>> >> I think it would cause the issue? I have no idea what the IW config
>> >> here is...
>> >>
>> >> On Thu, Oct 21, 2021 at 8:48 AM Robert Muir <[email protected]> wrote:
>> >> >
>> >> > On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <[email protected]> wrote:
>> >> > >
>> >> > > But also the internal reuse of IndexingChain.PerField (which houses
>> >> > > the reused tokenstream) isn't just per-thread, it is
>> >> > > per-thread-per-segment, right? So if Mike is indexing with 100
>> >> > > threads, and flushes 200 times, I'd expect 20k of these things to be
>> >> > > made. There's a lot going on in the benchmark code for nightly and it
>> >> > > is tricky for me to try to navigate the various cases (1KB,
>> >> > > 1KB-with-vectors, 4KB, "deterministic indexing", etc)
>> >> >
>> >> > I think this might be the case with your link. If you look at the URL
>> >> > of your actual link, you see it ends with #profiler_4kb_indexing_1_cpu
>> >> > ?
>> >> > This makes me think i'm looking at the profiler output of the
>> >> > "deterministic indexing".
>> >> > For this one, LogDocMergePolicy is used.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>>

-- 
Adrien
