On 5/10/24 07:53, Andy Fan wrote:
> 
> Tomas Vondra <tomas.von...@enterprisedb.com> writes:
> 
>>> I guess both of you are talking about the worker process; here is
>>> what I have in mind:
>>>
>>> *btbuild* also lets the WORKER dump the tuples into the Sharedsort
>>> struct and lets the LEADER merge them directly. I think the aim of
>>> that design is that it can potentially save a merge-runs pass. In the
>>> current patch, workers dump into a local tuplesort and merge its runs,
>>> and then the leader runs the merge again. I admit the goal of this
>>> patch is reasonable, but I feel we need to apply this approach
>>> conditionally somehow, and if we find a way to do that, we can apply
>>> it to btbuild as well.
>>>
>>
>> I'm a bit confused about what you're proposing here, or how it relates
>> to what this patch is doing and/or to what Matthias mentioned in his
>> e-mail from last week.
>>
>> Let me explain the relevant part of the patch, and how I understand the
>> improvement suggested by Matthias. The patch does the work in three phases:
> 
> What I have in mind is:
> 
> 1. WORKER-1
> 
> Tempfile 1:
> 
> key1:  1
> key3:  2
> ...
> 
> Tempfile 2:
> 
> key5:  3
> key7:  4
> ...
> 
> 2. WORKER-2
> 
> Tempfile 1:
> 
> key2: 1
> key6: 2
> ...
> 
> Tempfile 2:
> 
> key3: 3
> key6: 4
> ...
> 
> In the above example, if we do the merge in the LEADER only, a single
> merge pass is needed: reading 4 tempfiles (8 tuples in total) and
> writing 8 tuples.
> 
> If we add another merge pass in the WORKERs, the result will be:
> 
> WORKER-1: reads 2 tempfiles (4 tuples) and writes 1 tempfile (call it X)
> with 4 tuples.
> WORKER-2: reads 2 tempfiles (4 tuples) and writes 1 tempfile (call it Y)
> with 4 tuples.
> 
> LEADER: reads 2 tempfiles (X & Y, 8 tuples in total) and writes them
> into the final tempfile.
> 
> So the intermediate results X & Y require some extra effort, which is
> why I think the "extra merge pass in the worker" is *not always* a win.
> My proposal is that we need to distinguish the cases, and decide in
> which of them we should add the "extra merge pass in the worker" step.
> 

The thing you're forgetting is that the mergesort in the worker is
*always* a simple append, because the lists are guaranteed to be
non-overlapping, so it's very cheap. The lists from different workers
are however very likely to overlap, and hence a "full" mergesort is
needed, which is way more expensive.

And not only that - without the intermediate merge, there would be very
many of those lists for the leader to merge.

If we do the append-only merges in the workers first, we still need to
merge them in the leader, of course, but we have few lists to merge
(only about one per worker).
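
To make that difference concrete, here's a toy sketch in plain C (made-up
code, nothing to do with the actual tuplesort internals). Merging runs
whose ranges don't overlap degenerates into an append with no comparisons
at all, while merging the overlapping lists coming from different workers
has to compare elements one by one:

#include <stdio.h>
#include <string.h>

/* append-only "merge": valid only when all of a's values precede b's */
static int
merge_append(const int *a, int na, const int *b, int nb, int *out)
{
    memcpy(out, a, na * sizeof(int));
    memcpy(out + na, b, nb * sizeof(int));
    return na + nb;
}

/* full merge: needed when the two sorted lists overlap */
static int
merge_full(const int *a, int na, const int *b, int nb, int *out)
{
    int     i = 0, j = 0, k = 0;

    while (i < na && j < nb)
    {
        if (a[i] <= b[j])
            out[k++] = a[i++];
        else
            out[k++] = b[j++];
    }
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];

    return k;
}

int
main(void)
{
    /* two runs from the same worker: disjoint ranges, append is enough */
    int     run1[] = {1, 2, 3};
    int     run2[] = {10, 11, 12};
    /* lists from two different workers: interleaved, full merge needed */
    int     w1[] = {1, 5, 9};
    int     w2[] = {2, 6, 10};
    int     out[16];
    int     n, i;

    n = merge_append(run1, 3, run2, 3, out);
    for (i = 0; i < n; i++)
        printf("%d ", out[i]);
    printf("\n");

    n = merge_full(w1, 3, w2, 3, out);
    for (i = 0; i < n; i++)
        printf("%d ", out[i]);
    printf("\n");

    return 0;
}

The cost difference is not just the comparisons, of course - the full
merge has to interleave reads from all its inputs, while the append can
simply stream them one after another.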

Of course, this means extra I/O on the intermediate tuplesort, and it's
not difficult to imagine cases with no benefit, or perhaps even a
regression. For example, if the keys are unique, the in-worker merge
step can't really do anything. But that seems quite unlikely IMHO.

Also, if this overhead were really significant, we would not see the nice
speedups I measured during testing.

>> The trouble with (2) is that it "just copies" data from one tuplesort
>> into another, increasing the disk space requirements. In an extreme
>> case, when nothing can be combined, it pretty much doubles the amount of
>> disk space, and makes the build longer.
> 
> This sounds like the same issue I talked about above. However, my
> proposal is to compare "the cost saved by merging TIDs in the WORKERS"
> against "the cost paid for the extra copy", and do the extra merge only
> when we are sure we benefit from it. But I know that is hard, and I'm
> not sure it is doable.
> 

Yeah. I'm not against picking the right execution strategy during the
index build, but it's going to be difficult, because we really don't
have the information to make a reliable decision.

We can't even use the per-column stats, because they don't say much
about the keys extracted by GIN, I think. And we need to make the
decision at the very beginning, before we write the first batch of data
either to the local or the shared tuplesort.

But maybe we could wait until we need to flush the first batch of data
(in the callback), and make the decision then? In principle, if we only
flush once at the end, the intermediate sort is not needed at all (fairly
unlikely for large data sets, though).

Well, in principle, maybe we could even start writing into the local
tuplesort, and then "rethink" after a while and switch to the shared
one. We'd still need to copy data we've already written to the local
tuplesort, but hopefully that'd be just a fraction compared to doing
that for the whole table.
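
Just to illustrate the idea (the helper and the 0.8 threshold below are
entirely made up, this is not what the patch does), the decision at the
first flush might boil down to checking how much the batch shrinks once
the TID lists for equal keys get combined:

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: decide whether the extra in-worker merge pass is
 * worth the extra I/O, based on how much the first flushed batch shrank
 * when TID lists for equal keys were combined.
 */
static bool
choose_worker_merge(double raw_bytes, double combined_bytes)
{
    /* nothing was combined, the extra pass would only add I/O */
    if (combined_bytes >= raw_bytes)
        return false;

    /* arbitrary threshold: require at least ~20% shrinkage */
    return (combined_bytes / raw_bytes) < 0.8;
}

int
main(void)
{
    /* unique keys: nothing combined, skip the extra in-worker merge */
    printf("unique keys: %s\n",
           choose_worker_merge(100.0, 100.0) ? "merge in worker" : "skip");

    /* many duplicates: big shrinkage, do the in-worker merge */
    printf("many duplicates: %s\n",
           choose_worker_merge(100.0, 40.0) ? "merge in worker" : "skip");

    return 0;
}

Whether a simple ratio like this is a good enough signal is a separate
question, of course - it only sees the first batch, which may not be
representative of the whole table.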


>> What I think Matthias is suggesting, is that this "TID list merging"
>> could be done directly as part of the tuplesort in step (1). So instead
>> of just moving the "sort tuples" from the appropriate runs, it could
>> also do an optional step of combining the tuples and writing this
>> combined tuple into the tuplesort result (for that worker).
> 
> OK, I get it now. So we have talked about a number of merges so far, at
> different stages and for different sets of tuples.
> 
> 1. "GIN deform buffer" did the TIDs merge for the same key for the tuples
> in one "deform buffer" batch, as what the current master is doing.
> 
> 2. "in memory buffer sort" stage, currently there is no TID merge so
> far and Matthias suggest that. 
> 
> 3. Merging the TIDs for the same keys in the LEADER only, vs. in the
> WORKER first and then in the LEADER. This is what your 0002 commit does
> now, and I raised some concerns about it above.
> 

OK
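
To spell out what (2) and (3) boil down to, here's a toy sketch (made-up
types and names, not the tuplesort or GIN API): walk the entries already
sorted by key, and whenever consecutive entries share a key, merge their
sorted TID lists into a single combined entry before writing it out:

#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TIDS 8              /* toy fixed-size TID list */

typedef struct ToyGinEntry
{
    int     key;
    int     ntids;
    int     tids[MAX_TIDS];
} ToyGinEntry;

/* merge two sorted TID arrays into dst, return the combined length */
static int
merge_tids(const int *a, int na, const int *b, int nb, int *dst)
{
    int     i = 0, j = 0, k = 0;

    while (i < na && j < nb)
    {
        if (a[i] <= b[j])
            dst[k++] = a[i++];
        else
            dst[k++] = b[j++];
    }
    while (i < na)
        dst[k++] = a[i++];
    while (j < nb)
        dst[k++] = b[j++];

    return k;
}

/*
 * Walk entries already sorted by key and emit one combined entry per key,
 * i.e. "combine the tuples while writing the run".
 */
static void
emit_combined(const ToyGinEntry *in, int n)
{
    int     i = 0;

    while (i < n)
    {
        ToyGinEntry out = in[i++];

        while (i < n && in[i].key == out.key)
        {
            int     tmp[MAX_TIDS];

            assert(out.ntids + in[i].ntids <= MAX_TIDS);
            out.ntids = merge_tids(out.tids, out.ntids,
                                   in[i].tids, in[i].ntids, tmp);
            memcpy(out.tids, tmp, out.ntids * sizeof(int));
            i++;
        }

        printf("key %d:", out.key);
        for (int t = 0; t < out.ntids; t++)
            printf(" %d", out.tids[t]);
        printf("\n");
    }
}

int
main(void)
{
    /* sorted by key; key 3 appears twice, with non-overlapping TID lists */
    ToyGinEntry in[] = {
        {1, 2, {10, 11}},
        {3, 2, {12, 13}},
        {3, 2, {40, 41}},
        {7, 1, {20}},
    };

    emit_combined(in, 4);
    return 0;
}

In the worker case the TID lists being merged never overlap, so the merge
inside merge_tids() effectively degenerates into the cheap append
discussed above.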

>> Matthias also mentioned this might be useful when building btree indexes
>> with key deduplication.
> 
>> AFAICS this might work, although it probably requires the "combined"
>> tuple to be smaller than the sum of the tuples being combined (in order
>> to fit into the space). But at least in the GIN build in the workers
>> this is likely true, because the TID lists do not overlap (and thus do
>> not hurt the compressibility).
>>
>> That being said, I still see this more as an optimization than something
>> required for the patch,
> 
> If the GIN deform buffer is big enough (say, larger than the in-memory
> buffer sort), would we gain anything from this, since the scope is
> limited to the tuples in the in-memory buffer sort?
> 

I don't think this is very likely. The only case when the GIN deform
buffer is "big enough" is when we don't need to flush in the callback,
but that is going to happen only for "small" tables. And for those we
should not really do parallel builds anyway. Even if we do, the overhead
would be pretty insignificant.

>> and I don't think I'll have time to work on this
>> anytime soon. The patch is not extremely complex, but it's not trivial
>> either. But if someone wants to take a stab at extending tuplesort to
>> allow this, I won't object ...
> 
> I agree with this. I am more interested in understanding the whole
> design and the scope of this patch, and then I can do some code review
> and testing. For now I am still in the "understanding design and scope"
> stage. If I'm too slow, please feel free to commit it at any time; I
> don't expect to find any valuable improvements or bugs. I probably need
> another 1 ~ 2 weeks to study this patch.
> 

Sure, happy to discuss and answer questions.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

