Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

2018-07-24 Thread François Dumont

Hi

    Any chance to review this patch ?

François


On 06/06/2018 18:39, François Dumont wrote:

Hi

    I review and rework this proposal. I noticed that the same idea to 
limit buffer size within inplace_merge also apply to stable_sort.


    I also change the decision when buffer is too small to consider 
the buffer size rather than going through successive cuts of the 
original ranges. This way we won't cut the range more than necessary. 
Note that if you prefer this second part could be committed seperatly.


    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__stable_partition_adaptive): When buffer to small cut range using
    buffer size.
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__merge_adaptive): When buffer too small consider smallest range 
first

    and cut based on buffer size.
    (__stable_sort_adaptive): When buffer too small cut based on buffer
    size.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
    to reduce requested buffer length on allocation failure.

Tested under Linux x86_64.

Ok to commit ?

François


On 25/01/2018 23:37, chang jc wrote:

Hi:

1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
__len = (__len == 1) ? 0 : ((__len + 1) / 2);

2. The coding gain has been shown  PR c++/83938; I re-post here




   21
   20
   19
   18
   17
   16


   0.471136
   0.625695
   0.767262
   0.907461
   1.04838
   1.19508


   0.340845
   0.48651
   0.639139
   0.770133
   0.898454
   1.04632

it means when Merge [0, 4325376, 16777216); A is a sorted integer with
4325376 & B with 12451840 elements, total with 16M integers

The proposed method has the speed up under given buffer size =, ex
2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
given sizof(int)*64K bytes.

3. As your suggestion, _TmpBuf __buf should be rewrite.

4. It represents a fact that the intuitive idea to split from larger
part is wrong.

For example, if you have an input sorted array A & B, A has 8 integers
& B has 24 integers. Given tmp buffer whose capacity = 4 integers.

Current it tries to split from B, right?

Then we have:

A1 | A2  B1 | B2

B1 & B2 has 12 integers each, right?

Current algorithm selects pivot as 13th integer from B, right?

If the corresponding upper bound of A is 6th integer.

Then it split in

A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12

After rotate, we have two arrays to merge

[A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]

Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.

Sadly, [A1 = 5 | B1 = 12] CANNOT.

So we do rotate again, split & merge the two split arrays from [A1 = 5
| B1 = 12] again.


But wait, if we always split from the smaller one instead of larger one.

After rotate, it promises two split arrays both contain 
ceiling[small/2].


Since tmp buffer also allocate by starting from sizeof(small) &
recursively downgrade by ceiling[small/2^(# of fail allocate)].

It means the allocated tmp buffer promises to be sufficient at the
level of (# of fail allocate).

Instead, you can see if split from large at level (# of fail allocate)
several split array still CANNOT use  tmp buffer to do buffered merge.


As you know, buffered merge is far speed then (split, rotate, and
merge two sub-arrays) (PR c++/83938 gives the profiling figures),

the way should provide speedup.


Thanks.










On 24/01/2018 18:23, François Dumont wrote:

Hi


 It sounds like a very sensitive change to make but nothing worth 
figures.

Do you have any bench showing the importance of the gain ?

 At least the memory usage optimization is obvious.

On 19/01/2018 10:43, chang jc wrote:

Current std::inplace_merge() suffers from performance issue by 
inefficient


logic under limited memory,

It leads to performance downgrade.

Please help to review it.

Index: include/bits/stl_algo.h
===
--- include/bits/stl_algo.h    (revision 256871)
+++ include/bits/stl_algo.h    (working copy)
@@ -2437,7 +2437,7 @@
 _BidirectionalIterator __second_cut = __middle;
 _Distance __len11 = 0;
 _Distance __len22 = 0;
-  if (__len1 > __len2)
+  if (__len1 < __len2)
   {
 __len11 = __len1 / 2;
 std::advance(__first_cut, __len11);
@@ -2539,9 +2539,15 @@
 const _DistanceType __len1 = std::distance(__first, __middle);
 const _DistanceType __len2 = std::distance(__middle, __last);

+

 typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
_TmpBuf;

-  _TmpBuf __buf(__first, __last);
-
+  _BidirectionalIterator __start, __end;
+  if (__len1 < __len2) {
+    __start = __first; __end = __middle;
+  } else {
+    __start = __middle; __end = __last;
+  }
+  _TmpBuf __buf(__start, ___end);

Note another optimization, we could introduce a _Temporary_buffer<> 
constructor

in order to write:

_TmpBuf __

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

2019-06-09 Thread François Dumont

On 12/21/18 9:57 PM, Jonathan Wakely wrote:

On 29/10/18 07:06 +0100, François Dumont wrote:

Hi

    Some feedback regarding this patch ?


Sorry this got missed, please resubmit during stage 1.

You haven't CC'd the original patch author (chang jc) to give them a
chance to comment on your proposed changes to the patch.

The attached PDF on PR libstdc++/83938 has extensive discussion of the
performance issue, but I don't see any for your version. Some
performance benchmarks for your version would be helpful.


Here is this patch again.

This time it is much closer to John's original one, I just kept my 
change on the size of the temporary buffer which doesn't need to be as 
large as it is currently allocated, especially with John's patch.


The performance tests are showing the good improvements, attached are 
the before/after. Surprisingly I do not see any change regarding 
allocated memory, I guess measures are not accurate enough. However 
working on them also show that current implementation is fragile. If you 
reduce the allowed allocated memory too much you end up with a stack 
overflow because of the recursive implementation.


I already have a patch to keep on trying to allocate memory as long as 
above a given threshold rather than 0 but I am also working on making 
implementation non-recursive. We'll see if even a buffer of size 1 is 
better than no buffer at all then.


    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__merge_adaptive): When buffer too small consider smallest range first
    and cut based on buffer size.
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
    to reduce requested buffer length on allocation failure.
    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
    execution with different level of memory.

François

inplace_merge.ccreverse   19r   19u
0s48mem0pf 
inplace_merge.ccforwards   0r0u
0s 0mem0pf 
inplace_merge.ccrandom18r   18u
0s 0mem0pf 
inplace_merge.ccreverse   10r   10u
0s 0mem0pf 
inplace_merge.ccforwards   8r8u
0s 0mem0pf 
inplace_merge.ccrandom22r   22u
0s 0mem0pf 
inplace_merge.ccbench 1/2 memory1891r 1884u
7s   928mem0pf 
inplace_merge.ccreverse   19r   18u
0s 0mem0pf 
inplace_merge.ccforwards   0r0u
0s 0mem0pf 
inplace_merge.ccrandom18r   18u
0s 0mem0pf 
inplace_merge.ccreverse   10r   10u
0s 0mem0pf 
inplace_merge.ccforwards   3r3u
0s 0mem0pf 
inplace_merge.ccrandom22r   22u
0s 0mem0pf 
inplace_merge.ccbench 1/4 memory1867r 1861u
6s   192mem0pf 
inplace_merge.ccreverse   17r   17u
0s 0mem0pf 
inplace_merge.ccforwards   0r0u
0s 0mem0pf 
inplace_merge.ccrandom25r   24u
0s 0mem0pf 
inplace_merge.ccreverse   28r   28u
0s 0mem0pf 
inplace_merge.ccforwards   0r0u
0s 0mem0pf 
inplace_merge.ccrandom   289r  289u
0s 0mem0pf 
inplace_merge.ccbench no memory 2155r 2149u
6s   192mem0pf 
stable_sort.cc  reverse  240r  240u
1s 0mem0pf 
stable_sort.cc  forwards 205r  205u
0s 0mem0pf 
stable_sort.cc  random   493r  493u
0s 0mem0pf 
stable_sort.cc  bench full memory945r  939u
5s   752mem0pf 
stable_sort.cc  reverse  236r  236u
0s 0mem0pf 
stable_sort.cc  forwards 199r  199u
0s 0mem0pf 
stable_sort.cc  random   487r  487u
0s 0mem0pf 
stable_sort.cc  bench 1/2 memory  

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

2019-07-16 Thread François Dumont

Hi

    I eventually spent much more time working on the inplace_merge 
performance bench.


    And the results do not confirm the theory:

Before patch:
inplace_merge.cc     bench 1 / 1 memory        243r  227u   
17s  1216mem    5pf
inplace_merge.cc     bench 1 / 4 memory        297r 278u   
18s   480mem    0pf
inplace_merge.cc     bench 1 /64 memory       373r 354u   
18s   480mem    0pf
inplace_merge.cc     bench 0 / 1 memory    12r 11u 
0s   480mem    0pf


After the patch to reduce memory allocation:
inplace_merge.cc     bench 1 / 1 memory        245r  227u   
18s  1216mem    0pf
inplace_merge.cc     bench 1 / 4 memory        292r 273u   
19s   480mem    0pf
inplace_merge.cc     bench 1 /64 memory       373r 356u   
17s   480mem    0pf
inplace_merge.cc     bench 0 / 1 memory    11r 11u 
0s   480mem    0pf


With the __len1 > __len2 condition change:
inplace_merge.cc     bench 1 / 1 memory        244r  225u   
20s  1216mem    0pf
inplace_merge.cc     bench 1 / 4 memory        379r 361u   
17s   480mem    0pf
inplace_merge.cc     bench 1 /64 memory        640r 625u   
16s   480mem    0pf
inplace_merge.cc     bench 0 / 1 memory         11r 11u    
0s   480mem    0pf


When there is no memory limit the results are identical of course. 
Otherwise as soon as memory is limited performance starts to decrease 
with the condition change on __len1 vs __len2.


Could you give the bench you use to demonstrate the enhancement ? I also 
wonder why your patch doesn't change consistently the same condition in 
__merge_without_buffer ?


For the moment I'd like to propose the attached patch that is to say the 
reduction on the amount of allocated memory and the new/modified benches.


Note that as soon as I forbid any memory allocation I also reduce the 
size of the range to merge cause the implementation rely on recursivity 
and so could end-up in a stack overflow. Maybe I need to check for 
simulator targets like several other tests ? Unless simulators do not 
run the performance tests ?


Regarding this stack overflow issue, is there some routine to find out 
how many levels of function calls can be added before reaching the stack 
overflow ? I could perhaps call __builtin_alloca and check the result 
but that smells. If I could find out this we could fallback on an 
iterative approach to complete the merge.


    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
    computation in the loop.
    * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
    pivot positions.
    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
    execution with different memory constraints.

Ok to commit ?

François

On 6/9/19 4:27 PM, François Dumont wrote:

On 12/21/18 9:57 PM, Jonathan Wakely wrote:

On 29/10/18 07:06 +0100, François Dumont wrote:

Hi

    Some feedback regarding this patch ?


Sorry this got missed, please resubmit during stage 1.

You haven't CC'd the original patch author (chang jc) to give them a
chance to comment on your proposed changes to the patch.

The attached PDF on PR libstdc++/83938 has extensive discussion of the
performance issue, but I don't see any for your version. Some
performance benchmarks for your version would be helpful.


Here is this patch again.

This time it is much closer to John's original one, I just kept my 
change on the size of the temporary buffer which doesn't need to be as 
large as it is currently allocated, especially with John's patch.


The performance tests are showing the good improvements, attached are 
the before/after. Surprisingly I do not see any change regarding 
allocated memory, I guess measures are not accurate enough. However 
working on them also show that current implementation is fragile. If 
you reduce the allowed allocated memory too much you end up with a 
stack overflow because of the recursive implementation.


I already have a patch to keep on trying to allocate memory as long as 
above a given threshold rather than 0 but I am also working on making 
implementation non-recursive. We'll see if even a buffer of size 1 is 
better than no buffer at all then.


    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__merge_adaptive): When buffer too small consider smallest range 
first

    and cut based on buffer size.
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
    to reduce requested buffer length on all

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

2018-12-21 Thread Jonathan Wakely

On 29/10/18 07:06 +0100, François Dumont wrote:

Hi

    Some feedback regarding this patch ?


Sorry this got missed, please resubmit during stage 1.

You haven't CC'd the original patch author (chang jc) to give them a
chance to comment on your proposed changes to the patch.

The attached PDF on PR libstdc++/83938 has extensive discussion of the
performance issue, but I don't see any for your version. Some
performance benchmarks for your version would be helpful.





Thanks,
François

On 8/21/18 10:34 PM, François Dumont wrote:
I missed a test that was failing because of this patch. So I revert 
a small part of it and here is the new proposal.


Tested under Linux x86_64, ok to commit ?

François


On 24/07/2018 12:22, François Dumont wrote:

Hi

    Any chance to review this patch ?

François


On 06/06/2018 18:39, François Dumont wrote:

Hi

    I review and rework this proposal. I noticed that the same 
idea to limit buffer size within inplace_merge also apply to 
stable_sort.


    I also change the decision when buffer is too small to 
consider the buffer size rather than going through successive 
cuts of the original ranges. This way we won't cut the range 
more than necessary. Note that if you prefer this second part 
could be committed seperatly.


    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__stable_partition_adaptive): When buffer to small cut range using
    buffer size.
    (__inplace_merge): Take temporary buffer length from 
smallest range.
    (__merge_adaptive): When buffer too small consider smallest 
range first

    and cut based on buffer size.
    (__stable_sort_adaptive): When buffer too small cut based on buffer
    size.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change 
function

    to reduce requested buffer length on allocation failure.

Tested under Linux x86_64.

Ok to commit ?

François


On 25/01/2018 23:37, chang jc wrote:

Hi:

1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
__len = (__len == 1) ? 0 : ((__len + 1) / 2);

2. The coding gain has been shown  PR c++/83938; I re-post here




   21
   20
   19
   18
   17
   16


   0.471136
   0.625695
   0.767262
   0.907461
   1.04838
   1.19508


   0.340845
   0.48651
   0.639139
   0.770133
   0.898454
   1.04632

it means when Merge [0, 4325376, 16777216); A is a sorted integer with
4325376 & B with 12451840 elements, total with 16M integers

The proposed method has the speed up under given buffer size =, ex
2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
given sizof(int)*64K bytes.

3. As your suggestion, _TmpBuf __buf should be rewrite.

4. It represents a fact that the intuitive idea to split from larger
part is wrong.

For example, if you have an input sorted array A & B, A has 8 integers
& B has 24 integers. Given tmp buffer whose capacity = 4 integers.

Current it tries to split from B, right?

Then we have:

A1 | A2  B1 | B2

B1 & B2 has 12 integers each, right?

Current algorithm selects pivot as 13th integer from B, right?

If the corresponding upper bound of A is 6th integer.

Then it split in

A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12

After rotate, we have two arrays to merge

[A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]

Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.

Sadly, [A1 = 5 | B1 = 12] CANNOT.

So we do rotate again, split & merge the two split arrays from [A1 = 5
| B1 = 12] again.


But wait, if we always split from the smaller one instead of 
larger one.


After rotate, it promises two split arrays both contain 
ceiling[small/2].


Since tmp buffer also allocate by starting from sizeof(small) &
recursively downgrade by ceiling[small/2^(# of fail allocate)].

It means the allocated tmp buffer promises to be sufficient at the
level of (# of fail allocate).

Instead, you can see if split from large at level (# of fail allocate)
several split array still CANNOT use  tmp buffer to do buffered merge.


As you know, buffered merge is far speed then (split, rotate, and
merge two sub-arrays) (PR c++/83938 gives the profiling figures),

the way should provide speedup.


Thanks.










On 24/01/2018 18:23, François Dumont wrote:

Hi


 It sounds like a very sensitive change to make but 
nothing worth figures.

Do you have any bench showing the importance of the gain ?

 At least the memory usage optimization is obvious.

On 19/01/2018 10:43, chang jc wrote:

Current std::inplace_merge() suffers from performance issue by 
inefficient


logic under limited memory,

It leads to performance downgrade.

Please help to review it.

Index: include/bits/stl_algo.h
===
--- include/bits/stl_algo.h    (revision 256871)
+++ include/bits/stl_algo.h    (working copy)
@@ -2437,7 +2437,7 @@
 _BidirectionalIterator __second_cut = __middle;
 _Distance __len11 = 0;
 _Distance __len22 = 0;
-  if (__len1 

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)#

2020-11-19 Thread Jonathan Wakely via Gcc-patches

On 16/07/19 18:40 +0200, François Dumont wrote:

Hi

    I eventually spent much more time working on the inplace_merge 
performance bench.


    And the results do not confirm the theory:

Before patch:
inplace_merge.cc             bench 1 / 1 memory            243r  
227u   17s      1216mem    5pf
inplace_merge.cc             bench 1 / 4 memory            297r 278u   
18s       480mem    0pf
inplace_merge.cc             bench 1 /64 memory           373r 354u   
18s       480mem    0pf
inplace_merge.cc             bench 0 / 1 memory            12r 11u     
0s       480mem    0pf


After the patch to reduce memory allocation:
inplace_merge.cc             bench 1 / 1 memory            245r  
227u   18s      1216mem    0pf
inplace_merge.cc             bench 1 / 4 memory            292r 273u   
19s       480mem    0pf
inplace_merge.cc             bench 1 /64 memory           373r 356u   
17s       480mem    0pf
inplace_merge.cc             bench 0 / 1 memory            11r 11u     
0s       480mem    0pf


With the __len1 > __len2 condition change:
inplace_merge.cc             bench 1 / 1 memory            244r  
225u   20s      1216mem    0pf
inplace_merge.cc             bench 1 / 4 memory            379r 361u   
17s       480mem    0pf
inplace_merge.cc             bench 1 /64 memory            640r 625u   
16s       480mem    0pf
inplace_merge.cc             bench 0 / 1 memory             11r 11u    
0s       480mem    0pf


When there is no memory limit the results are identical of course. 
Otherwise as soon as memory is limited performance starts to decrease 
with the condition change on __len1 vs __len2.


Could you give the bench you use to demonstrate the enhancement ? I 
also wonder why your patch doesn't change consistently the same 
condition in __merge_without_buffer ?


For the moment I'd like to propose the attached patch that is to say 
the reduction on the amount of allocated memory and the new/modified 
benches.


Note that as soon as I forbid any memory allocation I also reduce the 
size of the range to merge cause the implementation rely on 
recursivity and so could end-up in a stack overflow. Maybe I need to 
check for simulator targets like several other tests ? Unless 
simulators do not run the performance tests ?


The performance tests are never run by default.

I don't think we should spend too much time caring about performance
of sorting close to Out Of Memory conditions. We don't try to optimize
std::vector or other cases to work when close to OOM.

So I think just reducing the memory usage is the right approach here.

Regarding this stack overflow issue, is there some routine to find out 
how many levels of function calls can be added before reaching the 
stack overflow ? I could perhaps call __builtin_alloca and check the 
result but that smells. If I could find out this we could fallback on 
an iterative approach to complete the merge.


No, alloca is inherently unsafe. Please don't even consider it :-)



    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
    computation in the loop.


Please use "Change __len computation in the loop to avoid truncation."

It's a bit more descriptive.


    * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
    pivot positions.
    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
    execution with different memory constraints.

Ok to commit ?

François

On 6/9/19 4:27 PM, François Dumont wrote:

On 12/21/18 9:57 PM, Jonathan Wakely wrote:

On 29/10/18 07:06 +0100, François Dumont wrote:

Hi

    Some feedback regarding this patch ?


Sorry this got missed, please resubmit during stage 1.

You haven't CC'd the original patch author (chang jc) to give them a
chance to comment on your proposed changes to the patch.

The attached PDF on PR libstdc++/83938 has extensive discussion of the
performance issue, but I don't see any for your version. Some
performance benchmarks for your version would be helpful.


Here is this patch again.

This time it is much closer to John's original one, I just kept my 
change on the size of the temporary buffer which doesn't need to be 
as large as it is currently allocated, especially with John's patch.


The performance tests are showing the good improvements, attached 
are the before/after.

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

2018-10-28 Thread François Dumont

Hi

    Some feedback regarding this patch ?

Thanks,
François

On 8/21/18 10:34 PM, François Dumont wrote:
I missed a test that was failing because of this patch. So I revert a 
small part of it and here is the new proposal.


Tested under Linux x86_64, ok to commit ?

François


On 24/07/2018 12:22, François Dumont wrote:

Hi

    Any chance to review this patch ?

François


On 06/06/2018 18:39, François Dumont wrote:

Hi

    I review and rework this proposal. I noticed that the same idea 
to limit buffer size within inplace_merge also apply to stable_sort.


    I also change the decision when buffer is too small to consider 
the buffer size rather than going through successive cuts of the 
original ranges. This way we won't cut the range more than 
necessary. Note that if you prefer this second part could be 
committed seperatly.


    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__stable_partition_adaptive): When buffer to small cut range using
    buffer size.
    (__inplace_merge): Take temporary buffer length from smallest 
range.
    (__merge_adaptive): When buffer too small consider smallest 
range first

    and cut based on buffer size.
    (__stable_sort_adaptive): When buffer too small cut based on buffer
    size.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change 
function

    to reduce requested buffer length on allocation failure.

Tested under Linux x86_64.

Ok to commit ?

François


On 25/01/2018 23:37, chang jc wrote:

Hi:

1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
__len = (__len == 1) ? 0 : ((__len + 1) / 2);

2. The coding gain has been shown  PR c++/83938; I re-post here




   21
   20
   19
   18
   17
   16


   0.471136
   0.625695
   0.767262
   0.907461
   1.04838
   1.19508


   0.340845
   0.48651
   0.639139
   0.770133
   0.898454
   1.04632

it means when Merge [0, 4325376, 16777216); A is a sorted integer with
4325376 & B with 12451840 elements, total with 16M integers

The proposed method has the speed up under given buffer size =, ex
2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
given sizof(int)*64K bytes.

3. As your suggestion, _TmpBuf __buf should be rewrite.

4. It represents a fact that the intuitive idea to split from larger
part is wrong.

For example, if you have an input sorted array A & B, A has 8 integers
& B has 24 integers. Given tmp buffer whose capacity = 4 integers.

Current it tries to split from B, right?

Then we have:

A1 | A2  B1 | B2

B1 & B2 has 12 integers each, right?

Current algorithm selects pivot as 13th integer from B, right?

If the corresponding upper bound of A is 6th integer.

Then it split in

A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12

After rotate, we have two arrays to merge

[A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]

Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.

Sadly, [A1 = 5 | B1 = 12] CANNOT.

So we do rotate again, split & merge the two split arrays from [A1 = 5
| B1 = 12] again.


But wait, if we always split from the smaller one instead of larger 
one.


After rotate, it promises two split arrays both contain 
ceiling[small/2].


Since tmp buffer also allocate by starting from sizeof(small) &
recursively downgrade by ceiling[small/2^(# of fail allocate)].

It means the allocated tmp buffer promises to be sufficient at the
level of (# of fail allocate).

Instead, you can see if split from large at level (# of fail allocate)
several split array still CANNOT use  tmp buffer to do buffered merge.


As you know, buffered merge is far speed then (split, rotate, and
merge two sub-arrays) (PR c++/83938 gives the profiling figures),

the way should provide speedup.


Thanks.










On 24/01/2018 18:23, François Dumont wrote:

Hi


 It sounds like a very sensitive change to make but nothing 
worth figures.

Do you have any bench showing the importance of the gain ?

 At least the memory usage optimization is obvious.

On 19/01/2018 10:43, chang jc wrote:

Current std::inplace_merge() suffers from performance issue by 
inefficient


logic under limited memory,

It leads to performance downgrade.

Please help to review it.

Index: include/bits/stl_algo.h
===
--- include/bits/stl_algo.h    (revision 256871)
+++ include/bits/stl_algo.h    (working copy)
@@ -2437,7 +2437,7 @@
 _BidirectionalIterator __second_cut = __middle;
 _Distance __len11 = 0;
 _Distance __len22 = 0;
-  if (__len1 > __len2)
+  if (__len1 < __len2)
   {
 __len11 = __len1 / 2;
 std::advance(__first_cut, __len11);
@@ -2539,9 +2539,15 @@
 const _DistanceType __len1 = std::distance(__first, 
__middle);

 const _DistanceType __len2 = std::distance(__middle, __last);

+

 typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
_TmpBuf;

-  _TmpBuf __buf(__first, __last);
-
+

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

2018-08-21 Thread François Dumont
I missed a test that was failing because of this patch. So I revert a 
small part of it and here is the new proposal.


Tested under Linux x86_64, ok to commit ?

François


On 24/07/2018 12:22, François Dumont wrote:

Hi

    Any chance to review this patch ?

François


On 06/06/2018 18:39, François Dumont wrote:

Hi

    I review and rework this proposal. I noticed that the same idea 
to limit buffer size within inplace_merge also apply to stable_sort.


    I also change the decision when buffer is too small to consider 
the buffer size rather than going through successive cuts of the 
original ranges. This way we won't cut the range more than necessary. 
Note that if you prefer this second part could be committed seperatly.


    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__stable_partition_adaptive): When buffer to small cut range using
    buffer size.
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__merge_adaptive): When buffer too small consider smallest range 
first

    and cut based on buffer size.
    (__stable_sort_adaptive): When buffer too small cut based on buffer
    size.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
    to reduce requested buffer length on allocation failure.

Tested under Linux x86_64.

Ok to commit ?

François


On 25/01/2018 23:37, chang jc wrote:

Hi:

1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
__len = (__len == 1) ? 0 : ((__len + 1) / 2);

2. The coding gain has been shown  PR c++/83938; I re-post here




   21
   20
   19
   18
   17
   16


   0.471136
   0.625695
   0.767262
   0.907461
   1.04838
   1.19508


   0.340845
   0.48651
   0.639139
   0.770133
   0.898454
   1.04632

it means when Merge [0, 4325376, 16777216); A is a sorted integer with
4325376 & B with 12451840 elements, total with 16M integers

The proposed method has the speed up under given buffer size =, ex
2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
given sizof(int)*64K bytes.

3. As your suggestion, _TmpBuf __buf should be rewrite.

4. It represents a fact that the intuitive idea to split from larger
part is wrong.

For example, if you have an input sorted array A & B, A has 8 integers
& B has 24 integers. Given tmp buffer whose capacity = 4 integers.

Current it tries to split from B, right?

Then we have:

A1 | A2  B1 | B2

B1 & B2 has 12 integers each, right?

Current algorithm selects pivot as 13th integer from B, right?

If the corresponding upper bound of A is 6th integer.

Then it split in

A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12

After rotate, we have two arrays to merge

[A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]

Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.

Sadly, [A1 = 5 | B1 = 12] CANNOT.

So we do rotate again, split & merge the two split arrays from [A1 = 5
| B1 = 12] again.


But wait, if we always split from the smaller one instead of larger 
one.


After rotate, it promises two split arrays both contain 
ceiling[small/2].


Since tmp buffer also allocate by starting from sizeof(small) &
recursively downgrade by ceiling[small/2^(# of fail allocate)].

It means the allocated tmp buffer promises to be sufficient at the
level of (# of fail allocate).

Instead, you can see if split from large at level (# of fail allocate)
several split array still CANNOT use  tmp buffer to do buffered merge.


As you know, buffered merge is far speed then (split, rotate, and
merge two sub-arrays) (PR c++/83938 gives the profiling figures),

the way should provide speedup.


Thanks.










On 24/01/2018 18:23, François Dumont wrote:

Hi


 It sounds like a very sensitive change to make but nothing 
worth figures.

Do you have any bench showing the importance of the gain ?

 At least the memory usage optimization is obvious.

On 19/01/2018 10:43, chang jc wrote:

Current std::inplace_merge() suffers from performance issue by 
inefficient


logic under limited memory,

It leads to performance downgrade.

Please help to review it.

Index: include/bits/stl_algo.h
===
--- include/bits/stl_algo.h    (revision 256871)
+++ include/bits/stl_algo.h    (working copy)
@@ -2437,7 +2437,7 @@
 _BidirectionalIterator __second_cut = __middle;
 _Distance __len11 = 0;
 _Distance __len22 = 0;
-  if (__len1 > __len2)
+  if (__len1 < __len2)
   {
 __len11 = __len1 / 2;
 std::advance(__first_cut, __len11);
@@ -2539,9 +2539,15 @@
 const _DistanceType __len1 = std::distance(__first, __middle);
 const _DistanceType __len2 = std::distance(__middle, __last);

+

 typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
_TmpBuf;

-  _TmpBuf __buf(__first, __last);
-
+  _BidirectionalIterator __start, __end;
+  if (__len1 < __len2) {
+    __start = __first; __end = __middle