Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic (PR libstdc++/83938)
Hi

Any chance to review this patch ?

François

On 06/06/2018 18:39, François Dumont wrote:

Hi

I reviewed and reworked this proposal. I noticed that the same idea of limiting the buffer size within inplace_merge also applies to stable_sort.

I also changed the decision taken when the buffer is too small: cut based on the buffer size rather than going through successive cuts of the original ranges. This way we won't cut the ranges more than necessary. Note that, if you prefer, this second part could be committed separately.

    PR libstdc++/83938
    * include/bits/stl_algo.h (__stable_partition_adaptive): When buffer
    too small, cut range using buffer size.
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__merge_adaptive): When buffer too small, consider smallest range
    first and cut based on buffer size.
    (__stable_sort_adaptive): When buffer too small, cut based on buffer
    size.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
    to reduce requested buffer length on allocation failure.

Tested under Linux x86_64. Ok to commit ?

François

On 25/01/2018 23:37, chang jc wrote:

Hi:

1. The __len = (__len + 1) / 2; needs, as you suggested, to be modified to __len = (__len == 1) ? 0 : ((__len + 1) / 2);

2. The coding gain has been shown in PR c++/83938; I re-post it here:

    log2(buffer size)   21         20         19         18         17         16
                        0.471136   0.625695   0.767262   0.907461   1.04838    1.19508
                        0.340845   0.48651    0.639139   0.770133   0.898454   1.04632

This means that when merging [0, 4325376, 16777216), i.e. A is a sorted integer range with 4325376 elements and B one with 12451840 elements, 16M integers in total, the proposed method gives a speed-up for a given buffer size of 2^16, 2^17, ... 2^21 in units of sizeof(int); for example, 2^16 means sizeof(int) * 64K bytes.

3. As you suggest, _TmpBuf __buf should be rewritten.

4. It shows that the intuitive idea of splitting from the larger part is wrong. For example, suppose the input is two sorted arrays A & B, where A has 8 integers and B has 24 integers.
Given a tmp buffer whose capacity is 4 integers, the current code tries to split from B, right? Then we have:

    A1 | A2    B1 | B2

where B1 & B2 have 12 integers each, right? The current algorithm selects the pivot as the 13th integer of B, right? Suppose the corresponding upper bound in A is its 6th integer. Then it splits into A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12. After the rotate, we have two ranges to merge, [A1 = 5 | B1 = 12] and [A2 = 3 | B2 = 12]. Great, [A2 = 3 | B2 = 12] can use the tmp buffer to merge. Sadly, [A1 = 5 | B1 = 12] CANNOT. So we rotate again, splitting and merging the two sub-ranges of [A1 = 5 | B1 = 12] once more.

But wait: if we always split from the smaller range instead of the larger one, then after the rotate both sub-ranges are guaranteed to contain at most ceiling[small/2] elements of the smaller side. Since the tmp buffer allocation also starts from sizeof(small) and is halved on each failure down to ceiling[small/2^(# of failed allocations)], the allocated tmp buffer is guaranteed to be sufficient at recursion level (# of failed allocations). By contrast, when splitting from the larger range, at level (# of failed allocations) several sub-ranges still CANNOT use the tmp buffer for a buffered merge. As you know, a buffered merge is far faster than (split, rotate, and merge two sub-ranges) (PR c++/83938 gives the profiling figures), so this approach should provide a speedup. Thanks.

On 24/01/2018 18:23, François Dumont wrote:

Hi

It sounds like a very sensitive change to make, and nothing beats figures. Do you have any bench showing the importance of the gain ? At least the memory usage optimization is obvious.

On 19/01/2018 10:43, chang jc wrote:

The current std::inplace_merge() suffers from inefficient logic under limited memory, which leads to degraded performance. Please help to review it.
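The split-from-the-smaller-range scheme argued for above can be sketched as a stand-alone rotation-based merge. This is illustrative only: the function name and the degenerate-case guard are mine, not libstdc++'s (a naive flip of the condition alone can fail to make progress when the smaller range has one element, so the guard falls back to cutting the other side in that case).

```cpp
#include <algorithm>
#include <iterator>

// Rotation-based merge of the sorted ranges [first, middle) and
// [middle, last), cutting the *smaller* range in half at each step.
template<typename It>
void inplace_merge_sketch(It first, It middle, It last)
{
  typedef typename std::iterator_traits<It>::difference_type Dist;
  Dist len1 = std::distance(first, middle);
  Dist len2 = std::distance(middle, last);
  if (len1 == 0 || len2 == 0)
    return;
  if (len1 + len2 == 2)
    {
      if (*middle < *first)
        std::iter_swap(first, middle);
      return;
    }
  It first_cut = first;
  It second_cut = middle;
  Dist len11 = 0, len22 = 0;
  // Cut the smaller side, unless it holds a single element; then cut the
  // other side so every recursion step strictly shrinks the problem.
  const bool cut_left = (len1 < len2) ? (len1 > 1) : (len2 == 1);
  if (cut_left)
    {
      len11 = len1 / 2;                 // pivot inside the left range
      std::advance(first_cut, len11);
      second_cut = std::lower_bound(middle, last, *first_cut);
      len22 = std::distance(middle, second_cut);
    }
  else
    {
      len22 = len2 / 2;                 // pivot inside the right range
      std::advance(second_cut, len22);
      first_cut = std::upper_bound(first, middle, *second_cut);
      len11 = std::distance(first, first_cut);
    }
  // Bring the two sub-merges into contiguous position, then recurse.
  It new_middle = std::rotate(first_cut, middle, second_cut);
  inplace_merge_sketch(first, first_cut, new_middle);
  inplace_merge_sketch(new_middle, second_cut, last);
}
```

With A of 8 elements and B of 24 as in the example above, the first cut now lands in A, so each recursive sub-merge carries at most 4 elements of the smaller side.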
Index: include/bits/stl_algo.h
===
--- include/bits/stl_algo.h	(revision 256871)
+++ include/bits/stl_algo.h	(working copy)
@@ -2437,7 +2437,7 @@
       _BidirectionalIterator __second_cut = __middle;
       _Distance __len11 = 0;
       _Distance __len22 = 0;
-      if (__len1 > __len2)
+      if (__len1 < __len2)
 	{
 	  __len11 = __len1 / 2;
 	  std::advance(__first_cut, __len11);
@@ -2539,9 +2539,15 @@
       const _DistanceType __len1 = std::distance(__first, __middle);
       const _DistanceType __len2 = std::distance(__middle, __last);
+      typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
-      _TmpBuf __buf(__first, __last);
-
+      _BidirectionalIterator __start, __end;
+      if (__len1 < __len2) {
+	__start = __first; __end = __middle;
+      } else {
+	__start = __middle; __end = __last;
+      }
+      _TmpBuf __buf(__start, __end);

Note another optimization, we could introduce a _Temporary_buffer<> constructor in order to write: _TmpBuf __
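For illustration, here is roughly what the buffered path looks like once the scratch buffer is sized from the smaller range, as the second hunk above arranges. This is a simplified sketch with my own names, using std::vector in place of _Temporary_buffer; the merge loops are hand-written because std::merge does not allow its output to overlap an input range.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Merge sorted [first, middle) and [middle, last) in place, with a
// scratch buffer that only needs to hold the smaller of the two ranges.
// Forward merge when the left range is smaller, backward otherwise.
template<typename It>
void buffered_merge_smaller(It first, It middle, It last)
{
  typedef typename std::iterator_traits<It>::value_type Value;
  std::vector<Value> buf;               // stand-in for _Temporary_buffer
  if (std::distance(first, middle) <= std::distance(middle, last))
    {
      buf.assign(first, middle);        // copy the smaller left range out
      typename std::vector<Value>::iterator b = buf.begin(), e = buf.end();
      It out = first;
      while (b != e && middle != last)  // stable: ties favour the left range
        *out++ = (*middle < *b) ? *middle++ : *b++;
      std::copy(b, e, out);             // whatever is left in the buffer
    }
  else
    {
      buf.assign(middle, last);         // copy the smaller right range out
      typename std::vector<Value>::reverse_iterator b = buf.rbegin(),
                                                    e = buf.rend();
      std::reverse_iterator<It> in(middle), in_end(first), out(last);
      while (b != e && in != in_end)    // merge from the back, largest first
        *out++ = (*b < *in) ? *in++ : *b++;
      std::copy(b, e, out);             // remaining buffered elements
    }
}
```

In both branches the write cursor can be shown never to overtake the read cursor of the range left in place, which is why the in-place merge is safe.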
Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic (PR libstdc++/83938)
On 12/21/18 9:57 PM, Jonathan Wakely wrote:

On 29/10/18 07:06 +0100, François Dumont wrote:
Hi
Some feedback regarding this patch ?

Sorry this got missed, please resubmit during stage 1.

You haven't CC'd the original patch author (chang jc) to give them a chance to comment on your proposed changes to the patch.

The attached PDF on PR libstdc++/83938 has extensive discussion of the performance issue, but I don't see any for your version. Some performance benchmarks for your version would be helpful.

Here is this patch again. This time it is much closer to John's original one; I just kept my change to the size of the temporary buffer, which doesn't need to be as large as it is currently allocated, especially with John's patch.

The performance tests show good improvements; attached are the before/after results. Surprisingly, I do not see any change in allocated memory; I guess the measures are not accurate enough.

However, working on them also showed that the current implementation is fragile. If you reduce the allowed allocated memory too much, you end up with a stack overflow because of the recursive implementation. I already have a patch to keep trying to allocate memory as long as the request is above a given threshold rather than 0, but I am also working on making the implementation non-recursive. We'll see then whether even a buffer of size 1 is better than no buffer at all.

    PR libstdc++/83938
    * include/bits/stl_algo.h (__merge_adaptive): When buffer too small,
    consider smallest range first and cut based on buffer size.
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
    to reduce requested buffer length on allocation failure.
    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
    execution with different levels of memory.
François

inplace_merge.cc  reverse             19r    19u  0s   48mem  0pf
inplace_merge.cc  forwards             0r     0u  0s    0mem  0pf
inplace_merge.cc  random              18r    18u  0s    0mem  0pf
inplace_merge.cc  reverse             10r    10u  0s    0mem  0pf
inplace_merge.cc  forwards             8r     8u  0s    0mem  0pf
inplace_merge.cc  random              22r    22u  0s    0mem  0pf
inplace_merge.cc  bench 1/2 memory  1891r  1884u  7s  928mem  0pf
inplace_merge.cc  reverse             19r    18u  0s    0mem  0pf
inplace_merge.cc  forwards             0r     0u  0s    0mem  0pf
inplace_merge.cc  random              18r    18u  0s    0mem  0pf
inplace_merge.cc  reverse             10r    10u  0s    0mem  0pf
inplace_merge.cc  forwards             3r     3u  0s    0mem  0pf
inplace_merge.cc  random              22r    22u  0s    0mem  0pf
inplace_merge.cc  bench 1/4 memory  1867r  1861u  6s  192mem  0pf
inplace_merge.cc  reverse             17r    17u  0s    0mem  0pf
inplace_merge.cc  forwards             0r     0u  0s    0mem  0pf
inplace_merge.cc  random              25r    24u  0s    0mem  0pf
inplace_merge.cc  reverse             28r    28u  0s    0mem  0pf
inplace_merge.cc  forwards             0r     0u  0s    0mem  0pf
inplace_merge.cc  random             289r   289u  0s    0mem  0pf
inplace_merge.cc  bench no memory   2155r  2149u  6s  192mem  0pf
stable_sort.cc    reverse            240r   240u  1s    0mem  0pf
stable_sort.cc    forwards           205r   205u  0s    0mem  0pf
stable_sort.cc    random             493r   493u  0s    0mem  0pf
stable_sort.cc    bench full memory  945r   939u  5s  752mem  0pf
stable_sort.cc    reverse            236r   236u  0s    0mem  0pf
stable_sort.cc    forwards           199r   199u  0s    0mem  0pf
stable_sort.cc    random             487r   487u  0s    0mem  0pf
stable_sort.cc    bench 1/2 memory
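As a side note on the __stable_sort entry in the ChangeLog above: a top-down merge sort only ever buffer-merges a sorted half into its sibling, so a scratch buffer of (n + 1) / 2 elements is enough for the whole sort. A hedged sketch of that idea, with my own simplified names and std::vector standing in for the real temporary buffer:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Merge the sorted left half (copied into buf) with the sorted right
// half, writing back into [first, last). Hand-written loop, since
// std::merge forbids an output overlapping its inputs.
template<typename It, typename Buf>
void merge_halves(It first, It middle, It last, Buf& buf)
{
  buf.assign(first, middle);            // left half fits in the buffer
  typename Buf::iterator b = buf.begin(), e = buf.end();
  It in = middle, out = first;
  while (b != e && in != last)          // stable: ties favour the left half
    *out++ = (*in < *b) ? *in++ : *b++;
  std::copy(b, e, out);                 // leftovers from the buffer
}

template<typename It, typename Buf>
void stable_sort_impl(It first, It last, Buf& buf)
{
  typename std::iterator_traits<It>::difference_type len
    = std::distance(first, last);
  if (len < 2)
    return;
  It middle = first;
  std::advance(middle, len / 2);
  stable_sort_impl(first, middle, buf);
  stable_sort_impl(middle, last, buf);
  merge_halves(first, middle, last, buf);
}

// Stable merge sort whose temporary buffer holds only half of the input.
template<typename It>
void stable_sort_sketch(It first, It last)
{
  typedef typename std::iterator_traits<It>::value_type Value;
  std::vector<Value> buf;
  buf.reserve((std::distance(first, last) + 1) / 2);
  stable_sort_sketch_check: ;
  stable_sort_impl(first, last, buf);
}
```

Every recursive merge works on a subrange no larger than the current one, so the top-level half-size reservation is never exceeded.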
Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic (PR libstdc++/83938)
Hi

I eventually spent much more time working on the inplace_merge performance bench. And the results do not confirm the theory.

Before patch:

inplace_merge.cc  bench 1 / 1 memory  243r  227u  17s  1216mem  5pf
inplace_merge.cc  bench 1 / 4 memory  297r  278u  18s   480mem  0pf
inplace_merge.cc  bench 1 /64 memory  373r  354u  18s   480mem  0pf
inplace_merge.cc  bench 0 / 1 memory   12r   11u   0s   480mem  0pf

After the patch to reduce memory allocation:

inplace_merge.cc  bench 1 / 1 memory  245r  227u  18s  1216mem  0pf
inplace_merge.cc  bench 1 / 4 memory  292r  273u  19s   480mem  0pf
inplace_merge.cc  bench 1 /64 memory  373r  356u  17s   480mem  0pf
inplace_merge.cc  bench 0 / 1 memory   11r   11u   0s   480mem  0pf

With the __len1 > __len2 condition change:

inplace_merge.cc  bench 1 / 1 memory  244r  225u  20s  1216mem  0pf
inplace_merge.cc  bench 1 / 4 memory  379r  361u  17s   480mem  0pf
inplace_merge.cc  bench 1 /64 memory  640r  625u  16s   480mem  0pf
inplace_merge.cc  bench 0 / 1 memory   11r   11u   0s   480mem  0pf

When there is no memory limit the results are identical, of course. Otherwise, as soon as memory is limited, performance starts to decrease with the condition change on __len1 vs __len2. Could you share the bench you used to demonstrate the enhancement ?

I also wonder why your patch doesn't change the same condition consistently in __merge_without_buffer.

For the moment I'd like to propose the attached patch, that is to say the reduction of the amount of allocated memory and the new/modified benches.

Note that as soon as I forbid any memory allocation I also reduce the size of the range to merge, because the implementation relies on recursion and so could end up in a stack overflow. Maybe I need to check for simulator targets like several other tests do ? Unless simulators do not run the performance tests ?

Regarding this stack overflow issue, is there some routine to find out how many levels of function calls can be added before reaching a stack overflow ? I could perhaps call __builtin_alloca and check the result, but that smells.
If I could find that out, we could fall back on an iterative approach to complete the merge.

    PR libstdc++/83938
    * include/bits/stl_algo.h (__inplace_merge): Take temporary buffer
    length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
    computation in the loop.
    * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all
    possible pivot positions.
    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
    execution with different memory constraints.

Ok to commit ?

François
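The __len computation discussed for get_temporary_buffer concerns its retry loop: on allocation failure the requested length is roughly halved, and with a plain (__len + 1) / 2 a failing one-element request would retry forever, since (1 + 1) / 2 == 1. A rough stand-alone sketch of the guarded loop, using malloc for illustration (the real function uses nothrow operator new):

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>

// Try to allocate raw memory for len objects, shrinking the request on
// failure; the (len == 1) ? 0 : ... guard makes the loop terminate
// instead of retrying a 1-element request forever.
template<typename T>
std::pair<T*, std::ptrdiff_t>
get_temporary_buffer_sketch(std::ptrdiff_t len)
{
  while (len > 0)
    {
      T* p = static_cast<T*>(std::malloc(len * sizeof(T)));
      if (p)
        return std::make_pair(p, len);
      len = (len == 1) ? 0 : (len + 1) / 2;
    }
  return std::make_pair(static_cast<T*>(0), std::ptrdiff_t(0));
}
```

Callers receive the length actually obtained, which may be smaller than requested, and a null pointer with length 0 when nothing could be allocated.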
Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic (PR libstdc++/83938)
On 29/10/18 07:06 +0100, François Dumont wrote: Hi Some feedback regarding this patch ? Sorry this got missed, please resubmit during stage 1. You haven't CC'd the original patch author (chang jc) to give them a chance to comment on your proposed changes to the patch. The attached PDF on PR libstdc++/83938 has extensive discussion of the performance issue, but I don't see any for your version. Some performance benchmarks for your version would be helpful. Thanks, François On 8/21/18 10:34 PM, François Dumont wrote: I missed a test that was failing because of this patch. So I revert a small part of it and here is the new proposal. Tested under Linux x86_64, ok to commit ? François On 24/07/2018 12:22, François Dumont wrote: Hi Any chance to review this patch ? François On 06/06/2018 18:39, François Dumont wrote: Hi I review and rework this proposal. I noticed that the same idea to limit buffer size within inplace_merge also apply to stable_sort. I also change the decision when buffer is too small to consider the buffer size rather than going through successive cuts of the original ranges. This way we won't cut the range more than necessary. Note that if you prefer this second part could be committed seperatly. PR libstdc++/83938 * include/bits/stl_algo.h: (__stable_partition_adaptive): When buffer to small cut range using buffer size. (__inplace_merge): Take temporary buffer length from smallest range. (__merge_adaptive): When buffer too small consider smallest range first and cut based on buffer size. (__stable_sort_adaptive): When buffer too small cut based on buffer size. (__stable_sort): Limit temporary buffer length. * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function to reduce requested buffer length on allocation failure. Tested under Linux x86_64. Ok to commit ? François On 25/01/2018 23:37, chang jc wrote: Hi: 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as __len = (__len == 1) ? 0 : ((__len + 1) / 2); 2. 
Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic (PR libstdc++/83938)
On 16/07/19 18:40 +0200, François Dumont wrote:

Hi

I eventually spent much more time working on the inplace_merge performance bench. And the results do not confirm the theory.

Before patch: [...]

After the patch to reduce memory allocation: [...]

With the __len1 > __len2 condition change: [...]

When there is no memory limit the results are identical, of course. Otherwise, as soon as memory is limited, performance starts to decrease with the condition change on __len1 vs __len2. Could you share the bench you used to demonstrate the enhancement ?

I also wonder why your patch doesn't change the same condition consistently in __merge_without_buffer.

For the moment I'd like to propose the attached patch, that is to say the reduction of the amount of allocated memory and the new/modified benches.
Note that as soon as I forbid any memory allocation I also reduce the size of the range to merge, because the implementation relies on recursion and so could end up in a stack overflow. Maybe I need to check for simulator targets like several other tests do ? Unless simulators do not run the performance tests ?

The performance tests are never run by default.

I don't think we should spend too much time caring about the performance of sorting close to Out Of Memory conditions. We don't try to optimize std::vector or other cases to work when close to OOM. So I think just reducing the memory usage is the right approach here.

Regarding this stack overflow issue, is there some routine to find out how many levels of function calls can be added before reaching a stack overflow ? I could perhaps call __builtin_alloca and check the result, but that smells. If I could find that out, we could fall back on an iterative approach to complete the merge.

No, alloca is inherently unsafe. Please don't even consider it :-)

    PR libstdc++/83938
    * include/bits/stl_algo.h (__inplace_merge): Take temporary buffer
    length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
    computation in the loop.

Please use "Change __len computation in the loop to avoid truncation." It's a bit more descriptive.

    * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all
    possible pivot positions.
    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
    execution with different memory constraints.

Ok to commit ?

François
Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic (PR libstdc++/83938)
Hi

Some feedback regarding this patch ?

Thanks,
François

On 8/21/18 10:34 PM, François Dumont wrote:

I missed a test that was failing because of this patch. So I reverted a small part of it and here is the new proposal. Tested under Linux x86_64, ok to commit ?
Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic (PR libstdc++/83938)
I missed a test that was failing because of this patch. So I reverted a small part of it and here is the new proposal.

Tested under Linux x86_64, ok to commit ?

François