gruuya commented on PR #7180:
URL: 
https://github.com/apache/arrow-datafusion/pull/7180#issuecomment-1663961615

   > I also was thinking we can probably write a test for this behavior in 
https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/memory_limit.rs
   > 
   > I have something similar in [#7130 
(review)](https://github.com/apache/arrow-datafusion/pull/7130#pullrequestreview-1559634018)
   > 
   > I can potentially help writing that test if you agree it would be helpful
   
   Thanks, I certainly planned on adding tests, just wanted to concentrate on 
the benchmarks first. I'm thinking of starting with `order_spill_fuzz.rs` file 
for starters, extend the cases for some limit numbers, then if it's needed I 
can extend 
https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/memory_limit.rs
 as well.
   
   Incidentally, there's a [failing 
test](https://github.com/apache/arrow-datafusion/actions/runs/5749956405/job/15586047326?pr=7180#step:6:2884)
 now, which I've been looking at and it's a bit curious—the difference comes 
down to the fact that previously processed batch size was a bit different (due 
to the fetch) and the intermediate sort call was stable (i.e. preserved order), 
while currently it's unstable (didn't preserve order).
   
   To be more specific, in this particular 
[test](https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/test_files/window.slt#L3198-L3215)
 there are always 2 input batches (I've added the column C3 for more clarity).
   One with 99 rows
   ```rust
   [
       
"+--------------------------------------------------------------------------------------------------------------------------------+----------------------+------+",
       "| MAX(aggregate_test_100.c12) ORDER BY [aggregate_test_100.c12 ASC 
NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | max1            
     | c3   |",
       
"+--------------------------------------------------------------------------------------------------------------------------------+----------------------+------+",
       "| 0.01479305307777301                                                   
                                                         | 0.01479305307777301  
| -86  |",
       "| 0.02182578039211991                                                   
                                                         | 0.02182578039211991  
| -38  |",
       "| 0.03968347085780355                                                   
                                                         | 0.03968347085780355  
| 30   |",
       "| 0.04429073092078406                                                   
                                                         | 0.04429073092078406  
| 36   |",
       "| 0.047343434291126085                                                  
                                                         | 0.047343434291126085 
| -95  |",
       ...
       "| 0.980809631269599                                                     
                                                         | 0.980809631269599    
| 52   |",
       "| 0.991517828651004                                                     
                                                         | 0.991517828651004    
| 29   |",
       
"+--------------------------------------------------------------------------------------------------------------------------------+----------------------+------+",
   ]
   ```
   and one with 1 row:
   ```rust
   [
       
"+--------------------------------------------------------------------------------------------------------------------------------+--------------------+-----+",
       "| MAX(aggregate_test_100.c12) ORDER BY [aggregate_test_100.c12 ASC 
NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | max1            
   | c3  |",
       
"+--------------------------------------------------------------------------------------------------------------------------------+--------------------+-----+",
       "| 0.9965400387585364                                                    
                                                         | 0.9965400387585364 | 
120 |",
       
"+--------------------------------------------------------------------------------------------------------------------------------+--------------------+-----+",
   ]
   ```
   In the case of this PR the first batch is sorted to:
   ```rust
   [
       
"+--------------------------------------------------------------------------------------------------------------------------------+---------------------+------+",
       "| MAX(aggregate_test_100.c12) ORDER BY [aggregate_test_100.c12 ASC 
NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | max1            
    | c3   |",
       
"+--------------------------------------------------------------------------------------------------------------------------------+---------------------+------+",
       "| 0.8506721053047003                                                    
                                                         | 0.8506721053047003  
| -117 |",
       "| 0.9706712283358269                                                    
                                                         | 0.9706712283358269  
| -117 |",
       "| 0.152498292971736                                                     
                                                         | 0.152498292971736   
| -111 |",
       "| 0.36936304600612724                                                   
                                                         | 0.36936304600612724 
| -107 |",
       "| 0.565352842229935                                                     
                                                         | 0.565352842229935   
| -106 |",
       
"+--------------------------------------------------------------------------------------------------------------------------------+---------------------+------+",
   ]
   ```
   I.e. without preserving the order for the same C3
   On main this isn't manifested since the two batches get concatenated into a 
batch with 100 rows and the sorting of that batch turns out to be stable for 
some reason:
   ```rust
   [
       
"+--------------------------------------------------------------------------------------------------------------------------------+---------------------+------+",
       "| MAX(aggregate_test_100.c12) ORDER BY [aggregate_test_100.c12 ASC 
NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | max1            
    | c3   |",
       
"+--------------------------------------------------------------------------------------------------------------------------------+---------------------+------+",
       "| 0.9706712283358269                                                    
                                                         | 0.9706712283358269  
| -117 |",
       "| 0.8506721053047003                                                    
                                                         | 0.8506721053047003  
| -117 |",
       "| 0.152498292971736                                                     
                                                         | 0.152498292971736   
| -111 |",
       "| 0.36936304600612724                                                   
                                                         | 0.36936304600612724 
| -107 |",
       "| 0.565352842229935                                                     
                                                         | 0.565352842229935   
| -106 |",
       
"+--------------------------------------------------------------------------------------------------------------------------------+---------------------+------+",
   ]
   ```
   
   Still need to look into this a bit more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to