Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2024-07-29 Thread via GitHub


alamb commented on issue #10336:
URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2256432095

   Sorry for the delay @leoyvens and thank you for this analysis
   
   > https://github.com/apache/datafusion/issues/11170
   
   I would personally love to take this approach 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2024-07-23 Thread via GitHub


leoyvens commented on issue #10336:
URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2246022064

   One thing I've noticed is that after DataFusion 40 this actually works in my 
use case, likely thanks to the statistics code getting fixed, so good news 
there! It does require additionally setting `execution.collect_statistics = 
true`, which makes sense.
   
   However, for my entirely sorted and non-overlapping dataset it did make 
Parquet scanning single-threaded (`ParquetExec` with a single file group), 
which is a big performance regression. So it didn't really help me; maybe I 
actually want #10316.
   
   The consequence for this issue is that turning this on by default would 
regress performance for users that have `execution.collect_statistics = true`. 
Maybe the flag should be merged with `prefer_existing_sort`, which has the 
semantics of avoiding sorts at the cost of limiting parallelism. Or maybe just 
wait for #10316, so we can both avoid the sort and still have a parallel 
`ParquetExec`.
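
To illustrate the trade-off described above, here is a minimal sketch that assumes each file group is scanned by one partition and that files in a group must have strictly non-overlapping key ranges. The function name `min_file_groups` and the sample ranges are hypothetical, not part of DataFusion. It counts the smallest number of groups a set of `(min, max)` key ranges can be packed into, which under that assumption is the upper bound on scan parallelism.

```python
def min_file_groups(ranges):
    """Smallest number of groups such that ranges within a group never overlap.

    `ranges` is a list of (min_key, max_key) pairs, both ends inclusive.
    Hypothetical helper for illustration; not a DataFusion API.
    """
    events = []
    for lo, hi in ranges:
        events.append((lo, 0))  # 0 = a range opens at this key
        events.append((hi, 1))  # 1 = a range closes at this key
    # At equal keys, opens sort before closes, so ranges that merely touch
    # are counted as overlapping (assumed strict comparison).
    events.sort()
    open_now = 0
    best = 0
    for _, kind in events:
        if kind == 0:
            open_now += 1
            best = max(best, open_now)
        else:
            open_now -= 1
    return best


# Entirely sorted, non-overlapping files: everything fits into one group.
sorted_dataset = [(0, 9), (10, 19), (20, 29), (30, 39)]
# Overlapping files: several groups are needed, so several partitions remain.
overlapping_dataset = [(0, 25), (10, 19), (15, 39), (30, 45)]

print(min_file_groups(sorted_dataset))
print(min_file_groups(overlapping_dataset))
```

For the fully sorted dataset the count is 1, which matches the single file group observed above: the better the data is laid out, the less parallelism is left after grouping.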





Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2024-05-04 Thread via GitHub


alamb commented on issue #10336:
URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2094127979

   Thank you @yyy1000  🙏 
   
   I think a good place to start would be to write some sqllogictest-level tests 
to cover the important cases
   
   Perhaps for the first test:
   1. Create files: file1.parquet, file2.parquet both sorted on `a` but file 1 
has the columns in the order `a, b, c` and file 2 has the columns in the order 
`c, b, a`. The key ranges of `a` should be non-overlapping
   2. Create an external table `a, b, c` with explicit order by `a`, and then 
query `SELECT ... ORDER BY a` and make sure the output plan doesn't use sort 
preserving merge
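
The precondition in step 1 can be sketched as follows. This is only an in-memory model of the two files, with made-up rows and hypothetical helper names (`key_range`, `ranges_overlap`); it is not how the sqllogictest would be written. It assumes the point of the differing column order is that the range of `a` must be resolved by column name rather than by position.

```python
# Hypothetical in-memory stand-ins for file1.parquet and file2.parquet.
# Both are sorted on `a`, but the columns are stored in a different order.
file1 = {
    "columns": ["a", "b", "c"],
    "rows": [(1, "x", 10.0), (2, "y", 20.0), (3, "z", 30.0)],
}
file2 = {
    "columns": ["c", "b", "a"],
    "rows": [(40.0, "p", 4), (50.0, "q", 5), (60.0, "r", 6)],
}


def key_range(file, key):
    """Return (min, max) of `key`, resolving the column by name, not position."""
    idx = file["columns"].index(key)
    values = [row[idx] for row in file["rows"]]
    return min(values), max(values)


def ranges_overlap(r1, r2):
    """True if two inclusive (min, max) ranges share at least one value."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]


range1 = key_range(file1, "a")
range2 = key_range(file2, "a")
print(range1, range2, ranges_overlap(range1, range2))
```

Reading position 0 of file 2 would pick up `c` instead of `a`, which is the kind of mix-up this test case is presumably meant to catch.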
   
   
   I think we could extend 
datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt 
   





Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2024-05-03 Thread via GitHub


yyy1000 commented on issue #10336:
URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2093968410

   I'd like to help with it. 🙌





Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2024-05-01 Thread via GitHub


alamb commented on issue #10336:
URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2089121776

   Example of test coverage I think we should add: 
https://github.com/apache/datafusion/pull/9593#discussion_r1585517605





[I] Enable `split_file_groups_by_statistics` by default [datafusion]

2024-05-01 Thread via GitHub


alamb opened a new issue, #10336:
URL: https://github.com/apache/datafusion/issues/10336

   ### Is your feature request related to a problem or challenge?
   
   Part of https://github.com/apache/datafusion/issues/10313
   
   In https://github.com/apache/datafusion/pull/9593, @suremarc added a way to 
reorganize input files in a ListingTable to avoid a merge if the sort key 
ranges do not overlap.
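
As a rough illustration of the idea (not the code from the PR), the sketch below groups files with a first-fit pass over their min/max statistics, so that the files inside each group have non-overlapping key ranges and can be read back to back in sorted order. `FileStats`, the function body, and the sample files are all hypothetical, and the strict `<` comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for one file's min/max statistics on the sort key."""
    name: str
    min_key: int
    max_key: int


def split_groups_by_statistics(files):
    """First-fit grouping: each group holds files whose key ranges do not overlap.

    Illustrative only; the real implementation may differ.
    """
    groups = []
    # Visit files in ascending order of their minimum key.
    for f in sorted(files, key=lambda f: (f.min_key, f.max_key)):
        for group in groups:
            # A file can extend a group only if it starts strictly after
            # the group's last file ends (assumed strict comparison).
            if group[-1].max_key < f.min_key:
                group.append(f)
                break
        else:
            # No existing group can take the file, so start a new one.
            groups.append([f])
    return groups


files = [
    FileStats("f3", 20, 29),
    FileStats("f1", 0, 9),
    FileStats("f2", 5, 14),
    FileStats("f4", 15, 19),
]
groups = split_groups_by_statistics(files)
group_names = [[f.name for f in g] for g in groups]
print(group_names)
```

Here `f2` overlaps `f1` and ends up in its own group, while `f1`, `f4`, and `f3` chain into one group that is already sorted on the key, so no merge is needed within that group.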
   
   This feature is behind a feature flag, `split_file_groups_by_statistics`, 
which defaults to `false` as I think there need to be some more tests in place 
before we turn it on.
   
   ### Describe the solution you'd like
   
   Add additional tests and then enable `split_file_groups_by_statistics` by 
default
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_

