BlakeOrth opened a new pull request, #17050:
URL: https://github.com/apache/datafusion/pull/17050

   ## Which issue does this PR close?
   
   - Closes #17049
   
   ## What changes are included in this PR?
   
    - Fixes an issue in the `ListingTableFactory` where Hive partition columns 
are not detected and incorporated into the table schema when the user has not 
set an explicit schema
    - Fixes an issue where no files are detected when a path that represents a 
collection has a `.` in the final element of the prefix, because the text 
following the `.` was interpreted as a file extension (e.g. 
`s3://bucket/prefix/version.v1/` would only attempt to list files ending 
with `.v1` instead of the expected extension, such as `.csv` or `.parquet`)
    - Fixes an issue where subdirectories that do not follow Hive formatting 
(e.g. key=value) could be erroneously interpreted as contributing to the table 
schema
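
To make the three behaviors above concrete, here is a minimal sketch of the intended rules. It is not the code in this PR: `extension_from_location`, `parse_hive_segment`, and `hive_partition_columns` are hypothetical helpers written only for illustration, and treating a trailing `/` as the marker for a collection is an assumption.

```rust
/// Hypothetical helper: infer a file extension from a table location, but
/// only when the location names a single file. A trailing '/' is assumed to
/// mark a collection, in which case `None` is returned so the caller can fall
/// back to the format's default extension (such as "csv" or "parquet").
fn extension_from_location(location: &str) -> Option<&str> {
    if location.ends_with('/') {
        return None;
    }
    // Only the final path element is inspected for a '.'.
    let file_name = location.rsplit('/').next()?;
    let (stem, ext) = file_name.rsplit_once('.')?;
    if stem.is_empty() || ext.is_empty() {
        None
    } else {
        Some(ext)
    }
}

/// Hypothetical helper: parse one path segment as a Hive-style `key=value`
/// partition. Segments without '=' or with an empty key are rejected.
fn parse_hive_segment(segment: &str) -> Option<(&str, &str)> {
    let (key, value) = segment.split_once('=')?;
    if key.is_empty() {
        None
    } else {
        Some((key, value))
    }
}

/// Hypothetical helper: collect partition column names from a file path that
/// is relative to the table root. The final element is treated as the file
/// name, and directories that are not `key=value` contribute nothing.
fn hive_partition_columns(relative_path: &str) -> Vec<&str> {
    let mut segments: Vec<&str> = relative_path
        .split('/')
        .filter(|s| !s.is_empty())
        .collect();
    segments.pop(); // drop the file name
    segments
        .into_iter()
        .filter_map(|s| parse_hive_segment(s).map(|(key, _)| key))
        .collect()
}
```

Under these assumptions, `extension_from_location("s3://bucket/prefix/version.v1/")` yields `None` rather than `Some("v1")`, and `hive_partition_columns("staging/year=2024/part-0.parquet")` yields only `year`, because `staging` is not a `key=value` segment.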
   
   ## Are these changes tested?
   
   I'm initially submitting this as a draft PR without tests to provide a solid 
basis for discussion on whether this is the desired solution to the linked 
issue. If/when the approach in this PR is agreed on, I will make additional 
commits to address feedback and to implement tests for the solution. At 
present, the changes have been tested functionally using `datafusion-cli` and 
the public dataset noted in the issue.
   
   ## Are there any user-facing changes?
   
   Part of the reason I've left this in draft is that these changes could 
affect users of the `ListingTableFactory`. I think the behavior implemented 
here is the "expected" behavior, but users may well be relying on the previous 
behavior and could get unexpected results if this PR is merged.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

