cboettig commented on issue #38724:
URL: https://github.com/apache/arrow/issues/38724#issuecomment-1817590796

   Thanks @amoeba!  Agree that consistency across bindings would be great.  
Good question about breaking changes.  The two cases I can think of are when the 
same column name already exists in the parquet itself, or when the user is 
manually adding the additional column.  At least in the R interface neither of 
these is a breaking change, since arrow is already comfortable opening a 
partitioned dataset that also has those columns hardcoded into the parquet/csv, 
and since re-constructing the partition columns with 
`mutate(path = add_filename(), col1 = str_extract(...))` would still work.  
   
   But there is always a chance of breaking changes not directly related to this 
(code that assumes a certain number/order of columns in the current behavior), 
so I'd be happy if this was an opt-in argument to `hive_style` ... maybe?  
Though the existing documentation says:
   
   > should partitioning be interpreted as Hive-style? Default is NA, which 
means to inspect the file paths for Hive-style partitioning and behave 
accordingly.
   
   which is misleading, because in fact it only inspects the sub-paths below 
the path given to `sources`, not that full path itself.  Maybe something like 
`hive_style = "relative_path"` or `"full_path"` could distinguish this behavior? 
   
   It may be worth considering making this the default behavior eventually.  
On balance, there are probably more users who already assume that arrow parses 
hive notation anywhere in the path than users who explicitly rely on the 
current behavior of only looking at the relative path that comes after their 
given source path.  (Arguably, the current behavior feels more like a bug 
relative to the documented behavior than a missing additional feature.)  
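
   To make the distinction concrete, here is a minimal sketch in plain Python 
(not the arrow API) of the two interpretations.  The helper `hive_keys` and the 
example paths are hypothetical, made up purely for illustration; the sketch 
only shows which `key=value` segments each interpretation would see.

```python
from pathlib import PurePosixPath


def hive_keys(path: str) -> dict:
    """Collect key=value directory segments from a path (hypothetical helper)."""
    keys = {}
    # Only directory components are inspected, never the file name itself.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            keys[key] = value
    return keys


# Assumed example: the source path handed to the reader already contains
# hive-style segments, and the file sits one partition level below it.
source = "bucket/year=2023/country=US"
file = "bucket/year=2023/country=US/month=11/part-0.parquet"

# "relative_path" interpretation: only segments below the source are parsed.
relative_keys = hive_keys(str(PurePosixPath(file).relative_to(source)))

# "full_path" interpretation: every segment of the path is parsed.
full_keys = hive_keys(file)

print(relative_keys)
print(full_keys)
```

   Under the relative interpretation the `year` and `country` segments are 
dropped because they sit inside the source path; under the full-path 
interpretation they would come back as partition columns.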


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

