aersam commented on issue #36593: URL: https://github.com/apache/arrow/issues/36593#issuecomment-1662215807
It seems using `replace_schema` does not work: the dataset always uses those column names to query the Parquet files, meaning the column names in the schema must match the ones in the physical files.

What is really needed is a separation between physical column names and logical column names. This would be very useful, especially since Parquet is somewhat limited in which column names are allowed. The best option would be a "column mapping" on the [fragment](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Fragment.html) that maps the schema's column names to the physical column names. This would allow querying Parquet files that use different physical column names for the same logical column. I guess that gets a bit complex with regard to filters, but it would still be great. If we ever want to abstract Apache Iceberg or Delta Lake tables with the dataset API, this would be needed (both formats support such a column mapping).
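To make the proposal concrete, here is a minimal pure-Python sketch of what a per-fragment column mapping could look like. Everything here is hypothetical: `ColumnMapping` is not part of the pyarrow API, and the `(column, op, value)` filter triples only mimic pyarrow's legacy filter syntax for illustration.

```python
# Hypothetical sketch of a per-fragment logical->physical column mapping.
# None of these names exist in pyarrow; this only illustrates the idea.

class ColumnMapping:
    def __init__(self, logical_to_physical):
        # Maps schema (logical) column names to the names stored
        # in this particular Parquet file.
        self.l2p = dict(logical_to_physical)

    def project(self, logical_columns):
        # Translate a logical projection into the physical column
        # names that must actually be read from the file.
        return [self.l2p[name] for name in logical_columns]

    def rewrite_filter(self, logical_filter):
        # Rewrite an equality/comparison filter, expressed as a
        # (column, op, value) triple, before pushing it down.
        col, op, value = logical_filter
        return (self.l2p[col], op, value)


# Two fragments store the same logical columns under different
# physical names, e.g. because the column was renamed at some point.
fragment_a = ColumnMapping({"user_id": "col_0", "amount": "col_1"})
fragment_b = ColumnMapping({"user_id": "uid", "amount": "amt"})

for frag in (fragment_a, fragment_b):
    print(frag.project(["user_id", "amount"]))
    print(frag.rewrite_filter(("user_id", "=", 42)))
```

With such a layer, the scanner would resolve each fragment's projection and filters through its own mapping, so two files with different physical names still answer the same logical query.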
