aersam commented on issue #36593: URL: https://github.com/apache/arrow/issues/36593#issuecomment-1662215807
It seems using `replace_schema` does not work: the dataset always uses those column names to query the Parquet files, meaning the column names in the schema must match the ones in the physical files.

What is really needed is a separation between physical column names and logical column names. This would be very useful, especially since Parquet is somewhat limited in which column names are allowed. The best option would be a "column mapping" on the [fragment](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Fragment.html) that maps the schema's column names to the physical column names. This would allow querying Parquet files that use different physical column names for the same logical column. I guess that gets a bit complex with regard to filters, but it would still be great. If we ever want to abstract Apache Iceberg or Delta Lake tables with the dataset API, this would be needed (both formats support such a column mapping).
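To make the proposal concrete, here is a minimal pure-Python sketch of what a per-fragment column mapping could look like. Everything here is hypothetical: `ColumnMapping` is not part of the pyarrow API, and the `(column, op, value)` filter triples only mimic pyarrow's legacy filter syntax for illustration.

```python
# Hypothetical sketch of a per-fragment logical->physical column mapping.
# None of these names exist in pyarrow; this only illustrates the idea.

class ColumnMapping:
    def __init__(self, logical_to_physical):
        # Maps schema (logical) column names to the names stored
        # in this particular Parquet file.
        self.l2p = dict(logical_to_physical)

    def project(self, logical_columns):
        # Translate a logical projection into the physical column
        # names that must actually be read from the file.
        return [self.l2p[name] for name in logical_columns]

    def rewrite_filter(self, logical_filter):
        # Rewrite an equality/comparison filter, expressed as a
        # (column, op, value) triple, before pushing it down.
        col, op, value = logical_filter
        return (self.l2p[col], op, value)


# Two fragments store the same logical columns under different
# physical names, e.g. because the column was renamed at some point.
fragment_a = ColumnMapping({"user_id": "col_0", "amount": "col_1"})
fragment_b = ColumnMapping({"user_id": "uid", "amount": "amt"})

for frag in (fragment_a, fragment_b):
    print(frag.project(["user_id", "amount"]))
    print(frag.rewrite_filter(("user_id", "=", 42)))
```

With such a layer, the scanner would resolve each fragment's projection and filters through its own mapping, so two files with different physical names still answer the same logical query.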
