hadrian-reppas commented on issue #46629: URL: https://github.com/apache/arrow/issues/46629#issuecomment-2997813098
Hi, I'm taking a look at this issue and have a few questions:

1. What are some situations where schemas are normalized when reading the files? The only `FragmentEvolutionStrategy` I have found is `BasicFragmentEvolution`, which doesn't handle type promotions. Or are you talking about the call to `UnifySchemas` in `DatasetFactory::Inspect`?
2. If so, it looks like the C++ API already supports this:

   ```cpp
   std::string path1 = "dataset/int8.parquet";   // value: dictionary<values=string, indices=int8, ordered=0>
   std::string path2 = "dataset/int16.parquet";  // value: dictionary<values=string, indices=int16, ordered=0>

   auto factory = FileSystemDatasetFactory::Make(
       std::make_shared<arrow::fs::LocalFileSystem>(),
       {path1, path2},
       std::make_shared<ParquetFileFormat>(),
       FileSystemFactoryOptions{}).ValueOrDie();

   InspectOptions options;
   options.fragments = InspectOptions::kInspectAllFragments;
   options.field_merge_options = Field::MergeOptions::Permissive();

   auto schema = factory->Inspect(options).ValueOrDie();  // value: dictionary<values=string, indices=int16, ordered=0>
   auto dataset = factory->Finish(schema).ValueOrDie();
   auto scanner = dataset->NewScan().ValueOrDie()->Finish().ValueOrDie();
   auto table = scanner->ToTable().ValueOrDie();  // value: dictionary<values=string, indices=int16, ordered=0>
   ```

It seems that doing it this way in Python is not currently possible, because the `FileSystemDatasetFactory.inspect` method [does not take an `options` argument](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileSystemDatasetFactory.html#pyarrow.dataset.FileSystemDatasetFactory.inspect).