hadrian-reppas commented on issue #46629:
URL: https://github.com/apache/arrow/issues/46629#issuecomment-2997813098

   Hi, I'm taking a look at this issue and have a few questions:
   1. In what situations are schemas normalized when reading the files? The only `FragmentEvolutionStrategy` I have found is `BasicFragmentEvolution`, which doesn't handle type promotions. Or are you talking about the call to `UnifySchemas` in `DatasetFactory::Inspect`?
   2. If so, it looks like the C++ API already supports this:
   ```cpp
   #include <arrow/api.h>
   #include <arrow/dataset/api.h>
   #include <arrow/filesystem/api.h>

   // Assumed for brevity: the snippet uses unqualified Arrow names.
   using namespace arrow;
   using namespace arrow::dataset;

   // value: dictionary<values=string, indices=int8, ordered=0>
   std::string path1 = "dataset/int8.parquet";
   // value: dictionary<values=string, indices=int16, ordered=0>
   std::string path2 = "dataset/int16.parquet";

   auto factory = FileSystemDatasetFactory::Make(
       std::make_shared<arrow::fs::LocalFileSystem>(), {path1, path2},
       std::make_shared<ParquetFileFormat>(),
       FileSystemFactoryOptions{}).ValueOrDie();

   // Inspect every fragment and allow permissive field merging.
   InspectOptions options;
   options.fragments = InspectOptions::kInspectAllFragments;
   options.field_merge_options = Field::MergeOptions::Permissive();
   // value: dictionary<values=string, indices=int16, ordered=0>
   auto schema = factory->Inspect(options).ValueOrDie();

   auto dataset = factory->Finish(schema).ValueOrDie();
   auto scanner = dataset->NewScan().ValueOrDie()->Finish().ValueOrDie();
   // value: dictionary<values=string, indices=int16, ordered=0>
   auto table = scanner->ToTable().ValueOrDie();
   ```
   It seems like doing it this way in Python is currently impossible because the `FileSystemDatasetFactory.inspect` method [does not take an `options` argument](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileSystemDatasetFactory.html#pyarrow.dataset.FileSystemDatasetFactory.inspect).
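
For illustration only, the promotion seen in the C++ output (int8 and int16 indices unifying to int16) can be modeled with a toy merge rule. This is a minimal sketch that does not use Arrow at all: `DictType`, `IndexWidth`, and `MergeDictTypes` are hypothetical names, and the rule itself (same value type and ordering, keep the wider signed index) is an assumption inferred from the output above, not Arrow's actual implementation.

```cpp
#include <algorithm>
#include <optional>
#include <string>

// Hypothetical stand-in for a signed-integer dictionary index type.
// The enumerator values are the bit widths, so comparing them compares widths.
enum class IndexWidth : int { kInt8 = 8, kInt16 = 16, kInt32 = 32, kInt64 = 64 };

// Hypothetical stand-in for dictionary<values=..., indices=..., ordered=...>.
struct DictType {
  std::string value_type;
  IndexWidth index_width;
  bool ordered;
};

// Toy merge rule (an assumption, not Arrow's code): two dictionary types merge
// only if their value types and ordering match; the result keeps the wider index.
std::optional<DictType> MergeDictTypes(const DictType& a, const DictType& b) {
  if (a.value_type != b.value_type || a.ordered != b.ordered) {
    return std::nullopt;  // incompatible in this toy model
  }
  return DictType{a.value_type, std::max(a.index_width, b.index_width), a.ordered};
}
```

Under these assumptions, merging `{"string", IndexWidth::kInt8, false}` with `{"string", IndexWidth::kInt16, false}` yields an int16-indexed type, mirroring the schema reported by `Inspect` in the snippet above.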


