tustvold opened a new issue #1032: URL: https://github.com/apache/arrow-rs/issues/1032
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

The current API for the parquet crate is rather large, and exposes quite a lot of implementation detail. This has a couple of implications:

* It complicates iterating on the crate without making breaking changes to public APIs
* It adds to users' cognitive load as they have to work out which APIs to use

Some examples of this:

* The `util` module contains all sorts of random stuff - a hash implementation, maths functions, memory tracking, etc...
* The `compression` module
* `data_type::AsBytes`, `data_type::SliceAsBytes`, `data_type::SliceAsBytesDataType`
* `data_type::DataType`, `ColumnReaderImpl`, `RecordReader`
* `schema::types::to_thrift`

**Describe the solution you'd like**

I'm not familiar enough with the design of the crate to authoritatively weigh in on what should or shouldn't be public. However, it is my observation that a number of the APIs don't appear to be optimised for external consumption.

My **personal** preference would be to make everything below the file-level APIs (i.e. `SerializedFileReader`, `ParquetFileArrowReader`, `RowIter`) crate-local. This would have the benefit of being pretty unambiguous and easy to communicate and maintain.

This change would obviously need to be made in a major arrow release, the next of which I believe is in January 2022 (@alamb could maybe confirm). I don't know if there are people making use of the lower-level APIs operating on columns, row groups, column chunks, pages, etc... However, any APIs could be made public again in a point-release based on user feedback.

I think this sort of touches on the objectives for the crate: is the intent to provide APIs for manipulating parquet files, or APIs for implementing parquet readers and writers for your own custom in-memory format? If the latter, this change would be at odds with it, but I'm not sure this is the case.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
