alamb opened a new issue, #105: URL: https://github.com/apache/parquet-testing/issues/105
- Part of https://github.com/apache/parquet-format/issues/533

As part of adding the Adaptive Lossless Floating-Point (ALP) encoding to the Parquet standard, we should provide sample files in the parquet-testing repo that other implementations can use to verify that they read such files correctly.

Quoting @CurtHagenlocher [on the dev list](https://lists.apache.org/thread/j8t5g0lpky00c6m1ftkz1jcykmf7snvk)

> As part of the process of amending the Parquet format, perhaps it would be a good idea for early implementations to generate sample files and commit them to [apache/parquet-testing: Apache Parquet Testing](https://github.com/apache/parquet-testing) for other implementations to leverage?

# Suggested Requirements

## Size

Given that the parquet-testing repository is checked out many times by many different repositories as part of CI, it is important to keep these example files small. They should typically be no more than a few KB each.

## Reference Values

I suggest we follow the model of BYTE_STREAM_SPLIT (see [here](https://github.com/apache/parquet-testing/blob/master/data/README.md#additional-types)) and create a single parquet file that has multiple columns with the different test and validation sets.

For example, one column of `PLAIN` encoded f32 and one column of `PLAIN` encoded f64 as a baseline, and then several columns of the same data encoded using ALP with different parameters (to cover the different parts of the spec).

## ALP / patterns

We should ensure the dataset has ALP data with the following properties:

* Vectors with no exceptions
* Vectors with NaN, Inf, etc.
* Vectors with many/most exceptions (e.g. random float data)
* All possible ALP bit width sizes (1 -> 15 == 65k)
* Both f32 and f64

## Documentation

Here is some other documentation that I think shows the best practice:

* https://github.com/apache/parquet-testing/blob/master/data/README.md#additional-types
* https://github.com/apache/parquet-testing/tree/master/variant#descriptions
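
To make the properties listed under "ALP / patterns" concrete, here is a minimal Python sketch of how the baseline input values could be generated before they are written to `PLAIN` and ALP columns. It uses only the standard library and does not write Parquet or perform ALP encoding. Everything named here is hypothetical: the row count, the column names, the helper functions, and the assumption that two-decimal values encode without exceptions and that the packed bit width follows the range of the scaled integers. The upper bit width of 15 is taken from the list above as written and should be checked against the final spec.

```python
import math
import random
import struct

ROWS = 32  # hypothetical row count, kept tiny to respect the size guideline


def decimal_vector(bit_width, n=ROWS, scale=100):
    """Two-decimal values k/scale with k spread over 0 .. 2**bit_width - 1."""
    top = (1 << bit_width) - 1
    return [(i * top // (n - 1)) / scale for i in range(n)]


def special_vector(n=ROWS):
    """NaN, both infinities and both zeros, padded with ordinary decimals."""
    specials = [math.nan, math.inf, -math.inf, -0.0, 0.0]
    return specials + [i / 10 for i in range(1, n - len(specials) + 1)]


def random_bits_vector(n=ROWS, seed=0):
    """Doubles built from random 64-bit patterns (assumed to be mostly exceptions)."""
    rng = random.Random(seed)
    return [
        struct.unpack("<d", struct.pack("<Q", rng.getrandbits(64)))[0]
        for _ in range(n)
    ]


def to_f32(x):
    """Round to the nearest float32; raises OverflowError if out of f32 range."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def build_columns(max_bit_width=15, seed=0):
    """Return a dict of hypothetical column name -> list of f64 values."""
    cols = {
        "no_exceptions": decimal_vector(10),
        "specials": special_vector(),
        "mostly_exceptions": random_bits_vector(seed=seed),
    }
    # One column per target bit width; upper bound taken from the issue text.
    for width in range(1, max_bit_width + 1):
        cols[f"bit_width_{width}"] = decimal_vector(width)
    return cols
```

The f32 variants of the decimal and special columns could be derived with `[to_f32(x) for x in column]`; the random column would need its own 32-bit generator, since random doubles frequently fall outside the float32 range. A fixed seed keeps the generated file reproducible across implementations.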
