alamb opened a new issue, #105:
URL: https://github.com/apache/parquet-testing/issues/105

   - Part of https://github.com/apache/parquet-format/issues/533
   
   As part of adding the Adaptive Lossless Floating-Point (ALP) encoding to 
the Parquet standard, we should provide sample files in the parquet-testing 
repo that other implementations can use to verify that they read such files 
correctly.
   
   Quoting @CurtHagenlocher [on the dev 
list](https://lists.apache.org/thread/j8t5g0lpky00c6m1ftkz1jcykmf7snvk)
   
   > As part of the process of amending the Parquet format, perhaps it would be 
a good idea for early implementations to generate sample files and commit them 
to [apache/parquet-testing: Apache Parquet 
Testing](https://github.com/apache/parquet-testing) for other implementations 
to leverage?
   
   # Suggested Requirements
   
   ## Size
   
   Given that the parquet-testing repository is checked out many times by many 
different repositories as part of CI, it is important to keep these example 
files small. Each should typically be no more than a few KB.
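
One way to enforce this is a small check that CI could run over the data directory. The sketch below is only an illustration: the 8 KB threshold and the `oversized_files` helper are assumptions, not an agreed limit or an existing script in the repository.

```python
import os

# Hypothetical threshold standing in for "a few KB"; not an agreed limit.
MAX_BYTES = 8 * 1024


def oversized_files(directory, max_bytes=MAX_BYTES):
    """Return (name, size) for every .parquet file larger than max_bytes."""
    found = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (name.endswith(".parquet") and os.path.isfile(path)):
            continue
        size = os.path.getsize(path)
        if size > max_bytes:
            found.append((name, size))
    return found
```

A CI job could fail whenever `oversized_files("data")` returns a non-empty list.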
   
   ## Reference Values
   I suggest we follow the model of BYTE_STREAM_SPLIT (see 
[here](https://github.com/apache/parquet-testing/blob/master/data/README.md#additional-types))
 and create a single parquet file that has multiple columns with the different 
test and validation sets.
   
   For example, one column of `PLAIN` encoded f32 and one column of `PLAIN` 
encoded f64 as a baseline, and then several columns of the same data encoded 
using ALP with different parameters (to cover the different parts of the spec).
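
Because ALP is lossless, a reader could validate itself by comparing each decoded ALP column against the `PLAIN` baseline bit for bit. Below is a minimal sketch of such a comparison, assuming both columns have already been read into Python lists by whatever reader is under test; the helper names are hypothetical. Comparing bit patterns rather than using `==` matters because NaN never compares equal to itself and `0.0 == -0.0` is true.

```python
import struct


def bits_f32(x):
    """Bit pattern of x as an IEEE 754 single (x must be representable as f32)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f64(x):
    """Bit pattern of x as an IEEE 754 double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def columns_match(baseline, decoded, to_bits=bits_f64):
    """True if both columns have the same length and identical bit patterns."""
    if len(baseline) != len(decoded):
        return False
    return all(to_bits(a) == to_bits(b) for a, b in zip(baseline, decoded))
```

For an f32 column, pass `to_bits=bits_f32`.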
   
   ## ALP / patterns
   
   We should ensure the dataset has ALP data with the following properties:
   * Vectors with no exceptions
   * Vectors with NaN, Inf, etc.
   * Vectors with many/most exceptions (e.g. random float data)
   * All possible ALP bit widths (1 -> 15 == 65k)
   * Both f32 and f64
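
As a rough sketch, input data with these properties could be generated as follows. Everything here is an assumption made for illustration: the vector length of 1024, the column names, the helper functions, and the choice of two-decimal values as "ALP-friendly" data. Whether a given value actually becomes an exception depends on the encoder and the parameters it picks, so the generated files would still need to be inspected.

```python
import math
import random
import struct

VECTOR_SIZE = 1024  # assumption: values per ALP vector; check against the spec


def decimal_vector(bit_width, n=VECTOR_SIZE, seed=0):
    """Two-decimal values whose scaled integers span [0, 2**bit_width - 1]."""
    rng = random.Random(seed)
    ints = [rng.randrange(2 ** bit_width) for _ in range(n)]
    ints[0], ints[1] = 0, 2 ** bit_width - 1  # pin both ends of the range
    return [k / 100.0 for k in ints]


def special_vector(n=VECTOR_SIZE):
    """Decimal-like values with NaN, +Inf, -Inf and -0.0 scattered in (n >= 4)."""
    values = [i / 10.0 for i in range(n)]
    specials = [math.nan, math.inf, -math.inf, -0.0]
    step = n // len(specials)
    for i, s in enumerate(specials):
        values[i * step] = s
    return values


def random_vector(n=VECTOR_SIZE, seed=0):
    """Uniform random doubles, which are unlikely to look decimal to an encoder."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def to_f32(x):
    """Round a Python float to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def build_columns():
    """Map hypothetical column names to f64 value lists."""
    cols = {
        "no_exceptions": decimal_vector(8),
        "specials": special_vector(),
        "mostly_exceptions": random_vector(),
    }
    for w in range(1, 16):  # bit widths 1 through 15, as listed above
        cols[f"bit_width_{w}"] = decimal_vector(w, seed=w)
    return cols
```

The f32 variants could be derived with `[to_f32(x) for x in column]`, and the resulting lists written once as `PLAIN` and once as ALP by the implementation that produces the file.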
   
   ## Documentation
   
   Here is some other documentation that I think shows best practice:
   * https://github.com/apache/parquet-testing/blob/master/data/README.md#additional-types
   * https://github.com/apache/parquet-testing/tree/master/variant#descriptions

