Parquet Sync Notes
July 17th 2024

Attendees:
- Parth: contributor to Apache Drill, Arrow-java. Currently working on DataFusion-Comet (Spark impl). Topics: What is the direction of Parquet “V3”? Can we make Parquet on S3 easier and better?
- Micah: OSS Data formats for BigQuery at Google. Topics: Interested in keeping Parquet current. How do we manage incompatible changes and releases for Parquet java?
- Gabor: Dremio, Parquet community member for a long time. Topics: Interested in the “Parquet V3” initiative.
- Alkis: Databricks storage stacks (scans, clouds). Worked at Google on fancy encodings in the past. Topics: Discussing Parquet metadata handling and new encodings. Specifically, the extension proposal: [EXTERNAL] Parquet extensions <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
- Fokko: Tabular/Databricks. Topics: Deprecation of old Hadoop and Java support. Jira -> GitHub migration.
- Ed: Lawrence Livermore, power user of Parquet. Topics: How to use Parquet with GPUs. cuDF. Parquet improvement process.
- Julien: Datadog, very early committer/founder. Topics: Interested in making Parquet current, new encodings. How do we solve footer metadata scaling? How to make incompatible changes without breaking everyone. How to add new encodings but not too many. How to improve time series on top of Parquet. How to merge sorted files in an efficient way.
- Nong: Databricks, very early committer/founder. Topics: Interested in the V3 stuff.

Agenda: (built from topics above)
- Direction of “Parquet V3”:
  - Parquet on S3?
  - Merging sorted files efficiently.
  - More efficient for GPUs?
- Alkis’ extension proposal: [EXTERNAL] Parquet extensions <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
- Micah’s Parquet format changes process (we didn’t get to it)
- Jira -> GitHub migration (we didn’t get to it)

Notes:
- Direction for Parquet improvement:
  - Rolling out incompatible changes:
    - Duplicate information until we’re satisfied: for metadata and for encodings as well
  - Better metadata:
    - Support for wide schemas
  - Better encodings:
    - More efficient for GPUs? => parallelizable
    - Better for time series. Delta encodings?
    - BtrBlocks…
  - Merge sorted files on the fly:
    - Ability to read the file from the beginning as a whole, since we’re reading the whole file anyway
    - Side effect of being able to ignore the footer and still read the file in case of corruption.
    - Log-structured merge of Parquet files. Ed: interested in doing this for GPUs. Need better prediction of size of data to load into GPU memory.
- [EXTERNAL] Parquet extensions <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>:
  - Need an extension mechanism to experiment with a new footer or new metadata.
    - In a fully forward/backward compatible manner.
  - Proposed:
    - Reserve a binary field in Thrift that will not be used by the spec.
    - Thrift ignores field ids that are unknown, so old readers will just ignore it.
  - Discussion:
    - 2 use cases:
      - Proprietary extensions
      - New Parquet footer.
    - Remove the notion of vendor to focus on the path to migrate to the new footer.
  - Action Items:
    - Alkis to integrate feedback in his proposal and follow up on the list to finalize

Action Items:
- Since we didn’t get to 2 items, we agreed to change the meeting to bi-weekly for now until it’s not needed.
  - The next meeting is July 31st at 9:30am PT - 12:30pm ET - 6:30pm CET
  - Join here: https://calendar.app.google/UMMmbUV1JMh7ffGt6
- Follow up on the items we didn’t get to on the mailing list:
  - Micah’s proposal for the Parquet format changes process
  - Jira -> GitHub migration
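
Illustration for the extension proposal in the notes above: the proposal relies on old readers skipping field ids they do not recognize. The sketch below is a toy id/length/payload format in Python. It is not Thrift's actual wire format and not the Parquet footer. All field ids and names here (FIELD_VERSION, FIELD_NUM_ROWS, FIELD_EXTENSION, encode_fields, decode_fields) are hypothetical and chosen only to show the skipping behavior.

```python
import struct

# Hypothetical field ids for this toy footer; not taken from parquet.thrift.
FIELD_VERSION = 1
FIELD_NUM_ROWS = 2
FIELD_EXTENSION = 32767  # stand-in for the reserved binary extension field

# Each record starts with: field id (uint16) and payload length (uint32).
HEADER = struct.Struct(">HI")


def encode_fields(fields):
    """Serialize {field_id: payload_bytes} as id/length/payload records."""
    out = bytearray()
    for field_id, payload in fields.items():
        out += HEADER.pack(field_id, len(payload))
        out += payload
    return bytes(out)


def decode_fields(data, known_ids):
    """Return only the fields whose id is in known_ids; skip the rest."""
    result = {}
    offset = 0
    while offset < len(data):
        field_id, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        if field_id in known_ids:
            result[field_id] = data[offset:offset + length]
        # Unknown ids are skipped by advancing past their payload.
        offset += length
    return result


# A new writer adds an opaque extension blob next to the regular fields.
footer = encode_fields({
    FIELD_VERSION: b"\x02",
    FIELD_NUM_ROWS: struct.pack(">Q", 1000),
    FIELD_EXTENSION: b"experimental-footer-bytes",
})

# An old reader only knows the first two ids and ignores the extension.
old_reader = decode_fields(footer, {FIELD_VERSION, FIELD_NUM_ROWS})
# A new reader also understands the extension field.
new_reader = decode_fields(
    footer, {FIELD_VERSION, FIELD_NUM_ROWS, FIELD_EXTENSION}
)
```

In this sketch the old reader still recovers the regular fields unchanged, while the new reader additionally sees the extension bytes. That is the forward/backward property the proposal is after.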
