Hi all,

Project link: https://github.com/spiraldb/raincloud

Thanks for the warm reception at the Parquet sync! It was great to see support for the project's initial prototype.

Raincloud is a pipeline that assembles a curated catalog of public datasets into Parquet files. I want Raincloud to address two needs: First, it should curate a collection of real-world datasets that we, as a community, agree are useful for evaluating file formats. Second, it should make accessing these files easy.

I've selected over 200 datasets from Kaggle, Hugging Face, and directly hosted sources I'm familiar with (e.g., NYC Taxi and Public BI) to serve as an initial catalog for this project. Raincloud uses this catalog to fetch and process each dataset into a Parquet file. A TUI is bundled with the pipeline to explore the catalog, along with human- and AI-focused documentation to minimize setup friction.

This project is very much an early effort; issues, PRs, and dataset suggestions are all welcome. Currently, Raincloud uses Python-based tooling to generate Parquet files, with an optional path to generate Vortex files. I want Raincloud to support additional formats and Parquet writers, which I'm less familiar with.

The folks at Spiral have been highly supportive of this effort and generally want to maintain a "hands-off" attitude, which I'm grateful for. I want Raincloud to be useful for data-driven testing of file formats, whether for research, CI, or other use cases.

Best,
Martin
