Hi all,

Project link: https://github.com/spiraldb/raincloud

Thanks for the warm reception at the Parquet sync! It was great to see
support for the project's initial prototype.

Raincloud is a pipeline that converts a curated catalog of public datasets
into Parquet files. I want Raincloud to address two needs: First, it should
curate a collection of real-world datasets that we, as a community, agree
are useful for evaluating file formats. Second, it should make accessing
these files easy. I've selected over 200 datasets from Kaggle, Hugging
Face, and directly hosted sources I'm familiar with (e.g., NYC Taxi and
Public BI) to serve as an initial catalog for this project. Raincloud uses
this catalog to fetch and process each dataset into a Parquet file. A TUI
is bundled with the pipeline to explore the catalog, along with human- and
AI-focused documentation to minimize setup friction.

This project is very much an early effort; issues, PRs, and dataset
suggestions are all welcome. Currently, Raincloud uses Python-based tooling
to generate Parquet files, with an optional path to generate Vortex files.
I'd also like Raincloud to support additional formats and Parquet writers,
though those are areas I'm less familiar with.

The folks at Spiral have been highly supportive of this effort and
generally want to maintain a "hands-off" attitude, which I'm grateful for.
I want Raincloud to be useful for data-driven testing of file formats,
whether for research, CI, or other use cases.

Best,
Martin
