Hi Martin,

Thank you so much for sharing this; it will be very helpful. I'll try
this out for FSST.
Would it be possible to also document the dataset characteristics
somewhere: data types, counts, nulls, sparsity, etc. (maybe as a summary of
snapshot.json)? It would greatly help in picking a dataset for testing a
code change or encoding. (I went through docs/v1.)
Thanks again for sharing it. I'll try it out on new code changes and let
you know how it goes.

Warm Regards,
Arnav

On Fri, May 8, 2026 at 6:21 AM Martin Prammer via dev <
[email protected]> wrote:

> Hi all,
>
> Project link: https://github.com/spiraldb/raincloud
>
> Thanks for the warm reception at the Parquet sync! It was great to see
> support for the project's initial prototype.
>
> Raincloud is a pipeline that assembles a curated catalog of public datasets
> into Parquet files. I want Raincloud to address two needs: First, it should
> curate a collection of real-world datasets that we, as a community, agree
> are useful for evaluating file formats. Second, it should make accessing
> these files easy. I've selected over 200 datasets from Kaggle, Hugging
> Face, and directly hosted sources I'm familiar with (e.g., NYC Taxi and
> Public BI) to serve as an initial catalog for this project. Raincloud uses
> this catalog to fetch and process each dataset into a Parquet file. A TUI
> is bundled with the pipeline to explore the catalog, along with human- and
> AI-focused documentation to minimize setup friction.
>
> This project is very much an early effort; issues, PRs, and dataset
> suggestions are all welcome. Currently, Raincloud uses Python-based tooling
> to generate Parquet files, with an optional path to generate Vortex files.
> I want Raincloud to support additional formats and Parquet writers, which
> I'm less familiar with.
>
> The folks at Spiral have been highly supportive of this effort and
> generally want to maintain a "hands-off" attitude, which I'm grateful for.
> I want Raincloud to be useful for data-driven testing of file formats,
> whether for research, CI, or other use cases.
>
> Best,
> Martin
>
