Re: Datasets Project - Raincloud

Martin Prammer via dev Mon, 11 May 2026 09:12:54 -0700

Hi Arnav,

Thanks for taking a look and for your suggestions! Currently, the only
discoverability mechanism is manual inspection of each dataset via the TUI
("uv sync --extra tui --inexact && python -m scripts.pipeline.browse"):
scroll through the datasets, then press "c" over a highlighted dataset
to get file-level column statistics (type, nulls, min, max). I'll look into
showing a histogram or a similar visualization with each column this week.
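
As a rough illustration of the statistics mentioned above (type, nulls, min,
max) plus one possible text histogram, here is a minimal stdlib-only Python
sketch. It is not Raincloud's implementation: the helper names column_stats
and text_histogram and the sample rows are hypothetical, and it assumes each
column holds values of a single comparable type.

```python
from collections import Counter


def column_stats(rows, column):
    """Return type, null count, min, and max for one column of row dicts."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    types = sorted({type(v).__name__ for v in present})
    return {
        "type": "/".join(types) if types else "unknown",
        "nulls": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


def text_histogram(values, bins=4, width=20):
    """Render an equal-width histogram of numeric values as text lines."""
    present = [v for v in values if v is not None]
    if not present:
        return []
    lo, hi = min(present), max(present)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    # Assign each value to a bin; the maximum value is clamped into the last bin.
    counts = Counter(min(int((v - lo) / span * bins), bins - 1) for v in present)
    peak = max(counts.values())
    lines = []
    for b in range(bins):
        left = lo + span * b / bins  # left edge of this bin
        bar = "#" * round(counts.get(b, 0) / peak * width)
        lines.append(f"{left:>8.2f} | {bar} {counts.get(b, 0)}")
    return lines


# Hypothetical sample data standing in for rows read from a Parquet file.
rows = [
    {"fare": 4.0, "zone": "A"},
    {"fare": 12.5, "zone": None},
    {"fare": None, "zone": "B"},
    {"fare": 20.0, "zone": "A"},
]
stats = column_stats(rows, "fare")
hist = text_histogram([r["fare"] for r in rows])
print(stats)
print("\n".join(hist))
```

Nulls are counted separately and excluded from the min, max, and histogram
bins, so a mostly-null column still shows the shape of its populated values.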


Best,
Martin


On Sun, May 10, 2026 at 4:39 AM Arnav Balyan <[email protected]> wrote:

> Hi Martin,
>
> Thank you so much for sharing this; it would be very helpful. I'll try
> this out for FSST.
> Would it be possible to also have the dataset characteristics documented
> somewhere: data types, counts, nulls, sparsity, etc. (maybe a summary of
> snapshot.json)? It would greatly help in picking a dataset for testing a
> code change/encoding. (I went through the docs/v1.)
> Thanks again for sharing it. I'll try it out for new code changes and let
> you know how it goes.
>
> Warm Regards,
> Arnav
>
> On Fri, May 8, 2026 at 6:21 AM Martin Prammer via dev <
> [email protected]> wrote:
>
>> Hi all,
>>
>> Project link: https://github.com/spiraldb/raincloud
>>
>> Thanks for the warm reception at the Parquet sync! It was great to see
>> support for the project's initial prototype.
>>
>> Raincloud is a pipeline that assembles a curated catalog of public datasets
>> into Parquet files. I want Raincloud to address two needs: First, it should
>> curate a collection of real-world datasets that we, as a community, agree
>> are useful for evaluating file formats. Second, it should make accessing
>> these files easy. I've selected over 200 datasets from Kaggle, Hugging
>> Face, and directly hosted sources I'm familiar with (e.g., NYC Taxi and
>> Public BI) to serve as an initial catalog for this project. Raincloud uses
>> this catalog to fetch and process each dataset into a Parquet file. A TUI
>> is bundled with the pipeline to explore the catalog, along with human- and
>> AI-focused documentation to minimize setup friction.
>>
>> This project is very much an early effort; issues, PRs, and dataset
>> suggestions are all welcome. Currently, Raincloud uses Python-based tooling
>> to generate Parquet files, with an optional path to generate Vortex files.
>> I want Raincloud to support additional formats and Parquet writers, which
>> I'm less familiar with.
>>
>> The folks at Spiral have been highly supportive of this effort and
>> generally want to maintain a "hands-off" attitude, which I'm grateful for.
>> I want Raincloud to be useful for data-driven testing of file formats,
>> whether for research, CI, or other use cases.
>>
>> Best,
>> Martin
>>
>
