[DISCUSS] Variant shredding specification

2024-06-03 Thread Gene Pang
Hi all,

We have been working on the Variant data type, which is designed to store
and process semi-structured data efficiently, even with heterogeneous
values. Users can store and process semi-structured data in a flexible way,
without having to specify or know any fixed schema on write. Variant data
is encoded in a self-describing format, and the binary format uses
offset-based encoding to speed up navigation.
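
As a rough illustration of why offset-based encoding helps navigation, here
is a minimal Python sketch. The layout (a count, an offset table, then the
value bytes) and the names encode and get are made up for this example and
are not the actual Variant binary format; the point is only that an element
can be located from the offset table without scanning the values before it.

```python
import struct

def encode(values):
    """Pack byte strings as: [count][offset table][concatenated values].

    Toy layout for illustration only, NOT the Variant binary format.
    """
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    header = struct.pack("<I", len(values))
    table = struct.pack(f"<{len(offsets)}I", *offsets)
    return header + table + b"".join(values)

def get(buf, i):
    """Return element i by reading two offsets, without scanning earlier values."""
    (count,) = struct.unpack_from("<I", buf, 0)
    if not 0 <= i < count:
        raise IndexError(i)
    # The offset table starts right after the 4-byte count.
    start, end = struct.unpack_from("<2I", buf, 4 + 4 * i)
    data_start = 4 + 4 * (count + 1)
    return buf[data_start + start : data_start + end]

buf = encode([b"a", b"bcd", b""])
second = get(buf, 1)  # b"bcd"
```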

To further improve performance, we are also working on shredding, which is
the process of extracting some of the Variant fields from the binary, and
storing them in separate columns. We have written a specification for
Variant shredding to augment the existing Variant specification.

The shredding benefits include:
- more compact data encoding
- min/max statistics for data skipping
- I/O and CPU savings from pruning unnecessary fields not accessed by a
query
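
To make the idea concrete, here is a minimal Python sketch of shredding over
plain dicts. It is not the layout defined in the shredding specification:
the function names shred and min_max, the choice of int as the shredded
type, and the use of dicts in place of the Variant binary are all
assumptions made for this example. Values that match the expected type move
to a separate column; everything else stays in the residual record.

```python
def shred(rows, field):
    """Split records into a typed column for `field` plus residual records.

    Hypothetical helper for illustration; the real specification operates on
    the Variant binary, not on Python dicts.
    """
    column, residual = [], []
    for row in rows:
        rest = dict(row)
        value = rest.get(field)
        if isinstance(value, int) and not isinstance(value, bool):
            # Value matches the shredded type: move it to the typed column.
            del rest[field]
            column.append(value)
        else:
            # Missing or differently typed: leave it in the residual.
            column.append(None)
        residual.append(rest)
    return column, residual

def min_max(column):
    """Min/max over the shredded values, or None if nothing was shredded."""
    present = [v for v in column if v is not None]
    return (min(present), max(present)) if present else None

rows = [{"id": 1, "name": "a"}, {"id": "x"}, {"name": "b"}, {"id": 7}]
ids, rest = shred(rows, "id")
stats = min_max(ids)
```

A query that reads only id can scan the typed column and ignore the
residual, and the min/max pair is the kind of statistic a reader could use
to skip data. Note that these statistics cover only the shredded values, not
values left in the residual.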

Please take a look at the shredding specification PR and leave GitHub
comments and suggestions. Your feedback would be greatly appreciated!

Thanks,
Gene


[ANNOUNCE] Announcing Apache Spark 4.0.0-preview1

2024-06-03 Thread Wenchen Fan
Hi all,

To enable wide-scale community testing of the upcoming Spark 4.0 release,
the Apache Spark community has posted a preview release of Spark 4.0. This
preview is not a stable release in terms of either API or functionality,
but it is meant to give the community early access to try the code that
will become Spark 4.0. If you would like to test the release, please
download it, and send feedback using either the mailing lists or JIRA.

There are a lot of exciting new features added to Spark 4.0, including ANSI
mode by default, Python data source, polymorphic Python UDTF, string
collation support, new VARIANT data type, streaming state store data
source, structured logging, Java 17 by default, and many more.

We'd like to thank our contributors and users for their contributions and
early feedback on this release. This release would not have been possible
without you.

To download Spark 4.0.0-preview1, head over to the download page:
https://archive.apache.org/dist/spark/spark-4.0.0-preview1 . It's also
available on PyPI, with version name "4.0.0.dev1".

Thanks,

Wenchen