Point cloud data in Parquet...cool!

This sounds similar to the concept of Field metadata in Arrow: we also have
extension types there, but people do use ad-hoc key/value metadata to convey
non-extension-type information like you are describing. If you are using an
Arrow implementation to write/read the Parquet file, the embedded Arrow
schema may already be able to roundtrip that information.

If this is specific to a domain (e.g., LiDAR), you could also invent a
top-level key/value metadata standard (this is what GeoParquet did before
Parquet gained the GEOMETRY/GEOGRAPHY logical types). A wider variety of
Parquet implementations/versions would be able to access this information
than if the Parquet Thrift definition were updated.
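Roughly, again with pyarrow (the "lidar" key and its contents here are
hypothetical, not an existing standard):

    import json
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical file-level metadata key, in the spirit of GeoParquet's "geo" key
    lidar_meta = {"version": "0.1", "columns": {"x": {"scale": 0.01, "offset": 1000.0}}}

    table = pa.table({"x": [100, 200], "y": [5, 6]})
    table = table.replace_schema_metadata({"lidar": json.dumps(lidar_meta)})
    pq.write_table(table, "points.parquet")

    # The key lands in the file's key/value metadata, readable by any
    # Parquet implementation that exposes it
    file_meta = pq.ParquetFile("points.parquet").metadata.metadata
    print(json.loads(file_meta[b"lidar"]))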

Many of the techniques used to encode LiDAR data and keep rows compact can
also be achieved by expanding bitpacked fields into multiple columns or by
resolving scaled integers into an existing Parquet type. Parquet's encodings
and compression may be able to accomplish compactness similar to how this
data is frequently stored in point-cloud-native formats (although doing so
would make exact roundtripping harder).
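As a sketch of what I mean (the bit layout and scale/offset values are
assumptions, loosely modeled on LAS-style packed flags):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Illustrative packed flag byte: return number in bits 0-2,
    # number of returns in bits 3-5; X stored as a scaled int32.
    packed = [0b001_001, 0b011_010, 0b011_011]
    x_scaled = [123456, 123460, 123470]
    scale, offset = 0.01, 0.0  # assumed values

    table = pa.table({
        # Expand the bitfield into ordinary integer columns
        "return_number": pa.array([p & 0b000111 for p in packed], pa.uint8()),
        "number_of_returns": pa.array([(p >> 3) & 0b000111 for p in packed], pa.uint8()),
        # Resolve scale/offset into a plain float column
        "x": pa.array([v * scale + offset for v in x_scaled], pa.float64()),
    })

    # Parquet's encodings (RLE/dictionary) plus compression keep this compact
    pq.write_table(table, "points.parquet", compression="zstd")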

Apologies if I'm missing context here!

Cheers,

-dewey

On Fri, Oct 31, 2025 at 2:04 PM Andrew Bell <[email protected]>
wrote:

> On Fri, Oct 31, 2025 at 1:14 PM Micah Kornfield <[email protected]>
> wrote:
>
> > Hi Andrew,
> > If this is to support new type (point cloud data), is there a reason to
> > choose a key value member to the schema over something like the extension
> > type proposal [1]
> >
>
> In some ways it's no different -- you're providing some data to ride along
> with a column.  The extension type has the advantage of providing an
> indirection which *might* be useful for the case when you have many columns
> of the same type, though this seems a pretty specific use case and adds
> additional complexity. However, extension types provide no hint of meaning
> to be found in the "serialization" field (JSON is suggested, which could
> provide keys, but would also require an additional parsing step).
>
> Allowing the addition of data to the existing SchemaElement is trivially
> simple and more flexible. Users could add whatever data they like to
> annotate their schema element without introducing anything to the type
> system. For example, one could add a description to an integer element
> without creating an "Integer with Description" type or provide language
> information about a string without creating a type "String in French".
>
> The extension type proposal suggests that readers will be modified to
> support the extension types.  Adding metadata directly to the SchemaElement
> simply allows code *outside* of a Parquet reader to use the information for
> its own purpose -- a reader only needs to provide an API to access the
> metadata to be useful.
>
> Some examples from point cloud data:
>
> - Integers to which a scale and offset are applied to create a nominal
> value (the current integer-based scale/offset are insufficient).
> - Units for many types.
> - GPS times are stored in several ways -- having metadata which may or may
> not include an offset allows for proper interpretation.
> - Descriptions of bit fields packed into integers.
> - Indication that "return" numbers are synthetically generated. (A laser
> pulse can create multiple points, each known as a "return").
>
> There's certainly nothing that precludes doing both extension types and
> adding metadata support for SchemaElements.
>
> --
> Andrew Bell
> [email protected]
>
