On 2023/06/15 16:24:44 Joris Van den Bossche wrote:
> Hi all,
>
> Bringing up https://github.com/apache/arrow/issues/35746 to the
> mailing list: this issue proposes to bump the default Parquet version
> we use for writing to Parquet files in the C++ library (and in the
> various bindings including pyarrow and R arrow) from the current
> default of "2.4" to "2.6".
>
> In practice, the only change is that the writer will, by default,
> write the Timestamp LogicalType with NANOS unit
> (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp)
> if your data uses timestamp("ns") (currently, such data gets coerced
> to microsecond resolution when writing to Parquet).
>
> In theory this could cause compatibility issues if the files you are
> writing need to be read by other Parquet implementations which don't
> yet support nanoseconds. But the Parquet format 2.6 was released in
> Sept 2018, and parquet-mr added support for it in 2018 as well.
>
> Unless there is pushback on this, we are currently planning to make
> this change for the upcoming Arrow 13.0.0 release.
>
> Best,
> Joris
>

In our current codebase, users can choose any of these format versions:
1. Parquet 1.0
2. Parquet 2.0 (deprecated; similar to Parquet 2.6, and may enable all
kinds of 2.0 features)
3. Parquet 2.4 (released in October 2017, enables the UINT32 logical type)
4. Parquet 2.6 (released in September 2018, enables NANOS)

I think switching to 2.6 with NANOS might break some legacy readers, but
most readers now support reading NANOS, so I'm +1 on this proposal.

Best wishes,
Xuwei Fu
