jorisvandenbossche commented on code in PR #382: URL: https://github.com/apache/arrow-site/pull/382#discussion_r1272477529
########## _posts/2023-07-17-13.0.0-release.md: ########## @@ -0,0 +1,324 @@ +--- +layout: post +title: "Apache Arrow 13.0.0 Release" +date: "2023-07-17 00:00:00" +author: pmc +categories: [release] +--- +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + + +The Apache Arrow team is pleased to announce the 13.0.0 release. This covers +over 3 months of development work and includes [**XXX resolved issues**][1] +from [**YYY distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 12.0.0 release, Marco Neumann, Gang Wu, Mehmet Ozan Kabak and Kevin Gurney +have been invited to be committers. +Matt Topol, Jie Wen, Ben Baumgold and Dewey Dunnington have joined the +Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +The [run-end encoded layout](https://arrow.apache.org/docs/dev/format/Columnar.html#run-end-encoded-layout) +has been added. 
This layout can allow data with long runs of duplicate values to be encoded and processed +efficiently. Initial support has been added for [C++](https://github.com/apache/arrow/pull/14179) and +[Go](https://github.com/apache/arrow/pull/14223). + +### C Device Data Interface + +An **experimental** new specification, the +[C Device Data Interface](https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html), +has been accepted for inclusion (GH-34971). It builds on the existing +C Data Interface to provide a runtime-agnostic zero-copy sharing mechanism +for Arrow data residing on non-CPU devices. + +Reference implementations of the C Device Data Interface will progressively +be added to the standard Arrow libraries after the 13.0.0 release. + +## Arrow Flight RPC notes + +Support for flagging ordered result sets to clients has been added. ([#34852](https://github.com/apache/arrow/issues/34852)) + +gRPC 1.30 is now the minimum supported version in C++/Python/R/etc. ([#34679](https://github.com/apache/arrow/issues/36479)) + +In C++, various methods now receive a full `ServerCallContext` ([#35442](https://github.com/apache/arrow/issues/35442), [#35377](https://github.com/apache/arrow/issues/35377)) and the context now exposes headers sent by the client ([#35375](https://github.com/apache/arrow/issues/35375)). + +## C++ notes + +### Building + +CMake 3.16 or later is now required for building Arrow C++ (GH-34921). + +Optimizations are not disabled anymore when the `RelWithDebInfo` build type +is selected (GH-35850). Furthermore, compiler flags can now properly be +customized per-build type using `ARROW_C_FLAGS_DEBUG`, `ARROW_CXX_FLAGS_DEBUG` +and related variables (GH-35870). + +### Acero + +Handling of unaligned buffers in input nodes can be configured programmatically +or by setting the environment variable `ACERO_ALIGNMENT_HANDLING`. The default +behavior is to warn when an unaligned buffer is detected (GH-35498). 
+ +### Compute + +Several new functions have been added: +* aggregate functions "first", "last", "first_last" (GH-34911); +* vector functions "cumulative_prod", "cumulative_min", "cumulative_max" (GH-32190); +* vector function "pairwise_diff" (GH-35786). + +Sorting now works on dictionary arrays, with much better performance than +the naive approach of sorting the decoded dictionary (GH-29887). Sorting also +works on struct arrays, and nested sort keys are supported using `FieldRef` (GH-33206). + +The `check_overflow` option has been removed from `CumulativeSumOptions` as +it was redundant with the availability of two different functions: +"cumulative_sum" and "cumulative_sum_checked" (GH-35789). + +Run-end encoded filters are efficiently supported (GH-35749). + +Duration types are supported with the "is_in" and "index_in" functions (GH-36047). +They can be multiplied with all integer types (GH-36128). + +"is_in" and "index_in" now cast their inputs more flexibly: they first attempt +to cast the value set to the input type, then in the other direction if the +former fails (GH-36203). + +Multiple bugs have been fixed in "utf8_slice_codeunits" when the `stop` option +is omitted (GH-36311). + +### Dataset + +A custom schema can now be passed when writing a dataset (GH-35730). The custom +schema can alter nullability or metadata information, but is not allowed to +change the datatypes written. + +### Filesystems + +The S3 filesystem now writes files in equal-sized chunks, for compatibility with +Cloudflare's "R2" Storage (GH-34363). + +A long-standing issue where S3 support could crash at shutdown because of resources +still being alive after S3 finalization has been fixed (GH-36346). Now, attempts +to use S3 resources (such as making filesystem calls) after S3 finalization should +result in a clean error. + +The GCS filesystem accepts a new option to set the project id (GH-36227). 
+ +### IPC + +Nullability and metadata information for sub-fields of map types is now preserved +when deserializing Arrow IPC (GH-35297). + +### Orc + +The Orc adapter now maps Arrow field metadata to Orc type attributes when writing, +and vice-versa when reading (GH-35304). + +### Parquet + +It is now possible to write additional metadata while a `ParquetFileWriter` is +open (GH-34888). + +Writing a page index can be enabled selectively per-column (GH-34949). +In addition, page header statistics are not written anymore if the page +index is enabled for the given column (GH-34375), as the information would +be redundant and less efficiently accessed. + +Parquet writer properties allow specifying the sorting columns (GH-35331). +The user is responsible for ensuring that the data written to the file +actually complies with the given sorting. + +CRC computation has been implemented for v2 data pages (GH-35171). +It was already implemented for v1 data pages. + +Writing compliant nested types is now enabled by default (GH-29781). This +should not have any negative implication. + +Attempting to load a subset of an Arrow extension type is now forbidden +(GH-20385). Previously, if an extension type's storage is nested (for example +a "Point" extension type backed by a `struct<x: float64, y: float64>`), +it was possible to load selectively some of the columns of the storage type. + +### Substrait + +Support for various functions has been added: "stddev", "variance", "first", +"last" (GH-35247, GH-35506). + +Deserializing sorts is now supported (GH-32763). However, some features, +such as clustered sort direction or custom sort functions, are not +implemented. + +### Miscellaneous + +`FieldRef` sports additional methods to get a flattened version of nested +fields (GH-14946). 
Compared to their non-flattened counterparts, +the methods `GetFlattened`, `GetAllFlattened`, `GetOneFlattened` and +`GetOneOrNoneFlattened` combine a child's null bitmap with its ancestors' +null bitmaps so as to compute the field's overall logical validity bitmap. + +In other words, given the struct array `[null, {'x': null}, {'x': 5}]`, +`FieldRef("x")::Get` might return `[0, null, 5]` +while `FieldRef("x")::GetFlattened` will *always* return `[null, null, 5]`. + +`Scalar::hash()` has been fixed for sliced nested arrays (GH-35360). + +A new floating-point to decimal conversion algorithm exhibits much better +precision (GH-35576). + +It is now possible to cast between scalars of different list-like types +(GH-36309). + +## C# notes + +### Enhancements + +* The [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html) is now supported in the .NET Apache.Arrow library. The main entry points are `CArrowArrayImporter.ImportArray`, `CArrowArrayExporter.ExportArray`, `CArrowArrayStreamImporter.ImportArrayStream`, and `CArrowArrayStreamExporter.ExportArrayStream` in the `Apache.Arrow.C` namespace. ([GH-33856](https://github.com/apache/arrow/issues/33856), [GH-33857](https://github.com/apache/arrow/issues/33857), [GH-36120](https://github.com/apache/arrow/issues/36120), and [GH-35809](https://github.com/apache/arrow/issues/35809)). 
+ +* `ArrowBuffer.BitmapBuilder` adds `Append(ReadOnlySpan<byte> source, int validBits)` and `AppendRange(bool value, int length)` to improve performance of array concatenation ([GH-32605](https://github.com/apache/arrow/issues/32605)) + +### Bug Fixes + +* TotalBytes and TotalRecords are now being serialized in FlightInfo ([GH-35267](https://github.com/apache/arrow/issues/35267)) +## Go notes + +### Enhancements + +#### Arrow + +* Compute arithmetic functions are now available for Float16 ([GH-35162](https://github.com/apache/arrow/issues/35162)) +* Float16, Large* and Fixed types are all now supported by the CSV reader/writer ([GH-36105](https://github.com/apache/arrow/issues/36105) and [GH-36141](https://github.com/apache/arrow/issues/36141)) +* CSV Reader uses `AppendValueFromString` for extension types and properly reads empty values as null ([GH-35188](https://github.com/apache/arrow/issues/35188) and [GH-35190](https://github.com/apache/arrow/issues/35190)) +* [Substrait](https://github.com/substrait-io/substrait-go) expressions can now be executed using the Compute library ([GH-35652](https://github.com/apache/arrow/issues/35652)) +* You can now read back values from Dictionary Builders before finishing the array ([GH-35711](https://github.com/apache/arrow/issues/35711)) +* `MapType.ValueField` and `MapType.ValueType` are now deprecated in favor of `MapType.Elem().(*StructType)` ([GH-35909](https://github.com/apache/arrow/issues/35909)) +* Multiple equality functions which have been deprecated since v9 have now been removed (Such as `array.ArraySliceEqual` in favor of `array.SliceEqual`) ([GH-36198](https://github.com/apache/arrow/issues/36198)) +* `ValueStr` method on Timestamp arrays now includes the zone in the output ([GH-36568](https://github.com/apache/arrow/issues/36568)) +* *BREAKING CHANGE* `FixedSizeListBuilder.AppendNull` no longer requires manually appending nulls to the underlying list ([GH-35482](https://github.com/apache/arrow/issues/35482)) 
+ +#### Flight + +* FlightSQL driver supports non-prepared queries now ([GH-35136](https://github.com/apache/arrow/issues/35136)) + +#### Parquet + +* Error messages in row group writer have been improved ([GH-36319](https://github.com/apache/arrow/issues/36319)) + +### Bug Fixes + +* Cross architecture build failures with v12.0.1 have been fixed ([GH-36052](https://github.com/apache/arrow/issues/36052)) + +#### Arrow + +* It is now possible to build the Arrow Go lib using tinygo for building smaller WASM binaries ([GH-32832](https://github.com/apache/arrow/issues/32832)) +* `Fields` method for Schema and StructType now returns a copy of the slice to ensure immutability ([GH-35306](https://github.com/apache/arrow/issues/35306) and [GH-35866](https://github.com/apache/arrow/issues/35866)) +* `array.ApproxEqual` for Maps now allows entries for a given element to be presented in any order ([GH-35828](https://github.com/apache/arrow/issues/35828)) +* Fix issues with decimal256 arrays ([GH-35911](https://github.com/apache/arrow/issues/35911), [GH-35965](https://github.com/apache/arrow/issues/35965), and [GH-35975](https://github.com/apache/arrow/issues/35975)) +* StructType now allows duplicate field names correctly ([GH-36014](https://github.com/apache/arrow/issues/36014)) + +#### Flight + +* Fix crash in client middleware ([GH-35240](https://github.com/apache/arrow/issues/35240)) + +#### Parquet + +* Various memory leaks addressed in pqarrow package ([GH-35015](https://github.com/apache/arrow/issues/35015)) +* Fixed panic for `ListOf(types)` if null ([GH-35684](https://github.com/apache/arrow/issues/35684)) + + +## Java notes + +The JNI bindings for Arrow Dataset now support executing [Substrait](https://substrait.io/) plans via the [Acero](https://arrow.apache.org/docs/dev/cpp/streaming_execution.html) query engine. 
([#34223](https://github.com/apache/arrow/issues/34223)) + +Arrow packages that depend on Netty (most notably, `arrow-memory-netty`, but also Arrow Flight) now require Netty 4.1.94.Final at a minimum. ([#36209](https://github.com/apache/arrow/issues/36209)) + +`VectorSchemaRoot#slice` now always makes a copy, including when the slice covers all rows (previously it did not make a copy in this case). This is a potentially-breaking change if your application depended on the old behavior. ([#35275](https://github.com/apache/arrow/issues/35275)) + +Debug info for allocations is no longer automatically enabled when assertions are enabled (e.g. when running unit tests). Instead, support must be explicitly enabled. This is not quite a breaking change, but may be surprising if you are used to using this information while debugging tests. However, performance should be greatly improved while running tests. ([#34338](https://github.com/apache/arrow/issues/34338)) + +Support for the upcoming Java 21 was added, though we do not yet test this in CI ([#35053](https://github.com/apache/arrow/issues/35053)). The JNI bindings for Arrow Dataset now expose JSON support ([#36421](https://github.com/apache/arrow/issues/36421)). Dictionary replacement is now supported when writing the IPC stream format ([#18547](https://github.com/apache/arrow/issues/18547)). + +## JavaScript notes + +## Python notes + +Compatibility notes: + +* The default format version for Parquet has been bumped from 2.4 to 2.6 [GH-35746](https://github.com/apache/arrow/issues/35746)

Review Comment:

```suggestion
* The default format version for Parquet has been bumped from 2.4 to 2.6 [GH-35746](https://github.com/apache/arrow/issues/35746). In practice, this means that nanosecond timestamps now preserve their resolution instead of being converted to microseconds.
```
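To make the point of the suggestion concrete, here is a minimal sketch in plain Python of what is lost when a nanosecond timestamp is converted to microseconds. It deliberately does not use the Arrow or Parquet APIs; the helper names `ns_to_us` and `us_to_ns` are hypothetical and only illustrate the arithmetic, assuming non-negative epoch timestamps.

```python
# Illustration only: integer arithmetic on epoch timestamps, not Arrow code.
NS_PER_US = 1_000  # nanoseconds per microsecond


def ns_to_us(ts_ns: int) -> int:
    """Truncate a non-negative nanosecond epoch timestamp to microseconds."""
    return ts_ns // NS_PER_US


def us_to_ns(ts_us: int) -> int:
    """Widen a microsecond epoch timestamp back to nanoseconds."""
    return ts_us * NS_PER_US


# 2023-07-17 00:00:00.123456789 UTC, expressed in nanoseconds since the epoch.
ts_ns = 1_689_552_000_123_456_789

# Round-tripping through microseconds drops the last three digits.
round_tripped = us_to_ns(ns_to_us(ts_ns))
lost_ns = ts_ns - round_tripped

print(round_tripped)  # 1689552000123456000
print(lost_ns)        # 789
```

A file format that stores the timestamp at nanosecond resolution avoids this loss, which is the practical effect the suggested sentence describes.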