[PR] GH-698: Improve and fix Avro read consumers [arrow-java]

via GitHub Tue, 15 Apr 2025 09:07:42 -0700


martin-traverse opened a new pull request, #718:
URL: https://github.com/apache/arrow-java/pull/718


   Hi @lidavidm - here is part 2 in my Avro series. Apologies for the delay, 
it's the usual work / contention story!
   
   ## What's Changed
   
   This PR relates to #698 and is the second in a series intended to provide 
full Avro read / write support in native Java. It adds round-trip tests for 
both schemas (Arrow schema -> Avro -> Arrow) and data (Arrow VSR -> Avro block 
-> Arrow VSR). It also adds a number of fixes and improvements to the Avro 
Consumers so that data arrives back in its original form after a round trip. 
The main changes are:
   
   * Added a top level method in AvroToArrow to convert Avro schema directly to 
Arrow schema (this may exist elsewhere, but is needed to provide an API that 
matches the logic of this implementation)
   * Avro unions of [ type, null ] or [ null, type ] now have special handling: 
they are interpreted as a single nullable type rather than a union. Because 
this change is quite significant I added a flag "handleNullable" to the 
AvroToArrowConfig object. It is debatable whether the old behaviour (treating 
these as literal unions) is ever the expected result; if it is not, this flag 
could be removed. Unions with more than 2 elements are interpreted literally 
(but, per #108, in practice Java's current Union implementation is probably not 
usable with Avro at the moment).
   * Added support for new logical types (decimal 256, timestamp nano and 3 
local timestamp types)
   * Existing timestamp-millis and timestamp-micros types are now interpreted 
as zone-aware. Previously they were interpreted as local, but now only the 
local timestamp types are interpreted as local. This is a breaking change, but 
I think it is correct per the [Avro 
spec](https://avro.apache.org/docs/1.12.0/specification/#timestamps).
   * Removed namespaces from generated Arrow field names in complex types. E.g. 
the Avro field myNamespace.outerRecord.structField.intField should be called 
just "intField" inside the Arrow struct. This doesn't affect the skip field 
logic, which still works using the qualified names. This is a breaking change.
   * Remove unexpected metadata in generated Arrow fields (empty alias lists 
and attributes interpreted as part of the field schema). This is a breaking 
change.
   * Use the expected child vector names for Arrow LIST and MAP types when 
reading. For LIST, the default child vector is called "$data$" which is illegal 
in Avro, so the child field name is also changed to "item" in the producers. 
This is a breaking change.
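
The nullable-union rule described in the list above can be sketched in plain Java. This is only an illustration of the decision logic, not the code in this PR: the class and method names are hypothetical, Avro schemas are stood in for by their type names as strings, and the `handleNullable` parameter simply mirrors the config flag mentioned above.

```java
import java.util.List;
import java.util.Optional;

/** Illustrative sketch only; names are hypothetical and not part of arrow-java. */
final class NullableUnionSketch {

    /**
     * Returns the single non-null branch when the union is exactly
     * [type, null] or [null, type] and the flag is enabled.
     * Returns empty when the union should be treated as a literal union.
     */
    static Optional<String> nullableBranch(List<String> unionTypes, boolean handleNullable) {
        // Only two-element unions qualify; larger unions are taken literally.
        if (!handleNullable || unionTypes.size() != 2) {
            return Optional.empty();
        }
        String first = unionTypes.get(0);
        String second = unionTypes.get(1);
        if ("null".equals(first) && !"null".equals(second)) {
            return Optional.of(second);
        }
        if ("null".equals(second) && !"null".equals(first)) {
            return Optional.of(first);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Two-element union containing null: collapses to a nullable "int".
        System.out.println(nullableBranch(List.of("null", "int"), true));
        // Three-element union: left as a literal union.
        System.out.println(nullableBranch(List.of("null", "int", "string"), true));
        // Flag disabled: old behaviour, always a literal union.
        System.out.println(nullableBranch(List.of("null", "int"), false));
    }
}
```

A reader that gets a non-empty result would create one nullable Arrow field of the returned type, and would otherwise fall back to building a union.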
   
   **This contains breaking changes.**
   
   More of the breaking items above could be moved behind flags in the config 
object if necessary. The change for zone-aware vs local timestamps cannot be 
moved behind a flag (not easily at least), because the types are different. On 
balance my view was to treat most of these as "fixes", but I am very happy to 
take some guidance on this point!
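
To illustrate why the zone-aware and local timestamp types are not interchangeable, here is a minimal sketch using only `java.time`. It is an analogy rather than the PR's code: the class and method names are hypothetical, and mapping the two Avro logical types onto `Instant` and `LocalDateTime` is my assumption about the closest JDK equivalents.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

/** Illustrative sketch only; names are hypothetical and not part of arrow-java. */
final class TimestampSemanticsSketch {

    /** timestamp-millis: a point on the global timeline, independent of any zone. */
    static Instant zoneAware(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    /** local-timestamp-millis: a wall-clock reading with no zone attached. */
    static LocalDateTime local(long localMillis) {
        // Decode against UTC purely to turn the count into calendar fields;
        // the result carries no zone information.
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(localMillis), ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        long value = 0L;
        // The same instant shows a different wall-clock time in each zone.
        System.out.println(zoneAware(value).atOffset(ZoneOffset.UTC).toLocalDateTime());
        System.out.println(zoneAware(value).atOffset(ZoneOffset.ofHours(-5)).toLocalDateTime());
        // The local value is the same wall-clock reading wherever it is read.
        System.out.println(local(value));
    }
}
```

The same stored number therefore means different things under the two interpretations, which is why this change is hard to hide behind a config flag.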
   
   Closes #698.
   
   This change is meant to allow for round trip of schemas and individual Avro 
data blocks (one Avro data block -> one VSR). File-level capabilities are not 
included. I have not included anything to recycle the VSR as part of the read 
API, since this feels like it belongs with the file-level piece. I also have 
not done anything specific for enums / dict encoding yet.
   


