martin-traverse opened a new pull request, #718: URL: https://github.com/apache/arrow-java/pull/718
Hi @lidavidm - here is part 2 in my Avro series. Apologies for the delay, it's the usual work / contention story!

## What's Changed

This PR relates to #698 and is the second in a series intended to provide full Avro read / write support in native Java. It adds round-trip tests for both schemas (Arrow schema -> Avro -> Arrow) and data (Arrow VSR -> Avro block -> Arrow VSR). It also adds a number of fixes and improvements to the Avro Consumers so that data arrives back in its original form after a round trip.

The main changes are:

* Added a top-level method in AvroToArrow to convert an Avro schema directly to an Arrow schema (this may exist elsewhere, but is needed to provide an API that matches the logic of this implementation).
* Avro unions of [ type, null ] or [ null, type ] now have special handling: they are interpreted as a single nullable type rather than a union. Because this change is quite significant I added a flag "handleNullable" to the AvroToArrowConfig object. It is debatable whether the old behaviour (treating these as literal unions) is ever the expected result; if it is not, this flag could be removed. Unions with more than 2 elements are interpreted literally (but, per #108, in practice Java's current Union implementation is probably not usable with Avro atm).
* Added support for new logical types (decimal 256, timestamp nano and 3 local timestamp types).
* Existing timestamp-millis and timestamp-micros types are now interpreted as zone-aware. Previously they were interpreted as local, but now only the local timestamp types are interpreted as local. This is a breaking change, but I think it is correct per the [Avro spec](https://avro.apache.org/docs/1.12.0/specification/#timestamps).
* Removed namespaces from generated Arrow field names in complex types. E.g. the Avro field myNamespace.outerRecord.structField.intField should be called just "intField" inside the Arrow struct. This doesn't affect the skip field logic, which still works using the qualified names. This is a breaking change.
* Removed unexpected metadata in generated Arrow fields (empty alias lists and attributes interpreted as part of the field schema). This is a breaking change.
* Used the expected child vector names for Arrow LIST and MAP types when reading. For LIST, the default child vector is called "$data$", which is illegal in Avro, so the child field name is also changed to "item" in the producers. This is a breaking change.

**This contains breaking changes.** More of the breaking items above could be moved behind flags in the config object if necessary. The change for zone-aware vs local timestamps cannot be (not easily at least), because the types are different. On balance my view was to treat most of these as "fixes", but I am very happy to take some guidance on this point!

Closes #698.

This change is meant to allow for round trip of schemas and individual Avro data blocks (one Avro data block -> one VSR). File-level capabilities are not included. I have not included anything to recycle the VSR as part of the read API; this feels like it belongs with the file-level piece. Also, I have not done anything specific for enums / dict encoding as of yet.
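
For illustration only, here is a minimal sketch of two of the rules described above: collapsing a two-branch union containing null into a single nullable type, and dropping the namespace from a field name inside a complex type. This is not the code in the PR. The class `AvroRoundTripSketch`, the methods `nullableBranch` and `simpleName`, and the use of plain strings in place of Avro schema objects are all hypothetical; only the flag name "handleNullable" comes from the description, and here it is modelled as a simple boolean parameter rather than a config object.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only: plain strings stand in for Avro schema types,
// and none of these class or method names exist in arrow-java.
class AvroRoundTripSketch {

    // Returns the single non-null branch when the union is [type, null] or
    // [null, type] and the handleNullable switch is on. An empty result
    // means the union should be kept as a literal union.
    public static Optional<String> nullableBranch(List<String> branches, boolean handleNullable) {
        if (!handleNullable || branches.size() != 2) {
            return Optional.empty();
        }
        boolean firstIsNull = "null".equals(branches.get(0));
        boolean secondIsNull = "null".equals(branches.get(1));
        if (firstIsNull == secondIsNull) {
            // Neither branch (or both) is null, so this is not a nullable pair.
            return Optional.empty();
        }
        return Optional.of(firstIsNull ? branches.get(1) : branches.get(0));
    }

    // Keeps only the part of a qualified name after the last dot.
    public static String simpleName(String qualifiedName) {
        int lastDot = qualifiedName.lastIndexOf('.');
        return lastDot < 0 ? qualifiedName : qualifiedName.substring(lastDot + 1);
    }

    public static void main(String[] args) {
        // Two-branch unions with one null collapse to the other branch.
        System.out.println(nullableBranch(Arrays.asList("null", "long"), true));          // Optional[long]
        System.out.println(nullableBranch(Arrays.asList("string", "null"), true));        // Optional[string]
        // Larger unions, or handleNullable = false, stay literal unions.
        System.out.println(nullableBranch(Arrays.asList("null", "int", "string"), true)); // Optional.empty
        System.out.println(nullableBranch(Arrays.asList("null", "long"), false));         // Optional.empty
        // Qualified Avro name reduced to the simple field name.
        System.out.println(simpleName("myNamespace.outerRecord.structField.intField"));   // intField
    }
}
```

In this sketch, the order of the two branches does not matter, which mirrors the [ type, null ] / [ null, type ] wording above. The qualified name is left untouched by the caller, so anything that matches on qualified names (such as skip-field logic) could continue to do so.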