Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-22 Thread via GitHub


lidavidm merged PR #718:
URL: https://github.com/apache/arrow-java/pull/718


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-22 Thread via GitHub


martin-traverse commented on PR #718:
URL: https://github.com/apache/arrow-java/pull/718#issuecomment-2822728753

   > Looks like there are some lint errors to be fixed
   
   Apologies - I have reapplied spotless, should be ok now!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-20 Thread via GitHub


lidavidm commented on PR #718:
URL: https://github.com/apache/arrow-java/pull/718#issuecomment-2817591251

   Looks like there are some lint errors to be fixed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-20 Thread via GitHub


lidavidm commented on code in PR #718:
URL: https://github.com/apache/arrow-java/pull/718#discussion_r2051858753


##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/AvroToArrowUtils.java:
##
@@ -277,11 +333,17 @@ private static Consumer createConsumer(
 break;
   case BYTES:
 if (logicalType instanceof LogicalTypes.Decimal) {
-  arrowType = createDecimalArrowType((LogicalTypes.Decimal) 
logicalType);
+  LogicalTypes.Decimal decimalType = (LogicalTypes.Decimal) 
logicalType;
+  arrowType = createDecimalArrowType(decimalType, schema);
   fieldType =
   new FieldType(nullable, arrowType, /* dictionary= */ null, 
getMetaData(schema));
   vector = createVector(consumerVector, fieldType, name, allocator);
-  consumer = new 
AvroDecimalConsumer.BytesDecimalConsumer((DecimalVector) vector);
+  if (decimalType.getPrecision() <= 38) {

Review Comment:
   I think FIXED is okay to allow round-trip by default even if it's not 
technically as compact.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-18 Thread via GitHub


martin-traverse commented on code in PR #718:
URL: https://github.com/apache/arrow-java/pull/718#discussion_r2050405594


##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/AvroToArrowConfig.java:
##
@@ -81,4 +114,8 @@ public DictionaryProvider.MapDictionaryProvider 
getProvider() {
   public Set getSkipFieldNames() {
 return skipFieldNames;
   }
+
+  public boolean isHandleNullable() {

Review Comment:
   This is now part of the legacyMode parameter



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-18 Thread via GitHub


martin-traverse commented on PR #718:
URL: https://github.com/apache/arrow-java/pull/718#issuecomment-2815088209

   I have removed the breaking changes text from the headline text but am not 
able to remove the label.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-18 Thread via GitHub


martin-traverse commented on code in PR #718:
URL: https://github.com/apache/arrow-java/pull/718#discussion_r2050413561


##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/AvroToArrowUtils.java:
##
@@ -277,11 +333,17 @@ private static Consumer createConsumer(
 break;
   case BYTES:
 if (logicalType instanceof LogicalTypes.Decimal) {
-  arrowType = createDecimalArrowType((LogicalTypes.Decimal) 
logicalType);
+  LogicalTypes.Decimal decimalType = (LogicalTypes.Decimal) 
logicalType;
+  arrowType = createDecimalArrowType(decimalType, schema);
   fieldType =
   new FieldType(nullable, arrowType, /* dictionary= */ null, 
getMetaData(schema));
   vector = createVector(consumerVector, fieldType, name, allocator);
-  consumer = new 
AvroDecimalConsumer.BytesDecimalConsumer((DecimalVector) vector);
+  if (decimalType.getPrecision() <= 38) {

Review Comment:
   For FIXED decimals I am using the fixedSize to choose the decimal type. For 
BYTES decimals yes they would just come back as the smaller type if the 
precision fits.
   
   I used FIXED as the default output for decimals in the producers, because it 
is closer to the Arrow representation, but on reflection Avro as a format is 
very focused on keeping data compact, maybe BYTES makes more sense. It seems to 
be the default choice in Avro. Do you think we should go with that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-18 Thread via GitHub


martin-traverse commented on code in PR #718:
URL: https://github.com/apache/arrow-java/pull/718#discussion_r2050407127


##
adapter/avro/src/test/java/org/apache/arrow/adapter/avro/RoundTripSchemaTest.java:
##


Review Comment:
   I have factored out the common code as a helper. Don't think it can go all 
the way to being parameterised because the types need to be set up differently.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-18 Thread via GitHub


martin-traverse commented on code in PR #718:
URL: https://github.com/apache/arrow-java/pull/718#discussion_r2050405239


##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/producers/AvroNullableProducer.java:
##
@@ -21,8 +21,8 @@
 import org.apache.avro.io.Encoder;
 
 /**
- * Producer wrapper which producers nullable types to an avro encoder. Write 
the data to the
- * underlying {@link FieldVector}.
+ * Producer wrapper which producers nullable types to an avro encoder. Reed 
data from the underlying

Review Comment:
   ;-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-18 Thread via GitHub


martin-traverse commented on code in PR #718:
URL: https://github.com/apache/arrow-java/pull/718#discussion_r2050404883


##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/ArrowToAvroUtils.java:
##
@@ -332,7 +332,17 @@ private static  T buildBaseTypeSchema(
 
   case List:
   case FixedSizeList:
-return buildArraySchema(builder.array(), field, namespace);
+// Arrow uses "$data$" as the field name for list items, that is not a 
valid Avro name
+Field itemField = field.getChildren().get(0);
+if (ListVector.DATA_VECTOR_NAME.equals(itemField.getName())) {

Review Comment:
   Hm, I think for list / map types using the constant defined names for 
children makes sense, with "item" instead of "$data$" for list items. More 
generally, we could normalise illegal chars to "_" to match the [Avro name 
rules](https://avro.apache.org/docs/1.12.0/specification/#names). Per my 
understanding similar rules are already enforced in C++, but are not part of 
the Arrow spec or Java implementation.
   
   Very happy to put the normalisation in, it's probably a more useful 
behaviour than throwing an error in the adapter. Would you like me to do it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-18 Thread via GitHub


martin-traverse commented on PR #718:
URL: https://github.com/apache/arrow-java/pull/718#issuecomment-2814996409

   Ok, here is an update. I have put a flag "legacyMode" in the config object 
and used that to control all the places where the old logic is impacted. I have 
allowed decimal 256 to come through in legacy mode and also timestamp-nanos 
with the old semantics. The local-timestamp-xxx types do not come through in 
legacy mode, because timestamp-xxx is already treated as local. I have replaced 
the original code for the logical types test and those are all passing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-15 Thread via GitHub


martin-traverse commented on PR #718:
URL: https://github.com/apache/arrow-java/pull/718#issuecomment-2803951943

   > I think in the interest of trying to keep semver, we should avoid breaking 
changes if possible. Any thoughts @jbonofre @laurentgo? Or we could just call 
the next release 19.0.0...
   > 
   > If it helps we could just have a single flag for "old" behavior?
   
   A major version bump feels a bit extreme for something as small as this! Let 
me try the single flag approach. There might be complications with the change 
for zone-aware vs local timestamps, because the types are different, but I 
think so long as the check happens before the types are decided it should be ok.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



[PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-15 Thread via GitHub


martin-traverse opened a new pull request, #718:
URL: https://github.com/apache/arrow-java/pull/718

   Hi @lidavidm - here is part 2 in my Avro series, apologies for the delay, 
it's the usual work / contention story!
   
   ## What's Changed
   
   This PR relates to #698 and is the second in a series intended to provide 
full Avro read / write support in native Java. It adds round-trip tests for 
both schemas (Arrow schema -> Avro -> Arrow) and data (Arrow VSR -> Avro block 
-> Arrow VSR). It also adds a number of fixes and improvements to the Avro 
Consumers so that data arrives back in its original form after a round trip. 
The main changes are:
   
   * Added a top level method in AvroToArrow to convert Avro schema directly to 
Arrow schema (this may exist elsewhere, but is needed to provide an API that 
matches the logic of this implementation)
   * Avro unions of [ type, null ] or [ null, type ] now have special handling, 
these are interpreted as a single nullable type rather than a union. Because 
this change is quite significant I added a flag "handleNullable" to the 
AvroToArrowConfig object. It is debatable whether the old behaviour (treating 
these as literal unions) is ever the expected result, in which case this flag 
could be removed. Unions with more than 2 elements are interpreted literally 
(but, per #108, in practice Java's current Union implementation is probably not 
usable with Avro atm).
   * Added support for new logical types (decimal 256, timestamp nano and 3 
local timestamp types)
   * Existing timestamp-mills and timestamp-micros times now interpreted as 
zone-aware (previously they were interpreted as local, but now the local 
timestamp types are interpreted as local - this is a breaking change but I 
think it is correct per the [Avro 
spec](https://avro.apache.org/docs/1.12.0/specification/#timestamps))
   * Removed namespaces from generated Arrow field names in complex types. E.g. 
the Avro field myNamepsace.outerRecord.structField.intField should be called 
just "intField" inside the Arrow struct. This doesn't affect the skip field 
logic, which still works using the qualified names. This is a breaking change.
   * Remove unexpected metadata in generated Arrow fields (empty alias lists 
and attributes interpreted as part of the field schema). This is a breaking 
change.
   * Use the expected child vector names for Arrow LIST and MAP types when 
reading. For LIST, the default child vector is called "$data$" which is illegal 
in Avro, so the child field name is also changed to "item" in the producers. 
This is a breaking change.
   
   **This contains breaking changes.**
   
   More of the breaking items above could be moved behind flags in the config 
object if necessary. The change for zone-aware vs local timestamps cannot be 
(not easily at least), because the types are different. On balance my view was 
to treat most of these as "fixes", but, very happy to take some guidance on 
this point!!
   
   Closes #698 .
   
   This change is meant to allow for round trip of schemas and individual Avro 
data blocks (one Avro data block -> one VSR). File-level capabilities are not 
included. I have not included anything to recycle the VSR as part of the read 
API, this feels like it belongs with the file-level piece. Also I have not done 
anything specific for enums / dict encoding as of yet.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-15 Thread via GitHub


github-actions[bot] commented on PR #718:
URL: https://github.com/apache/arrow-java/pull/718#issuecomment-2802462275

   
   
   Thank you for opening a pull request!
   
   Please label the PR with one or more of:
   
   - bug-fix
   - chore
   - dependencies
   - documentation
   - enhancement
   
   Also, add the 'breaking-change' label if appropriate.
   
   See 
[CONTRIBUTING.md](https://github.com/apache/arrow-java/blob/main/CONTRIBUTING.md)
 for details.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-698: Improve and fix Avro read consumers [arrow-java]

2025-04-14 Thread via GitHub


lidavidm commented on code in PR #718:
URL: https://github.com/apache/arrow-java/pull/718#discussion_r2043326609


##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/producers/AvroNullableProducer.java:
##
@@ -21,8 +21,8 @@
 import org.apache.avro.io.Encoder;
 
 /**
- * Producer wrapper which producers nullable types to an avro encoder. Write 
the data to the
- * underlying {@link FieldVector}.
+ * Producer wrapper which producers nullable types to an avro encoder. Reed 
data from the underlying

Review Comment:
   ```suggestion
* Producer wrapper which produces nullable types to an avro encoder. Read 
data from the underlying
   ```



##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/ArrowToAvroUtils.java:
##
@@ -332,7 +332,17 @@ private static  T buildBaseTypeSchema(
 
   case List:
   case FixedSizeList:
-return buildArraySchema(builder.array(), field, namespace);
+// Arrow uses "$data$" as the field name for list items, that is not a 
valid Avro name

Review Comment:
   The funny thing is, arrow-java _shouldn't be doing that_, it was just never 
corrected...



##
adapter/avro/src/test/java/org/apache/arrow/adapter/avro/RoundTripSchemaTest.java:
##


Review Comment:
   Is there an opportunity to structure this as a parameterized test?



##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/ArrowToAvroUtils.java:
##
@@ -332,7 +332,17 @@ private static  T buildBaseTypeSchema(
 
   case List:
   case FixedSizeList:
-return buildArraySchema(builder.array(), field, namespace);
+// Arrow uses "$data$" as the field name for list items, that is not a 
valid Avro name
+Field itemField = field.getChildren().get(0);
+if (ListVector.DATA_VECTOR_NAME.equals(itemField.getName())) {

Review Comment:
   Or just normalize all field names to something consistent in Avro?



##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/AvroToArrowUtils.java:
##
@@ -277,11 +333,17 @@ private static Consumer createConsumer(
 break;
   case BYTES:
 if (logicalType instanceof LogicalTypes.Decimal) {
-  arrowType = createDecimalArrowType((LogicalTypes.Decimal) 
logicalType);
+  LogicalTypes.Decimal decimalType = (LogicalTypes.Decimal) 
logicalType;
+  arrowType = createDecimalArrowType(decimalType, schema);
   fieldType =
   new FieldType(nullable, arrowType, /* dictionary= */ null, 
getMetaData(schema));
   vector = createVector(consumerVector, fieldType, name, allocator);
-  consumer = new 
AvroDecimalConsumer.BytesDecimalConsumer((DecimalVector) vector);
+  if (decimalType.getPrecision() <= 38) {

Review Comment:
   Hmm, it's technically possible to have a decimal256 with a smaller precision 
though



##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/ArrowToAvroUtils.java:
##
@@ -332,7 +332,17 @@ private static  T buildBaseTypeSchema(
 
   case List:
   case FixedSizeList:
-return buildArraySchema(builder.array(), field, namespace);
+// Arrow uses "$data$" as the field name for list items, that is not a 
valid Avro name
+Field itemField = field.getChildren().get(0);
+if (ListVector.DATA_VECTOR_NAME.equals(itemField.getName())) {

Review Comment:
   Do we perhaps want to check for invalid names more generally and 
mangle/normalize them?



##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/AvroToArrowConfig.java:
##
@@ -81,4 +114,8 @@ public DictionaryProvider.MapDictionaryProvider 
getProvider() {
   public Set getSkipFieldNames() {
 return skipFieldNames;
   }
+
+  public boolean isHandleNullable() {

Review Comment:
   nit: could we perhaps get a more descriptive name for this parameter 
overall? "handleUnionOfNullAsNullable"? (As enterprise-java-y as that is...)



##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/AvroToArrowUtils.java:
##
@@ -277,11 +333,17 @@ private static Consumer createConsumer(
 break;
   case BYTES:
 if (logicalType instanceof LogicalTypes.Decimal) {
-  arrowType = createDecimalArrowType((LogicalTypes.Decimal) 
logicalType);
+  LogicalTypes.Decimal decimalType = (LogicalTypes.Decimal) 
logicalType;
+  arrowType = createDecimalArrowType(decimalType, schema);
   fieldType =
   new FieldType(nullable, arrowType, /* dictionary= */ null, 
getMetaData(schema));
   vector = createVector(consumerVector, fieldType, name, allocator);
-  consumer = new 
AvroDecimalConsumer.BytesDecimalConsumer((DecimalVector) vector);
+  if (decimalType.getPrecision() <= 38) {

Review Comment:
   I guess in that case it would round-trip to the smaller type?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log o