Re: [PR] GH-731: Avro adapter, output dictionary-encoded fields as enums [arrow-java]

2025-07-07 Thread via GitHub


lidavidm merged PR #779:
URL: https://github.com/apache/arrow-java/pull/779


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-731: Avro adapter, output dictionary-encoded fields as enums [arrow-java]

2025-07-07 Thread via GitHub


martin-traverse commented on code in PR #779:
URL: https://github.com/apache/arrow-java/pull/779#discussion_r2190741283


##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/producers/DictionaryDecodingProducer.java:
##
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.adapter.avro.producers;
+
+import java.io.IOException;
+import org.apache.arrow.vector.BaseIntVector;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.avro.io.Encoder;
+
+/**
+ * Producer that produces decoded values from a dictionary-encoded {@link 
BaseIntVector}, writes

Review Comment:
   Hi @lidavidm - sorry I totally missed this comment! I have pushed a change 
with updated wording.
   
   Please let me know if you are happy to merge or if you would like any more 
changes. Meanwhile I will open the PR for Avro container files - that's the bit 
that will really make all this usable!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-731: Avro adapter, output dictionary-encoded fields as enums [arrow-java]

2025-06-23 Thread via GitHub


lidavidm commented on code in PR #779:
URL: https://github.com/apache/arrow-java/pull/779#discussion_r2160568006


##
adapter/avro/src/main/java/org/apache/arrow/adapter/avro/producers/DictionaryDecodingProducer.java:
##
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.adapter.avro.producers;
+
+import java.io.IOException;
+import org.apache.arrow.vector.BaseIntVector;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.avro.io.Encoder;
+
+/**
+ * Producer that produces decoded values from a dictionary-encoded {@link 
BaseIntVector}, writes

Review Comment:
   Wording nit, but this sounds like the encoded values are ints; instead the 
encoded values are arbitrary and the encod**ing** is ints, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-731: Avro adapter, output dictionary-encoded fields as enums [arrow-java]

2025-06-11 Thread via GitHub


lidavidm commented on PR #779:
URL: https://github.com/apache/arrow-java/pull/779#issuecomment-2964589014

   Thanks - I will try to get to reviewing this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-731: Avro adapter, output dictionary-encoded fields as enums [arrow-java]

2025-06-07 Thread via GitHub


martin-traverse commented on PR #779:
URL: https://github.com/apache/arrow-java/pull/779#issuecomment-2953028800

   Hi @lidavidm  - I have added the dictionary decoding producer, which turned 
out to be very simple. Now any dictionary encoded fields that are not valid 
Avro enums will be automatically decoded and output as their concrete type. 
This does require running a regex over the dictionary entries, but that only 
has to happen once when the producers are set up.
   
   I do think we will need to change the enum read, so dictionaries are 
populated during the schema phase rather than the data phase in order to read 
whole files with multiple blocks. I'd like to keep that change back and do it 
as part of the next PR, which will be read / write for whole files.
   
   Assuming you are happy with both these points then I think this PR is ready 
for review  :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-731: Avro adapter, output dictionary-encoded fields as enums [arrow-java]

2025-06-03 Thread via GitHub


martin-traverse commented on PR #779:
URL: https://github.com/apache/arrow-java/pull/779#issuecomment-2933969515

   > So essentially, the "actual" vector would be the dictionary values, and 
the index vector would contain the "original" encoded vector? I think that 
sounds OK to me.
   
   Yes - essentially I want to avoid replicating the type-specific logic. I 
tried it out and looked ok, just not sure what happens with resetValueVector() 
though. Another option could be a DictionaryDecodingProducer, then the 
vector type for that producer would be the index type, with a concrete producer 
as a child. Dictionaries can't be updated in Avro anyway I think it works 
:) Will update the PR later for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-731: Avro adapter, output dictionary-encoded fields as enums [arrow-java]

2025-06-02 Thread via GitHub


lidavidm commented on PR #779:
URL: https://github.com/apache/arrow-java/pull/779#issuecomment-2933011950

   (You could also perhaps create a vector implementation that does this 
unwrapping? Though likely that's more work.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-731: Avro adapter, output dictionary-encoded fields as enums [arrow-java]

2025-06-02 Thread via GitHub


lidavidm commented on PR #779:
URL: https://github.com/apache/arrow-java/pull/779#issuecomment-2933011515

   > On reflection, simply decoding a dict-encoded vector is not an option 
because it will leak. I have pushed an update that will reject dict-encoded 
fields that are not valid Avro enums (check is during schema production).
   > 
   > I do think automatically decoding dict-encoded fields would be a nicer 
behaviour. Should we perhaps think about updating the producer pattern to allow 
for this? E.g. every producer could take an optional index vector, of type 
BaseIntVector. If that is present, current index is mapped through the index 
vector which resolves the dictionary index. Would this be ok?
   
   So essentially, the "actual" vector would be the dictionary values, and the 
index vector would contain the "original" encoded vector? I think that sounds 
OK to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-731: Avro adapter, output dictionary-encoded fields as enums [arrow-java]

2025-06-01 Thread via GitHub


martin-traverse commented on PR #779:
URL: https://github.com/apache/arrow-java/pull/779#issuecomment-2927707540

   On reflection, simply decoding a dict-encoded vector is not an option 
because it will leak. I have pushed an update that will reject dict-encoded 
fields that are not valid Avro enums (check is during schema production).
   
   I do think automatically decoding dict-encoded fields would be a nicer 
behaviour. Should we perhaps think about updating the producer pattern to allow 
for this? E.g. every producer could take an optional index vector, of type 
BaseIntVector. If that is present, current index is mapped through the index 
vector which resolves the dictionary index. Would this be ok?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-731: Avro adapter, output dictionary-encoded fields as enums [arrow-java]

2025-06-01 Thread via GitHub


github-actions[bot] commented on PR #779:
URL: https://github.com/apache/arrow-java/pull/779#issuecomment-2927605282

   
   
   Thank you for opening a pull request!
   
   Please label the PR with one or more of:
   
   - bug-fix
   - chore
   - dependencies
   - documentation
   - enhancement
   
   Also, add the 'breaking-change' label if appropriate.
   
   See 
[CONTRIBUTING.md](https://github.com/apache/arrow-java/blob/main/CONTRIBUTING.md)
 for details.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]