dntjr8096 opened a new issue, #11367:
URL: https://github.com/apache/iceberg/issues/11367
### Apache Iceberg version
1.4.3
### Query engine
Impala
### Please describe the bug 🐞
When migrating a Hive-Parquet table written via Impala or Hive to Iceberg
using the Spark command CALL catalog.system.migrate('hive_table'), reading the
data in Spark SQL fails due to schema compatibility issues.
1. Some Parquet-producing systems (e.g., Impala, Hive, older versions of
Spark SQL) do not differentiate between binary data and strings when writing
the Parquet schema. Spark SQL offers a flag to interpret binary data as strings
to provide compatibility with these systems.
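   For reference, a minimal sketch of enabling that flag, `spark.sql.parquet.binaryAsString`, from Spark's Java API (the app name and master here are placeholders):
```java
import org.apache.spark.sql.SparkSession;

// Minimal sketch: enable Spark SQL's binary-as-string compatibility flag.
// Note: this flag is honored by Spark's built-in Parquet data source; the
// Iceberg vectorized read path described below does not appear to consult
// it, which is why an equivalent fix is needed on the Iceberg side.
SparkSession spark = SparkSession.builder()
    .appName("binary-as-string-demo")                    // placeholder
    .master("local[*]")                                  // placeholder
    .config("spark.sql.parquet.binaryAsString", "true")
    .getOrCreate();
```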
2. If data was written to a Hive-Parquet table (e.g., hive_table) using
Impala, and the table had two columns (col1 as string, and col2 as string), the
Parquet row groups show null for the logical type:
```
Table Name: hive_table / parquet row group info
column_name: col1
physical type: binary
logical type: null
column_name: col2
physical type: binary
logical type: null
```
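   The missing annotation can be verified directly from a data file's footer; below is a minimal parquet-mr sketch (the file path argument is a placeholder for one of the table's data files):
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.PrimitiveType;

public class InspectLogicalTypes {
  public static void main(String[] args) throws Exception {
    // args[0]: path to one of the table's Parquet data files (placeholder)
    Path path = new Path(args[0]);
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
      for (ColumnDescriptor col :
          reader.getFooter().getFileMetaData().getSchema().getColumns()) {
        PrimitiveType type = col.getPrimitiveType();
        // Impala-written string columns print "logical type: null" here
        System.out.printf("column_name: %s%nphysical type: %s%nlogical type: %s%n",
            type.getName(), type.getPrimitiveTypeName(), type.getLogicalTypeAnnotation());
      }
    }
  }
}
```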
3. In Iceberg, when reading such Parquet data in Spark,
`GenericArrowVectorAccessorFactory` creates a `DictionaryBinaryAccessor` for
these columns. Since `DictionaryBinaryAccessor` does not implement
`getUTF8String`, reading the data in Spark fails, for example:
```java
spark.sql("CALL catalog.sytem.migrate('hive_table')")
spark.sql("select * from hive_table").limit(10).show()
java.lang.UnsupportedOperationException: Unsupported type: UTF8String
  at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:81)
  at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:138)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  ...
```
As a workaround, if the PARQUET_ANNOTATE_STRINGS_UTF8 query option is enabled
in Impala (version 2.6 or higher), the logical type is annotated as string and
the issue does not occur. For example:
```
column_name: col1
physical type: binary
logical type: string
column_name: col2
physical type: binary
logical type: string
```
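Since a query option only affects files written after it is set, existing files would also have to be rewritten; a sketch for impala-shell (the INSERT OVERWRITE rewrite is illustrative, not a prescribed procedure):
```sql
-- Impala 2.6+: annotate BINARY string columns with the string logical type on write
SET PARQUET_ANNOTATE_STRINGS_UTF8=true;
-- Illustrative rewrite so existing rows land in newly annotated files
INSERT OVERWRITE hive_table SELECT * FROM hive_table;
```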
Something like the following is needed as a fix in Iceberg itself:
```java
...
} else {
  switch (primitive.getPrimitiveTypeName()) {
    case FIXED_LEN_BYTE_ARRAY:
    case BINARY:
      return new DictionaryBinaryAccessor<>(
          (IntVector) vector, dictionary, stringFactorySupplier.get());
...
...
private static class DictionaryBinaryAccessor<
        DecimalT, Utf8StringT, ArrayT, ChildVectorT extends AutoCloseable>
    extends ArrowVectorAccessor<DecimalT, Utf8StringT, ArrayT, ChildVectorT> {
  private final IntVector offsetVector;
  private final Dictionary dictionary;
  private final StringFactory<Utf8StringT> stringFactory;
  private final Utf8StringT[] cache;

  DictionaryBinaryAccessor(
      IntVector vector, Dictionary dictionary, StringFactory<Utf8StringT> stringFactory) {
    super(vector);
    this.offsetVector = vector;
    this.dictionary = dictionary;
    this.stringFactory = stringFactory;
    // one cache slot per dictionary id, so each value is decoded at most once
    this.cache = genericArray(stringFactory.getGenericClass(), dictionary.getMaxId() + 1);
  }

  @Override
  public final byte[] getBinary(int rowId) {
    return dictionary.decodeToBinary(offsetVector.get(rowId)).getBytes();
  }

  @Override
  public final Utf8StringT getUTF8String(int rowId) {
    int offset = offsetVector.get(rowId);
    if (cache[offset] == null) {
      cache[offset] =
          stringFactory.ofByteBuffer(dictionary.decodeToBinary(offset).toByteBuffer());
    }
    return cache[offset];
  }
}
```
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [X] I cannot contribute a fix for this bug at this time