Hi all,

Wanted to let you know of a potential bug I've run into when loading
custom JARs dynamically (i.e. "ADD JAR /path/to/jar").  Hopefully
someone can tell me if this is a bug, expected behavior, or something
I'm causing myself :)

We have a custom StorageHandler that we're updating from Hive 1.2.1 to
Hive 3.0.0.  During testing we found that under some circumstances,
queries to tables backed by our StorageHandler would return result
sets with 'NULL' in each cell.  Digging in, we found that our SerDe's
deserialize() method was returning null after a failed "instanceof"
sanity check on the input Writable.  Debugging a bit, we found that
both "instanceof" operands had the same fully qualified class name,
but had been loaded by two different UDFClassLoader instances, so the
JVM treats them as distinct types.  This behavior seems
suspiciously like what was warned against in an early comment on
HIVE-11878 when UDFClassLoader was introduced, so I'm 99% sure it is
unintended. (see:
https://issues.apache.org/jira/browse/HIVE-11878?focusedCommentId=14876858&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14876858)
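
For anyone who wants to see the underlying JVM behavior outside of
Hive, here is a minimal sketch.  It is not Hive code: HelloWritable
and IsolatingLoader are made-up stand-ins for our Writable and for
UDFClassLoader, and the sketch assumes the compiled class file is
readable as a resource from the parent class loader.  It loads the
same class bytes through two separate loaders and compares the
results.

```java
import java.io.IOException;
import java.io.InputStream;

public class ClassLoaderIdentityDemo {

    /** Hypothetical stand-in for a Writable shipped in the added JAR. */
    public static class HelloWritable {}

    /**
     * Hypothetical stand-in for a per-session class loader: it defines the
     * target class itself instead of delegating to its parent.
     */
    static class IsolatingLoader extends ClassLoader {
        private final String target;

        IsolatingLoader(ClassLoader parent, String target) {
            super(parent);
            this.target = target;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            if (!name.equals(target)) {
                // Everything else (java.lang.Object etc.) uses normal delegation.
                return super.loadClass(name, resolve);
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String resource = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(resource)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                return in.readAllBytes();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads HelloWritable through a brand-new IsolatingLoader. */
    public static Class<?> loadIsolated() throws ClassNotFoundException {
        String name = HelloWritable.class.getName();
        ClassLoader parent = ClassLoaderIdentityDemo.class.getClassLoader();
        return new IsolatingLoader(parent, name).loadClass(name);
    }

    public static void main(String[] args) throws Exception {
        Class<?> a = loadIsolated();
        Class<?> b = loadIsolated();
        Object fromB = b.getDeclaredConstructor().newInstance();
        System.out.println("same name:       " + a.getName().equals(b.getName()));
        System.out.println("same Class:      " + (a == b));
        System.out.println("a.isInstance(b): " + a.isInstance(fromB));
    }
}
```

The names match, but the two Class objects are different and the
isInstance() check (the reflective equivalent of "instanceof") fails,
which is the same symptom we see in our deserialize() method.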

The behavior is reproducible with the following steps:

1. Find a custom StorageHandler to use.  I wrote a stub StorageHandler
here (https://github.com/gerlowskija/hive-bug-serde/) which reproduces
the issue.
2. Create a table using the StorageHandler:

     hive -n $hive_user -p $hive_pass -e "ADD JAR /tmp/mycustomserde.jar; CREATE EXTERNAL TABLE my_ext_table (hello_col STRING, world_col STRING) STORED BY 'com.helloworld.serde.HelloWorldStorageHandler' LOCATION '/tmp/some_dir';"

3. Put some data in your external table:

     hive -n $hive_user -p $hive_pass -e "ADD JAR /tmp/mycustomserde.jar; INSERT INTO my_ext_table VALUES ('hello', 'world');"

4. Query your external table:

     hive -n $hive_user -p $hive_pass -e "ADD JAR /tmp/mycustomserde.jar; SELECT * FROM my_ext_table;"

Depending on the custom SerDe you're using, the bug might manifest
itself differently.  But most SerDes, which cast the "Writable" arg
to a specific Writable implementation in their deserialize() method,
will print a table full of 'NULL' values.  (The provided stub
StorageHandler shows the bug this way.  It also logs the "instanceof"
operands out to hiveserver2.log, making the behavior clearer:
"Received unexpected Writable class.  Expected
com.helloworld.serde.HelloWorldWritable from classloader
org.apache.hadoop.hive.ql.exec.UDFClassLoader@489d24e9, but actually
was com.helloworld.serde.HelloWorldWritable from classloader
org.apache.hadoop.hive.ql.exec.UDFClassLoader@75517e2b").
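
To make the shape of that check concrete, here is a rough sketch of
the pattern.  It is not the code from the stub repo and it uses no
Hive APIs: WritableCheckSketch, castOrNull and describeMismatch are
hypothetical names, and plain Object stands in for Writable.  The
point is only that the check returns null on a mismatch and that the
log message includes each class's loader, which is what exposes the
problem.

```java
public class WritableCheckSketch {

    /**
     * Hypothetical helper: builds a diagnostic that names both classes and
     * the class loader each one came from.
     */
    public static String describeMismatch(Class<?> expected, Object actual) {
        if (actual == null) {
            return "Received null.  Expected " + expected.getName()
                    + " from classloader " + expected.getClassLoader();
        }
        Class<?> actualClass = actual.getClass();
        return "Received unexpected Writable class.  Expected "
                + expected.getName()
                + " from classloader " + expected.getClassLoader()
                + ", but actually was " + actualClass.getName()
                + " from classloader " + actualClass.getClassLoader();
    }

    /**
     * Hypothetical sanity check: returns the input cast to the expected type,
     * or null (after logging) when the input is not an instance of it.
     */
    public static <T> T castOrNull(Class<T> expected, Object input) {
        if (!expected.isInstance(input)) {
            System.err.println(describeMismatch(expected, input));
            return null;
        }
        return expected.cast(input);
    }

    public static void main(String[] args) {
        // A matching type passes through; a mismatched one is logged and dropped.
        System.out.println(castOrNull(String.class, "hello"));
        System.out.println(castOrNull(String.class, Integer.valueOf(42)));
    }
}
```

Logging only the class names would have shown two identical strings,
so printing getClassLoader() alongside them is what made the mismatch
visible in our case.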

I've written the behavior and reproduction steps up in more detail
here: https://github.com/gerlowskija/hive-bug-serde/.  Please let me
know if this is a true bug in Hive as I suspect, or if there's
something I can be doing to avoid these Classloader conflicts.

Thanks,

Jason
