> another case of a query hangin' in v2.1.0.
I'm not sure that's a hang. If you can repro this, can you please get a jstack
while it is "hanging" (i.e. jstack the hiveserver2 or cli process)?
I have a theory that you're hitting a slow path in HDFS remote reads, based on
the following stack trace.
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:700)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.hadoop.io.SequenceFile$Reader.readBlock(SequenceFile.java:2101)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2508)
at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:484)
Notice that it is firing off a 4-byte HDFS read call without buffering - this
is probably because compression is usually the natural buffering mode for
SequenceFiles. Uncompressed data might be triggering a 4-byte remote read
directly, which would be an extremely slow way to read data out of HDFS.
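To see why that hurts, here's an illustrative sketch (not the actual
DFSInputStream/SequenceFile code - the class names, counter, and 64 KB buffer
size are all made up for the demo) counting how many reads hit the underlying
stream when readInt() is called with and without buffering. Each underlying
read stands in for a potentially remote HDFS round trip.

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Illustrative only: count reads against the underlying stream.
public class ReadIntCost {
    static class CountingStream extends ByteArrayInputStream {
        int calls = 0;
        CountingStream(byte[] buf) { super(buf); }
        @Override public synchronized int read() {
            calls++;                       // one stand-in "round trip"
            return super.read();
        }
        @Override public synchronized int read(byte[] b, int off, int len) {
            calls++;                       // one stand-in "round trip"
            return super.read(b, off, len);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 20];   // 1 MB of stand-in file bytes

        // Unbuffered: DataInputStream.readInt() issues four 1-byte reads.
        CountingStream raw = new CountingStream(data);
        DataInputStream unbuffered = new DataInputStream(raw);
        for (int i = 0; i < 1000; i++) unbuffered.readInt();
        System.out.println("unbuffered reads: " + raw.calls);  // 4000

        // Buffered: a single 64 KB fill serves all 1000 readInt() calls.
        CountingStream raw2 = new CountingStream(data);
        DataInputStream buffered = new DataInputStream(
            new BufferedInputStream(raw2, 64 * 1024));
        for (int i = 0; i < 1000; i++) buffered.readInt();
        System.out.println("buffered reads: " + raw2.calls);   // 1
    }
}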
> * so empty result expected.
The empty result is the worst-case scenario for the FetchTask optimization,
because it means the CLI tool deserializes every single row in a single thread.
ORC, which has internal indexes, is somewhat safer against that.
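A quick way to tell whether a query got fetch-converted is EXPLAIN - a
converted query has no map/reduce stage, just a Fetch Operator in Stage-0
(table name and plan output below are abbreviated and illustrative):

explain select * from mytable;
-- a fetch-converted plan looks roughly like:
--   Stage: Stage-0
--     Fetch Operator
--       limit: -1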
> set hive.fetch.task.conversion=none;
> but not sure it's the right thing to set globally just yet.
No, it's not - the right approach is to tune the size threshold for that
optimization instead:

hive.fetch.task.conversion.threshold

Setting that to <= 1 GB can be a win, while setting it to -1 (no size limit)
can cause a lot of pain.
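For example (values are illustrative - pick a threshold that fits your data
sizes):

set hive.fetch.task.conversion=more;
set hive.fetch.task.conversion.threshold=1073741824; -- convert only under ~1 GB of input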
Cheers,
Gopal