Hey Jörn, thanks for the response! Unfortunately I'm stuck on the version
I'm currently on. We do plan on moving to ORC at some point.

I need to dig more into how vectorized execution is implemented. The
documentation (
https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution)
mentions ORC, but I don't quite understand the requirement, unless all
data has to be stored in ORC (even intermediate data between some map
work and some reduce work).
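
For reference, here are the two settings as I understand them (reduce-side
vectorization is the one that wasn't enabled on my cluster):

    set hive.vectorized.execution.enabled=true;
    set hive.vectorized.execution.reduce.enabled=true;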

Thanks,
Bill

On Thu, Aug 6, 2015 at 2:05 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Always use the newest version of Hive. You should use ORC or Parquet
> wherever possible. If you use ORC, you should explicitly enable storage
> indexes and insert your table sorted (e.g., for the query below you would
> sort on x). Additionally, you should enable statistics.
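
If I'm reading that right, the setup would look roughly like this
(my_table_orc is a name I made up, x is the filter column from my query,
and the settings are my best reading of the docs, so correct me if I've
mangled anything):

    -- write side: build the ORC storage indexes (on by default, shown explicitly)
    create table my_table_orc stored as orc
      tblproperties ('orc.create.index'='true')
    as select * from my_table sort by x;

    -- read side: let queries actually use those indexes
    set hive.optimize.index.filter=true;

    -- gather table- and column-level statistics
    analyze table my_table_orc compute statistics;
    analyze table my_table_orc compute statistics for columns;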
>
> Compression may bring additional performance gains. If you use ORC or
> Parquet, the files stay splittable regardless of the compression algorithm.
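
(And if I follow, with ORC the codec is just another table property at
create time, e.g.:

    create table my_table_orc stored as orc
      tblproperties ('orc.create.index'='true', 'orc.compress'='SNAPPY')
    as select * from my_table sort by x;

where ZLIB is the default if nothing is set.)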
>
> On Thu, Aug 6, 2015 at 8:11 AM, Bill Slacum <wsla...@gmail.com> wrote:
>
>> I was able to bring the performance in line with MR by enabling reduce
>> side vectorization, which apparently wasn't enabled in my cluster. The
>> documentation regarding this is odd as it says ORC is required, but none of
>> my tables are using ORC.
>>
>>
>>
>> On Aug 5, 2015, at 3:48 PM, William Slacum <wsla...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I'm using Hive 0.14, Tez 0.5.2, and Hadoop 2.6.0.
>>
>> I have a very simple query of the form `select count(*) from my_table
>> where x > 0 and x < 1500`.
>>
>> The table has ~50 columns in it and not all are populated. My total
>> dataset size is ~20TB. When I run with MapReduce, I can generally see a
>> mapper pull through ~100k records in a few seconds. The MR job, in total,
>> takes about 2 minutes.
>>
>> If all I do is set `hive.execution.engine=tez`, I end up with a similar
>> number of map tasks, but after 30 minutes or so they still haven't
>> completed. I don't have much insight into what's going on.
>>
>> I have confirmed the following:
>>
>> 1) Usually about 10 TezChild tasks are executed on a single node.
>> 2) Each one is using greater than 100% CPU, but less than 150% CPU.
>> 3) When I jstack a random task, it's usually generating a
>> NumberFormatException. The stack trace is available below, but it looks
>> like when a column expected to hold a byte is null or empty,
>> LazyInteger#parse throws a NumberFormatException and LazyByte#init
>> swallows it and sets a default value.
>> 4) The worker will log a record count every time it reaches some power
>> of 10. The MR tasks rip through 100k+ records in a few seconds; Tez is
>> taking 5-10 minutes per 10,000 records.
>>
>> My gut tells me that #3 is my issue (with #4 being a symptom), since in
>> my experience continual exception creation can be a performance killer.
>> However, I haven't been able to confirm that the logic for processing a row
>> is actually different between Tez and MR.
>>
>> Anything I should check or try to tweak to get around this?
>>
>> Here's the stacktrace:
>>
>> Thread 6127: (state = IN_VM)
>>
>> - java.lang.Throwable.fillInStackTrace(int) @bci=0 (Compiled frame; information may be imprecise)
>> - java.lang.Throwable.fillInStackTrace() @bci=16, line=783 (Compiled frame)
>> - java.lang.Throwable.<init>(java.lang.String) @bci=24, line=265 (Compiled frame)
>> - java.lang.Exception.<init>(java.lang.String) @bci=2, line=66 (Compiled frame)
>> - java.lang.RuntimeException.<init>(java.lang.String) @bci=2, line=62 (Compiled frame)
>> - java.lang.IllegalArgumentException.<init>(java.lang.String) @bci=2, line=53 (Compiled frame)
>> - java.lang.NumberFormatException.<init>(java.lang.String) @bci=2, line=55 (Compiled frame)
>> - org.apache.hadoop.hive.serde2.lazy.LazyInteger.parseInt(byte[], int, int, int) @bci=62, line=104 (Compiled frame)
>> - org.apache.hadoop.hive.serde2.lazy.LazyByte.parseByte(byte[], int, int, int) @bci=4, line=94 (Compiled frame)
>> - org.apache.hadoop.hive.serde2.lazy.LazyByte.init(org.apache.hadoop.hive.serde2.lazy.ByteArrayRef, int, int) @bci=15, line=52 (Compiled frame)
>> - org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField() @bci=101, line=111 (Compiled frame)
>> - org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(int) @bci=6, line=172 (Compiled frame)
>> - org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.StructField) @bci=60, line=67 (Compiled frame)
>> - org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(java.lang.Object) @bci=53, line=394 (Compiled frame)
>> - org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(org.apache.hadoop.io.Writable, org.apache.hadoop.hive.ql.exec.mr.ExecMapperContext) @bci=16, line=137 (Compiled frame)
>> - org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx, org.apache.hadoop.io.Writable, org.apache.hadoop.hive.ql.exec.mr.ExecMapperContext) @bci=3, line=100 (Compiled frame)
>> - org.apache.hadoop.hive.ql.exec.MapOperator.process(org.apache.hadoop.io.Writable) @bci=57, line=492 (Compiled frame)
>> - org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(java.lang.Object) @bci=20, line=83 (Compiled frame)
>> - org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord() @bci=40, line=68 (Compiled frame)
>> - org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run() @bci=9, line=294 (Compiled frame)
>> - org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(java.util.Map, java.util.Map) @bci=224, line=163 (Interpreted frame)
>> - org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(java.util.Map, java.util.Map)
>>
