Always use the newest version of Hive, and use ORC or Parquet wherever
possible. If you use ORC, you should explicitly enable storage indexes and
insert your data sorted on the filter column (e.g., for the query below you
would sort on x) so the per-stripe min/max indexes can actually skip data.
Additionally, you should enable statistics.
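For concreteness, here's a minimal sketch (the table and column names are
assumed from the query quoted below, and "orc.create.index" is spelled out
even though it defaults to true):

    set hive.optimize.index.filter=true;  -- have readers actually consult the ORC indexes

    create table my_table_orc
      stored as orc
      tblproperties ("orc.create.index"="true")
      as select * from my_table sort by x;  -- per-file sort on the filter column

    -- gather table- and column-level statistics for the optimizer
    analyze table my_table_orc compute statistics;
    analyze table my_table_orc compute statistics for columns;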

Compression may bring additional performance gains. If you use ORC or
Parquet, every supported compression codec remains splittable, since splits
fall on stripe/row-group boundaries rather than inside the compressed stream.
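For example, on the hypothetical ORC table above ("orc.compress" accepts
NONE, ZLIB, or SNAPPY; ZLIB is the default):

    -- newly written files pick up the codec; existing files keep theirs
    alter table my_table_orc set tblproperties ("orc.compress"="SNAPPY");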

On Thu, Aug 6, 2015 at 8:11 AM, Bill Slacum <wsla...@gmail.com> wrote:

> I was able to bring the performance in line with MR by enabling reduce-side
> vectorization, which apparently wasn't enabled in my cluster. The
> documentation on this is odd: it says ORC is required, but none of my
> tables are using ORC.
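>
> For reference, a sketch of the two settings involved (property names as of
> Hive 0.14):
>
>   set hive.vectorized.execution.enabled=true;        -- map-side vectorization
>   set hive.vectorized.execution.reduce.enabled=true; -- reduce side; Tez only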
>
> On Aug 5, 2015, at 3:48 PM, William Slacum <wsla...@gmail.com> wrote:
>
> Hi all,
>
> I'm using Hive 0.14, Tez 0.5.2, and Hadoop 2.6.0.
>
> I have a very simple query of the form `select count(*) from my_table
> where x > 0 and x < 1500`.
>
> The table has ~50 columns, not all of which are populated. My total
> dataset size is ~20TB. When I run with MapReduce, a mapper generally pulls
> through ~100k records in a few seconds, and the MR job takes about 2
> minutes in total.
>
> If all I do is set `hive.execution.engine=tez`, I get a similar number of
> Map tasks for Tez, but after 30 minutes or so they still haven't
> completed. I don't have much insight into what's going on.
>
> I have confirmed the following:
>
> 1) Usually about 10 TezChild tasks are executed on a single node.
> 2) Each one uses more than 100% CPU, but less than 150%.
> 3) When I jstack a random task, it's usually in the middle of generating a
> NumberFormatException. The full stack trace is below, but it looks like
> when an expected byte column is null or empty, LazyInteger#parseInt throws
> a NumberFormatException and LazyByte#init swallows it and sets a default
> value.
> 4) The worker logs a record count every time it reaches some power of 10.
> For the MR tasks, it rips through 100k+ records in a few seconds; Tez is
> taking 5-10 minutes for 10,000 records.
>
> My gut tells me that #3 is my issue (with #4 being a symptom), since in my
> experience continual exception creation can be a performance killer.
> However, I haven't been able to confirm that the logic for processing a row
> is actually different between Tez and MR.
>
> Anything I should check or try to tweak to get around this?
>
> Here's the stacktrace:
>
> Thread 6127: (state = IN_VM)
>
> - java.lang.Throwable.fillInStackTrace(int) @bci=0 (Compiled frame; information may be imprecise)
> - java.lang.Throwable.fillInStackTrace() @bci=16, line=783 (Compiled frame)
> - java.lang.Throwable.<init>(java.lang.String) @bci=24, line=265 (Compiled frame)
> - java.lang.Exception.<init>(java.lang.String) @bci=2, line=66 (Compiled frame)
> - java.lang.RuntimeException.<init>(java.lang.String) @bci=2, line=62 (Compiled frame)
> - java.lang.IllegalArgumentException.<init>(java.lang.String) @bci=2, line=53 (Compiled frame)
> - java.lang.NumberFormatException.<init>(java.lang.String) @bci=2, line=55 (Compiled frame)
> - org.apache.hadoop.hive.serde2.lazy.LazyInteger.parseInt(byte[], int, int, int) @bci=62, line=104 (Compiled frame)
> - org.apache.hadoop.hive.serde2.lazy.LazyByte.parseByte(byte[], int, int, int) @bci=4, line=94 (Compiled frame)
> - org.apache.hadoop.hive.serde2.lazy.LazyByte.init(org.apache.hadoop.hive.serde2.lazy.ByteArrayRef, int, int) @bci=15, line=52 (Compiled frame)
> - org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField() @bci=101, line=111 (Compiled frame)
> - org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(int) @bci=6, line=172 (Compiled frame)
> - org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.StructField) @bci=60, line=67 (Compiled frame)
> - org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(java.lang.Object) @bci=53, line=394 (Compiled frame)
> - org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(org.apache.hadoop.io.Writable, org.apache.hadoop.hive.ql.exec.mr.ExecMapperContext) @bci=16, line=137 (Compiled frame)
> - org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx, org.apache.hadoop.io.Writable, org.apache.hadoop.hive.ql.exec.mr.ExecMapperContext) @bci=3, line=100 (Compiled frame)
> - org.apache.hadoop.hive.ql.exec.MapOperator.process(org.apache.hadoop.io.Writable) @bci=57, line=492 (Compiled frame)
> - org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(java.lang.Object) @bci=20, line=83 (Compiled frame)
> - org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord() @bci=40, line=68 (Compiled frame)
> - org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run() @bci=9, line=294 (Compiled frame)
> - org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(java.util.Map, java.util.Map) @bci=224, line=163 (Interpreted frame)
> - org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(java.util.Map, java.util.Map)
