Hey Jörn, thanks for the response! Unfortunately I'm kinda stuck on the version I'm on. We do plan on moving to ORC at some point.
I need to dig more into how vectorized execution is implemented. The documentation (https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution) mentions ORC, but I guess I don't quite understand the requirement, unless all data is stored in ORC (even intermediate data between some map work and some reduce work).

Thanks,
Bill

On Thu, Aug 6, 2015 at 2:05 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Always use the newest version of Hive. You should use ORC or Parquet
> wherever possible. If you use ORC, you should explicitly enable storage
> indexes and insert your table sorted (e.g., for the query below you would
> sort on x). Additionally, you should enable statistics.
>
> Compression may bring additional performance gains. If you use ORC or
> Parquet, then all compression algorithms are splittable.
>
> On Thu, Aug 6, 2015 at 8:11 AM, Bill Slacum <wsla...@gmail.com> wrote:
>
>> I was able to bring the performance in line with MR by enabling
>> reduce-side vectorization, which apparently wasn't enabled in my cluster.
>> The documentation regarding this is odd, as it says ORC is required, but
>> none of my tables are using ORC.
>>
>> On Aug 5, 2015, at 3:48 PM, William Slacum <wsla...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I'm using Hive 0.14, Tez 0.5.2, and Hadoop 2.6.0.
>>
>> I have a very simple query of the form `select count(*) from my_table
>> where x > 0 and x < 1500`.
>>
>> The table has ~50 columns in it and not all are populated. My total
>> dataset size is ~20TB. When I run with MapReduce, I can generally see a
>> mapper pull through ~100k records in a few seconds. The MR job, in total,
>> takes about 2 minutes.
>>
>> If all I do is set `hive.execution.engine=tez`, I end up getting a
>> similar number of map tasks for Tez, but after 30 minutes or so they
>> aren't completed. I don't have much insight into what's going on.
>>
>> I have confirmed the following:
>>
>> 1) Usually about 10 TezChild tasks are executed on a single node.
>> 2) Each one is using greater than 100% CPU, but less than 150% CPU.
>> 3) When I jstack a random task, it's usually generating a
>> NumberFormatException. The stack trace is below, but it looks like when
>> an expected byte column is null or empty, LazyInteger#parse throws a
>> NumberFormatException and LazyByte#init swallows it and sets some
>> default value.
>> 4) The worker logs a record count every time it reaches some power of
>> 10. The MR tasks rip through 100k+ records in a few seconds; Tez is
>> taking 5-10 minutes for 10,000 records.
>>
>> My gut tells me that #3 is my issue (with #4 being a symptom), since in
>> my experience continual exception creation can be a performance killer.
>> However, I haven't been able to confirm that the row-processing logic is
>> actually different between Tez and MR.
>>
>> Anything I should check or try to tweak to get around this?
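[For readers hitting the same symptom: the reduce-side vectorization switch mentioned at the top of the thread can be set per session. A minimal sketch, assuming Hive 0.14-era property names; verify them against your cluster's configuration before relying on this:]

```sql
-- Hedged sketch: session-level vectorization switches in Hive ~0.14.
-- Property names are assumptions based on that era's configuration docs.
SET hive.execution.engine=tez;                      -- run on Tez instead of MR
SET hive.vectorized.execution.enabled=true;         -- map-side vectorization
SET hive.vectorized.execution.reduce.enabled=true;  -- reduce-side vectorization

-- The query from the thread:
SELECT count(*) FROM my_table WHERE x > 0 AND x < 1500;
```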
>>
>> Here's the stack trace:
>>
>> Thread 6127: (state = IN_VM)
>>  - java.lang.Throwable.fillInStackTrace(int) @bci=0 (Compiled frame; information may be imprecise)
>>  - java.lang.Throwable.fillInStackTrace() @bci=16, line=783 (Compiled frame)
>>  - java.lang.Throwable.<init>(java.lang.String) @bci=24, line=265 (Compiled frame)
>>  - java.lang.Exception.<init>(java.lang.String) @bci=2, line=66 (Compiled frame)
>>  - java.lang.RuntimeException.<init>(java.lang.String) @bci=2, line=62 (Compiled frame)
>>  - java.lang.IllegalArgumentException.<init>(java.lang.String) @bci=2, line=53 (Compiled frame)
>>  - java.lang.NumberFormatException.<init>(java.lang.String) @bci=2, line=55 (Compiled frame)
>>  - org.apache.hadoop.hive.serde2.lazy.LazyInteger.parseInt(byte[], int, int, int) @bci=62, line=104 (Compiled frame)
>>  - org.apache.hadoop.hive.serde2.lazy.LazyByte.parseByte(byte[], int, int, int) @bci=4, line=94 (Compiled frame)
>>  - org.apache.hadoop.hive.serde2.lazy.LazyByte.init(org.apache.hadoop.hive.serde2.lazy.ByteArrayRef, int, int) @bci=15, line=52 (Compiled frame)
>>  - org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField() @bci=101, line=111 (Compiled frame)
>>  - org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(int) @bci=6, line=172 (Compiled frame)
>>  - org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(java.lang.Object, org.apache.hadoop.hive.serde2.objectinspector.StructField) @bci=60, line=67 (Compiled frame)
>>  - org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(java.lang.Object) @bci=53, line=394 (Compiled frame)
>>  - org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(org.apache.hadoop.io.Writable, org.apache.hadoop.hive.ql.exec.mr.ExecMapperContext) @bci=16, line=137 (Compiled frame)
>>  - org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx, org.apache.hadoop.io.Writable, org.apache.hadoop.hive.ql.exec.mr.ExecMapperContext) @bci=3, line=100 (Compiled frame)
>>  - org.apache.hadoop.hive.ql.exec.MapOperator.process(org.apache.hadoop.io.Writable) @bci=57, line=492 (Compiled frame)
>>  - org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(java.lang.Object) @bci=20, line=83 (Compiled frame)
>>  - org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord() @bci=40, line=68 (Compiled frame)
>>  - org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run() @bci=9, line=294 (Compiled frame)
>>  - org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(java.util.Map, java.util.Map) @bci=224, line=163 (Interpreted frame)
>>  - org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(java.util.Map, java.util.Map)
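[Jörn's advice earlier in the thread — store the table as ORC, insert it sorted on the filter column so the storage indexes can prune, and enable statistics — can be sketched roughly as follows. The table and column names come from the query in the thread; the property names and DDL are assumptions based on Hive ~0.14-era documentation, not a tested recipe:]

```sql
-- Hedged sketch: rebuild the table as ORC, sorted per-reducer on x so
-- ORC's built-in min/max storage indexes can skip stripes for range
-- predicates on x.
CREATE TABLE my_table_orc STORED AS ORC AS
SELECT * FROM my_table SORT BY x;

-- Assumed property: push the filter down to the ORC row-group indexes.
SET hive.optimize.index.filter=true;

-- Enable statistics, as suggested above.
ANALYZE TABLE my_table_orc COMPUTE STATISTICS;
ANALYZE TABLE my_table_orc COMPUTE STATISTICS FOR COLUMNS;

-- With sorted data, stripes outside 0 < x < 1500 can be skipped.
SELECT count(*) FROM my_table_orc WHERE x > 0 AND x < 1500;
```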