https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

There are quite a lot of knobs to tune for Hive on Spark.

The above page recommends the following settings:

> mapreduce.input.fileinputformat.split.maxsize=750000000
> hive.vectorized.execution.enabled=true
> hive.cbo.enable=true
> hive.optimize.reducededuplication.min.reducer=4
> hive.optimize.reducededuplication=true
> hive.orc.splits.include.file.footer=false
> hive.merge.mapfiles=true
> hive.merge.sparkfiles=false
> hive.merge.smallfiles.avgsize=16000000
> hive.merge.size.per.task=256000000
> hive.merge.orcfile.stripe.level=true
> hive.auto.convert.join=true
> hive.auto.convert.join.noconditionaltask=true
> hive.auto.convert.join.noconditionaltask.size=894435328
> hive.optimize.bucketmapjoin.sortedmerge=false
> hive.map.aggr.hash.percentmemory=0.5
> hive.map.aggr=true
> hive.optimize.sort.dynamic.partition=false
> hive.stats.autogather=true
> hive.stats.fetch.column.stats=true
> hive.vectorized.execution.reduce.enabled=false
> hive.vectorized.groupby.checkinterval=4096
> hive.vectorized.groupby.flush.percent=0.1
> hive.compute.query.using.stats=true
> hive.limit.pushdown.memory.usage=0.4
> hive.optimize.index.filter=true
> hive.exec.reducers.bytes.per.reducer=67108864
> hive.smbjoin.cache.rows=10000
> hive.exec.orc.default.stripe.size=67108864
> hive.fetch.task.conversion=more
> hive.fetch.task.conversion.threshold=1073741824
> hive.fetch.task.aggr=false
> mapreduce.input.fileinputformat.list-status.num-threads=5
> spark.kryo.referenceTracking=false
>
> spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch
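For anyone trying these out, a minimal sketch of how such properties are typically applied: the `hive.*` and `mapreduce.*` keys can be set per-session in Beeline / the Hive CLI, or placed in hive-site.xml to apply cluster-wide. The values below are copied from the list above; the selection shown is just illustrative.

```sql
-- Per-session overrides in Beeline / Hive CLI (values from the wiki page).
SET hive.execution.engine=spark;
SET hive.vectorized.execution.enabled=true;
SET hive.cbo.enable=true;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=894435328;
SET hive.exec.reducers.bytes.per.reducer=67108864;
-- The remaining hive.* / mapreduce.* properties from the list can be set
-- the same way, or put in hive-site.xml.
```

The `spark.*` properties (Kryo settings etc.) can, as far as I know, also be set from the Hive session the same way when running Hive on Spark, or configured in spark-defaults.conf.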


Did these settings work for everybody? It could take days, if not weeks, to
tune all of these parameters for a specific job.

We're on Spark 1.5 / Hive 1.1.


ps. We have a job that we can't get to work well as a plain Hive job, so we
thought to try Hive on Spark instead: a 3-table full outer join with GROUP BY
+ collect_list. Spark should handle this much better.
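For context, the shape of the query is roughly the following. Table and column names here are hypothetical (the actual schema isn't given in this thread); note the COALESCE on the grouping key, since with full outer joins any side's key can be NULL:

```sql
-- Hypothetical schema: t1/t2/t3 each with a join key k and a value v.
SELECT COALESCE(a.k, b.k, c.k)      AS k,
       collect_list(b.v)            AS b_vals,
       collect_list(c.v)            AS c_vals
FROM t1 a
FULL OUTER JOIN t2 b ON a.k = b.k
FULL OUTER JOIN t3 c ON COALESCE(a.k, b.k) = c.k
GROUP BY COALESCE(a.k, b.k, c.k);
```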


Ruslan
