Hi all,
I'm running a query that scans a file stored in ORC format and extracts
some columns. My file is about 92 GB, uncompressed. I kept the default
stripe size. The MapReduce job generates 363 map tasks.
I have noticed that the first 180 map tasks finish in 3 secs (each) and
after they
Hi Avrilia
Is it a partitioned table? If so, approximately how many partitions and how
many files are there? What is the value of hive.input.format?
My suspicion is that there are ~180 files and each file is ~515MB in size.
Since you mentioned you are using the default stripe size
Hi Prasanth,
No, it's not a partitioned table. The table consists of only one file
(91.7 GB). When I created the table, I loaded data from a text table into the
ORC table and used only 1 map task, so that one large file is created
rather than many small files. This is why I'm getting confused with
Hi Avrilia
I have a few more questions:
1) Have you enabled ORC predicate pushdown by setting
hive.optimize.index.filter?
2) What is the value for hive.input.format?
3) Which hive version are you using?
4) What query are you using?
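
For questions 1 and 2, one way to read the current values is to issue `set` with only the property name in the Hive CLI, which prints the value in effect for the session. This is a minimal sketch and the output will depend on your configuration:

```sql
-- With no "=value" part, `set` prints the current value of the property.
set hive.optimize.index.filter;
set hive.optimize.ppd;
set hive.input.format;
```

Note that hive.input.format is a session/configuration property. It is separate from the inputFormat that `describe extended` reports for the table.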
Thanks
Prasanth Jayachandran
On Feb 10, 2014, at 1:26 PM,
Hi Prasanth,
Here are the answers to your questions:
1) Yes, I have set both: set hive.optimize.ppd=true; and set
hive.optimize.index.filter=true;
2) From describe extended: inputFormat:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
3) Hive 0.12
4) Select max (I1) from table;
Thanks,
Avrilia
On
2) From describe extended: inputFormat:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OrcInputFormat can be bypassed if hive.input.format is set to
CombineHiveInputFormat. There are two different split-computation code paths,
which may generate different numbers of splits and hence
Hi Prasanth,
It seems that I was actually using the
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat and
that was generating 363 map tasks. I tried
org.apache.hadoop.hive.ql.io.HiveInputFormat and I was actually able to get
182 map tasks and get rid of the short map
Great to hear!
Thanks
Prasanth Jayachandran
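
For readers landing on this thread, the change described above can be sketched as the following session settings. The property names and class names are the ones quoted in this thread. The table name `orc_table` is a hypothetical placeholder, and the resulting number of map tasks will depend on your file and stripe layout:

```sql
-- Switch from CombineHiveInputFormat to HiveInputFormat. Per the thread,
-- CombineHiveInputFormat can bypass OrcInputFormat's split computation.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Predicate pushdown settings as reported earlier in the thread.
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;

-- orc_table is a hypothetical name; I1 is the column from the original query.
select max(I1) from orc_table;
```

In the case reported here, this change took the job from 363 map tasks to 182. Other tables may not see the same ratio.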
On Feb 10, 2014, at 2:50 PM, Avrilia Floratou avrilia.flora...@gmail.com
wrote:
Hi Prasanth,
It seems that I was actually using the
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat and
that was generating 363 map tasks. I