ORC file question

2014-02-10 Thread Avrilia Floratou
Hi all, I'm running a query that scans a file stored in ORC format and extracts some columns. My file is about 92 GB, uncompressed. I kept the default stripe size. The MapReduce job generates 363 map tasks. I have noticed that the first 180 map tasks finish in 3 secs (each) and after they

Re: ORC file question

2014-02-10 Thread Prasanth Jayachandran
Hi Avrilia, is it a partitioned table? If so, approximately how many partitions and how many files are there? What is the value of hive.input.format? My suspicion is that there are ~180 files and each file is ~515 MB in size. Since you mentioned that you are using the default stripe size

Re: ORC file question

2014-02-10 Thread Avrilia Floratou
Hi Prasanth, No, it's not a partitioned table. The table consists of only one file (91.7 GB). When I created the table I loaded data from a text table into the ORC table and used only 1 map task, so that one large file is created rather than many small files. This is why I'm getting confused with

Re: ORC file question

2014-02-10 Thread Prasanth Jayachandran
Hi Avrilia, I have a few more questions: 1) Have you enabled ORC predicate pushdown by setting hive.optimize.index.filter? 2) What is the value of hive.input.format? 3) Which Hive version are you using? 4) What query are you using? Thanks, Prasanth Jayachandran On Feb 10, 2014, at 1:26 PM,

Re: ORC file question

2014-02-10 Thread Avrilia Floratou
Hi Prasanth, Here are the answers to your questions: 1) Yes, I have set both: set hive.optimize.ppd=true; set hive.optimize.index.filter=true; 2) From describe extended: inputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 3) Hive 0.12 4) Select max (I1) from table; Thanks, Avrilia On

Re: ORC file question

2014-02-10 Thread Prasanth Jayachandran
"2) From describe extended: inputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat" OrcInputFormat can be bypassed if hive.input.format is set to CombineHiveInputFormat. There are two different split computation code paths, both of which may generate a different number of splits and hence
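
The distinction drawn here is between the table-level inputFormat shown by describe extended and the session-level hive.input.format property. A minimal sketch of how the session value could be inspected from the Hive CLI, assuming the usual behaviour that `set` with a property name and no value prints the current setting:

```sql
-- Print the session-level input format. This is separate from the
-- table's inputFormat reported by "describe extended".
set hive.input.format;
```

If this reports CombineHiveInputFormat, splits are being computed by the combine code path rather than by OrcInputFormat itself.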

Re: ORC file question

2014-02-10 Thread Avrilia Floratou
Hi Prasanth, It seems that I was actually using hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat and that was generating 363 map tasks. I tried org.apache.hadoop.hive.ql.io.HiveInputFormat and I was actually able to get 182 map tasks and get rid of the short map
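
The change described in this message could be reproduced in a Hive session roughly as follows. This is only a sketch: the table name `my_orc_table` is a hypothetical placeholder (the thread does not give the real name), and the resulting number of map tasks will depend on the file layout and Hive version, so the 363 and 182 figures above should not be expected elsewhere.

```sql
-- Switch this session from CombineHiveInputFormat to the plain HiveInputFormat,
-- so that split computation is delegated to the table's own input format.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Settings reported earlier in the thread.
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;

-- Query from the thread; my_orc_table is a hypothetical table name.
select max(I1) from my_orc_table;
```

Setting the property per session, rather than in hive-site.xml, keeps the experiment local and makes it easy to compare map task counts between the two input formats.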

Re: ORC file question

2014-02-10 Thread Prasanth Jayachandran
Great to hear! Thanks Prasanth Jayachandran On Feb 10, 2014, at 2:50 PM, Avrilia Floratou avrilia.flora...@gmail.com wrote: Hi Prasanth, It seems that I was actually using the hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat and that was generating 363 map tasks. I