Hi Prasanth,

It turns out I was actually using hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, and that was what generated the 363 map tasks. I tried org.apache.hadoop.hive.ql.io.HiveInputFormat instead, and I was able to get 182 map tasks and get rid of the short map tasks.
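For anyone hitting the same thing, the relevant settings from this thread boil down to something like the following in the Hive CLI (orc_table is just a placeholder for the actual table name):

    -- check which input format is currently in effect
    set hive.input.format;

    -- use the ORC-aware split computation instead of CombineHiveInputFormat
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

    -- ORC predicate pushdown, as discussed earlier in the thread
    set hive.optimize.ppd=true;
    set hive.optimize.index.filter=true;

    select max(I1) from orc_table;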
Thanks for your help!
Avrilia

On Mon, Feb 10, 2014 at 2:20 PM, Prasanth Jayachandran
<pjayachand...@hortonworks.com> wrote:

> > 2) From describe extended: inputFormat:
> > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>
> OrcInputFormat can be bypassed if hive.input.format is set to
> CombineHiveInputFormat. There are two different split computation code
> paths, both of which may generate different numbers of splits and hence
> different numbers of mappers.
> If you are using the Hive CLI to run your queries, typing "set
> hive.input.format;" should tell you the input format used.
>
> Can you please report the number of mappers when using
> hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat and when
> using hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat?
>
> My suspicion is that ORC generates wrong splits because of this bug:
> https://issues.apache.org/jira/browse/HIVE-6326. I will try to reproduce
> your scenario and see if I hit a similar issue.
>
> Thanks
> Prasanth Jayachandran
>
> On Feb 10, 2014, at 1:46 PM, Avrilia Floratou <avrilia.flora...@gmail.com>
> wrote:
>
> Hi Prasanth,
> Here are the answers to your questions:
>
> 1) Yes, I have set both: set hive.optimize.ppd=true; set
> hive.optimize.index.filter=true;
> 2) From describe extended: inputFormat:
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> 3) Hive 0.12
> 4) Select max(I1) from table;
>
> Thanks,
> Avrilia
>
>
> On Mon, Feb 10, 2014 at 1:35 PM, Prasanth Jayachandran
> <pjayachand...@hortonworks.com> wrote:
>
>> Hi Avrilia
>>
>> I have a few more questions:
>>
>> 1) Have you enabled ORC predicate pushdown by setting
>> hive.optimize.index.filter?
>> 2) What is the value of hive.input.format?
>> 3) Which Hive version are you using?
>> 4) What query are you using?
>>
>> Thanks
>> Prasanth Jayachandran
>>
>> On Feb 10, 2014, at 1:26 PM, Avrilia Floratou <avrilia.flora...@gmail.com>
>> wrote:
>>
>> Hi Prasanth,
>>
>> No, it's not a partitioned table. The table consists of only one file
>> (91.7 GB). When I created the table I loaded data from a text table into
>> the ORC table and used only one map task, so that a single large file is
>> created rather than many small files. This is why I'm confused by this
>> behavior. It seems that the first 180 map tasks read a total of only 3 MB
>> (all together) and then the remaining map tasks do the actual work. Any
>> other idea on why this might be happening?
>>
>> Thanks,
>> Avrilia
>>
>>
>> On Mon, Feb 10, 2014 at 10:55 AM, Prasanth Jayachandran
>> <pjayachand...@hortonworks.com> wrote:
>>
>>> Hi Avrilia
>>>
>>> Is it a partitioned table? If so, approximately how many partitions and
>>> how many files are there? What is the value of hive.input.format?
>>>
>>> My suspicion is that there are ~180 files and each file is ~515MB in
>>> size. Since you mentioned you are using the default stripe size, i.e.,
>>> 256MB, the default HDFS block size for ORC files will be chosen as 512MB.
>>> When a query is issued, the input files are split on HDFS block
>>> boundaries. So if the file size in a partition is 515MB, there will be 2
>>> splits per file (512MB on the HDFS block boundary + the remaining 3MB).
>>> This happens when the input format is set to HiveInputFormat.
>>>
>>> Thanks
>>> Prasanth Jayachandran
>>>
>>> On Feb 10, 2014, at 12:49 AM, Avrilia Floratou
>>> <avrilia.flora...@gmail.com> wrote:
>>>
>>> > Hi all,
>>> >
>>> > I'm running a query that scans a file stored in ORC format and
>>> > extracts some columns. My file is about 92 GB, uncompressed.
>>> > I kept the default stripe size. The MapReduce job generates 363 map
>>> > tasks.
>>> >
>>> > I have noticed that the first 180 map tasks finish in 3 secs (each)
>>> > and after they complete the HDFS_BYTES_READ counter is equal to about
>>> > 3MB. Then the remaining map tasks are the ones that scan the data and
>>> > each one completes in about 20 sec. It seems that each of these map
>>> > tasks gets as input 512 MB of the file. I was wondering, what exactly
>>> > are the first short map tasks doing?
>>> >
>>> > Thanks,
>>> > Avrilia
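P.S. A rough sanity check on the numbers, assuming the single 91.7 GB file is split purely on 512 MB HDFS block boundaries as Prasanth described above:

    91.7 GB ≈ 93,900 MB
    93,900 MB / 512 MB per split ≈ 183

which is in the same ballpark as the 182 map tasks I see with HiveInputFormat, with each task reading ~512 MB. With CombineHiveInputFormat I was getting roughly double that (363 tasks), with the extra ~180 tasks reading almost nothing, which fits the suspicion that the CombineHiveInputFormat path was producing an extra, nearly empty split per block.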