Hi Prasanth,

Here are the answers to your questions:

1) Yes, I have set both:
   set hive.optimize.ppd=true;
   set hive.optimize.index.filter=true;
2) From "describe extended": inputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
3) Hive 0.12
4) select max(I1) from table;
Thanks,
Avrilia

On Mon, Feb 10, 2014 at 1:35 PM, Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:

> Hi Avrilia,
>
> I have a few more questions:
>
> 1) Have you enabled ORC predicate pushdown by setting hive.optimize.index.filter?
> 2) What is the value for hive.input.format?
> 3) Which Hive version are you using?
> 4) What query are you using?
>
> Thanks,
> Prasanth Jayachandran
>
> On Feb 10, 2014, at 1:26 PM, Avrilia Floratou <avrilia.flora...@gmail.com> wrote:
>
> Hi Prasanth,
>
> No, it's not a partitioned table. The table consists of only one file (91.7 GB). When I created the table, I loaded data from a text table into the ORC table and used only one map task, so that a single large file is created rather than many small files. This is why I'm confused by this behavior. It seems that the first 180 map tasks read a total of only 3 MB (all together), and then the remaining map tasks do the actual work. Any other idea on why this might be happening?
>
> Thanks,
> Avrilia
>
>
> On Mon, Feb 10, 2014 at 10:55 AM, Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:
>
>> Hi Avrilia,
>>
>> Is it a partitioned table? If so, approximately how many partitions and how many files are there? What is the value for hive.input.format?
>>
>> My suspicion is that there are ~180 files, each ~515 MB in size. Since you mentioned you are using the default stripe size (256 MB), the default HDFS block size for ORC files will be chosen as 512 MB. When a query is issued, the input files are split on HDFS block boundaries. So if the file size in a partition is 515 MB, there will be 2 splits per file (512 MB on an HDFS block boundary + the remaining 3 MB). This happens when the input format is set to HiveInputFormat.
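[Editorial note: Prasanth's split arithmetic above can be sketched as follows. This is an illustrative sketch, not Hive's actual split code; the 512 MB block size is the assumption stated in his reply.]

```python
# Sketch of HiveInputFormat-style splitting on HDFS block boundaries
# (assumption from the thread: ORC with default 256 MB stripes gets a
# 512 MB HDFS block size).

MB = 1024 * 1024
HDFS_BLOCK = 512 * MB

def splits_for_file(file_size, block_size=HDFS_BLOCK):
    """Cut a file into (offset, length) splits at block boundaries."""
    splits, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 515 MB file yields two splits: 512 MB of real work plus a 3 MB stub.
print(splits_for_file(515 * MB))
```

With ~180 such files, this model predicts ~360 map tasks, half of which read only ~3 MB each -- which would match the fast-finishing tasks Avrilia observed.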
>>
>> Thanks,
>> Prasanth Jayachandran
>>
>> On Feb 10, 2014, at 12:49 AM, Avrilia Floratou <avrilia.flora...@gmail.com> wrote:
>>
>> > Hi all,
>> >
>> > I'm running a query that scans a file stored in ORC format and extracts some columns. My file is about 92 GB, uncompressed. I kept the default stripe size. The MapReduce job generates 363 map tasks.
>> >
>> > I have noticed that the first 180 map tasks finish in 3 secs (each), and after they complete, the HDFS_BYTES_READ counter is equal to about 3 MB. Then the remaining map tasks are the ones that scan the data, and each one completes in about 20 sec. It seems that each of these map tasks gets 512 MB of the file as input. I was wondering: what exactly are the first short map tasks doing?
>> >
>> > Thanks,
>> > Avrilia
>>
>> --
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
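[Editorial note: the single-file scenario Avrilia describes can be checked against the same block-boundary model. A sketch, assuming the 512 MB HDFS block size from Prasanth's reply; the mismatch it shows is exactly why the 363 map tasks are puzzling.]

```python
# Sketch: expected split count for one 91.7 GB file if splits are cut
# at 512 MB HDFS block boundaries (assumption from the thread).

MB = 1024 * 1024
GB = 1024 * MB

file_size = int(91.7 * GB)
block = 512 * MB

full_blocks, remainder = divmod(file_size, block)
expected_splits = full_blocks + (1 if remainder else 0)
print(expected_splits)  # 184
```

Under that model, a single 91.7 GB file should produce about 184 splits, far from the 363 map tasks observed, so block-boundary splitting of one file alone does not explain the numbers reported in the thread.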