Hi Prasanth,

It turns out I was actually using hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, and that was what generated the 363 map tasks. I tried org.apache.hadoop.hive.ql.io.HiveInputFormat instead, and I was able to get 182 map tasks and get rid of the short map tasks.
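For anyone hitting the same thing, the relevant settings from this thread boil down to something like the following in the Hive CLI (orc_table is just a placeholder for the actual table name):

    -- check which input format is currently in effect
    set hive.input.format;

    -- use the ORC-aware split computation instead of CombineHiveInputFormat
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

    -- ORC predicate pushdown, as discussed earlier in the thread
    set hive.optimize.ppd=true;
    set hive.optimize.index.filter=true;

    select max(I1) from orc_table;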
Thanks for your help!
Avrilia

On Mon, Feb 10, 2014 at 2:20 PM, Prasanth Jayachandran
<pjayachand...@hortonworks.com> wrote:

> > 2) From describe extended: inputFormat:
> > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>
> OrcInputFormat can be bypassed if hive.input.format is set to
> CombineHiveInputFormat. There are two different split computation code
> paths, both of which may generate different numbers of splits and hence
> different numbers of mappers.
> If you are using the Hive CLI to run your queries, typing "set
> hive.input.format;" should tell you the input format used.
>
> Can you please report the number of mappers when using
> hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat and when
> using hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat?
>
> My suspicion is that ORC generates wrong splits because of this bug:
> https://issues.apache.org/jira/browse/HIVE-6326. I will try to reproduce
> your scenario and see if I hit a similar issue.
>
> Thanks
> Prasanth Jayachandran
>
> On Feb 10, 2014, at 1:46 PM, Avrilia Floratou <avrilia.flora...@gmail.com>
> wrote:
>
> Hi Prasanth,
> Here are the answers to your questions:
>
> 1) Yes, I have set both: set hive.optimize.ppd=true; set
> hive.optimize.index.filter=true;
> 2) From describe extended: inputFormat:
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> 3) Hive 0.12
> 4) Select max(I1) from table;
>
> Thanks,
> Avrilia
>
>
> On Mon, Feb 10, 2014 at 1:35 PM, Prasanth Jayachandran
> <pjayachand...@hortonworks.com> wrote:
>
>> Hi Avrilia
>>
>> I have a few more questions:
>>
>> 1) Have you enabled ORC predicate pushdown by setting
>> hive.optimize.index.filter?
>> 2) What is the value of hive.input.format?
>> 3) Which Hive version are you using?
>> 4) What query are you using?
>>
>> Thanks
>> Prasanth Jayachandran
>>
>> On Feb 10, 2014, at 1:26 PM, Avrilia Floratou <avrilia.flora...@gmail.com>
>> wrote:
>>
>> Hi Prasanth,
>>
>> No, it's not a partitioned table. The table consists of only one file
>> (91.7 GB). When I created the table I loaded data from a text table into
>> the ORC table and used only one map task, so that a single large file is
>> created rather than many small files. This is why I'm confused by this
>> behavior. It seems that the first 180 map tasks read a total of only 3 MB
>> (all together) and then the remaining map tasks do the actual work. Any
>> other idea on why this might be happening?
>>
>> Thanks,
>> Avrilia
>>
>>
>> On Mon, Feb 10, 2014 at 10:55 AM, Prasanth Jayachandran
>> <pjayachand...@hortonworks.com> wrote:
>>
>>> Hi Avrilia
>>>
>>> Is it a partitioned table? If so, approximately how many partitions and
>>> how many files are there? What is the value of hive.input.format?
>>>
>>> My suspicion is that there are ~180 files and each file is ~515MB in
>>> size. Since you mentioned you are using the default stripe size, i.e.,
>>> 256MB, the default HDFS block size for ORC files will be chosen as 512MB.
>>> When a query is issued, the input files are split on HDFS block
>>> boundaries. So if the file size in a partition is 515MB, there will be 2
>>> splits per file (512MB on the HDFS block boundary + the remaining 3MB).
>>> This happens when the input format is set to HiveInputFormat.
>>>
>>> Thanks
>>> Prasanth Jayachandran
>>>
>>> On Feb 10, 2014, at 12:49 AM, Avrilia Floratou
>>> <avrilia.flora...@gmail.com> wrote:
>>>
>>> > Hi all,
>>> >
>>> > I'm running a query that scans a file stored in ORC format and
>>> > extracts some columns. My file is about 92 GB, uncompressed.
>>> > I kept the default stripe size. The MapReduce job generates 363 map
>>> > tasks.
>>> >
>>> > I have noticed that the first 180 map tasks finish in 3 secs (each)
>>> > and after they complete the HDFS_BYTES_READ counter is equal to about
>>> > 3MB. Then the remaining map tasks are the ones that scan the data and
>>> > each one completes in about 20 sec. It seems that each of these map
>>> > tasks gets as input 512 MB of the file. I was wondering, what exactly
>>> > are the first short map tasks doing?
>>> >
>>> > Thanks,
>>> > Avrilia
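P.S. A rough sanity check on the numbers, assuming the single 91.7 GB file is split purely on 512 MB HDFS block boundaries as Prasanth described above:

    91.7 GB ≈ 93,900 MB
    93,900 MB / 512 MB per split ≈ 183

which is in the same ballpark as the 182 map tasks I see with HiveInputFormat, with each task reading ~512 MB. With CombineHiveInputFormat I was getting roughly double that (363 tasks), with the extra ~180 tasks reading almost nothing, which fits the suspicion that the CombineHiveInputFormat path was producing an extra, nearly empty split per block.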