Hi all, I'm running a query that scans a file stored in ORC format and extracts some columns. My file is about 92 GB, uncompressed. I kept the default stripe size. The MapReduce job generates 363 map tasks.
I have noticed that the first 180 map tasks finish in about 3 seconds each, and when they complete, the HDFS_BYTES_READ counter for each one is only about 3 MB. The remaining map tasks are the ones that actually scan the data, and each of those completes in about 20 seconds. It looks like each of these map tasks gets 512 MB of the file as input. I was wondering: what exactly are the first, short map tasks doing? Thanks, Avrilia
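As a quick sanity check on the numbers reported above (this is just arithmetic over the figures in the question, not an explanation of the short tasks), the count of long-running tasks lines up almost exactly with the number of 512 MB splits you would expect from a 92 GB file:

```python
# Figures taken from the observations above.
file_size_mb = 92 * 1024    # ~92 GB file, in MB
split_size_mb = 512         # apparent input per long-running map task

expected_data_splits = file_size_mb // split_size_mb
print(expected_data_splits)  # 184

total_tasks = 363
short_tasks = 180
long_tasks = total_tasks - short_tasks
print(long_tasks)            # 183
```

So roughly 183-184 tasks do the real 512 MB scans, which matches the observed split; the open question is what the other ~180 tasks, each reading only ~3 MB, are for.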