Hi all, I'm running a query that scans a file stored in ORC format and extracts some columns. My file is about 92 GB, uncompressed. I kept the default stripe size. The MapReduce job generates 363 map tasks.
I have noticed that the first 180 map tasks finish in about 3 seconds each, and when they complete, the HDFS_BYTES_READ counter for each one is only about 3 MB. The remaining map tasks are the ones that actually scan the data, and each of those completes in about 20 seconds. It looks like each of these map tasks gets 512 MB of the file as input. I was wondering: what exactly are the first, short map tasks doing? Thanks, Avrilia
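As a quick sanity check on the numbers reported above (this is just arithmetic over the figures in the question, not an explanation of the short tasks), the count of long-running tasks lines up almost exactly with the number of 512 MB splits you would expect from a 92 GB file:

```python
# Figures taken from the observations above.
file_size_mb = 92 * 1024    # ~92 GB file, in MB
split_size_mb = 512         # apparent input per long-running map task

expected_data_splits = file_size_mb // split_size_mb
print(expected_data_splits)  # 184

total_tasks = 363
short_tasks = 180
long_tasks = total_tasks - short_tasks
print(long_tasks)            # 183
```

So roughly 183-184 tasks do the real 512 MB scans, which matches the observed split; the open question is what the other ~180 tasks, each reading only ~3 MB, are for.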