Our platform has a 40GB raw data file that was compressed with LZO (12GB
compressed) to reduce network IO to and from S3.
Without indexing, the file is unsplittable, resulting in 1 map task and poor
cluster utilisation.
After indexing the file to make it splittable, the Hive query produces 120 map
tasks.
However, with the 120 tasks distributed over a small 4-node cluster, it
takes longer to process the data than when the file wasn't splittable and
processing was done by a single node (1h20min vs 17min). This was with a
fairly simple SELECT ... FROM ... WHERE query, without DISTINCT, GROUP BY or
ORDER BY.
I'd like to utilise all nodes in the cluster to reduce query time. What's
the best way to have the data crunched in parallel but with fewer mappers?
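
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes the 120 splits are roughly equal in size and that 1GB means 1024^3 bytes; the helper names are made up for illustration and are not part of Hive or Hadoop.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2

def avg_split_bytes(total_bytes, num_splits):
    """Average bytes handled per map task, assuming equal-sized splits."""
    return total_bytes / num_splits

def splits_for_size(total_bytes, split_bytes):
    """Number of splits (map tasks) needed if each split holds split_bytes."""
    return math.ceil(total_bytes / split_bytes)

compressed = 12 * GIB  # size of the LZO-compressed file

# Current situation: 120 map tasks over the 12GB compressed file.
print(round(avg_split_bytes(compressed, 120) / MIB, 1), "MiB per mapper")

# Hypothetical target: 1GiB of compressed data per mapper.
print(splits_for_size(compressed, 1 * GIB), "mappers")
```

So each of the 120 mappers currently handles only about 100MB of compressed input, and something like 1GB per mapper would bring the count down to around a dozen, which still gives every one of the 4 nodes several tasks.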
