I'm trying to process 5 TB of data with nothing fancy, just map/filter
and reduceByKey. I spent the whole day today trying to get it processed
and never succeeded. I deployed to EC2 with the script that ships with
Spark, on pretty beefy machines (100 r3.2xlarge nodes). It's really
frustrating that Spark doesn't work out of the box for anything bigger
than the word-count sample. One big problem is that the defaults are not
suitable for processing big datasets; the provided EC2 script could do a
better job here, since it knows which instance type was requested. A
second problem is that it takes hours to figure out what is wrong when a
Spark job fails after it has almost finished processing. Even after
raising all the limits described in
https://spark.apache.org/docs/latest/tuning.html it still fails, now
with an error communicating with MapOutputTracker.
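
For context, the job is essentially the following (a simplified sketch;
the bucket names and the field layout are placeholders, not the real
ones):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object Aggregate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("5TB-aggregation"))

    sc.textFile("s3n://my-bucket/input/*")        // ~5 TB of text input (placeholder path)
      .map(_.split('\t'))                         // parse tab-separated records
      .filter(_.length > 1)                       // drop malformed lines
      .map(fields => (fields(0), 1L))             // key by the first column
      .reduceByKey(_ + _)                         // aggregate per key
      .saveAsTextFile("s3n://my-bucket/output")   // placeholder output path

    sc.stop()
  }
}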

In the end I have only one question: how do I tune Spark for processing
terabytes of data, and is there a way to make this configuration easier
and more transparent?
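
To make it concrete, this is the kind of thing I keep guessing at when
building the SparkConf (or the equivalent entries in
conf/spark-defaults.conf). The values below are illustrative guesses,
not settings I know to be right for 100 r3.2xlarge nodes:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("5TB-aggregation")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster, more compact serialization
  .set("spark.default.parallelism", "1600")  // roughly 2 partitions per core across 100 nodes; a guess
  .set("spark.akka.frameSize", "128")        // MB; the default of 10 seems small once map output status grows
  .set("spark.akka.timeout", "300")          // seconds; raised in the hope of surviving long pauses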

Thanks.
