I'm trying to process 5TB of data, nothing fancy, just map/filter and reduceByKey. I spent the whole day today trying to get it processed but never succeeded. I deployed to EC2 with the spark-ec2 script that ships with Spark, on pretty beefy machines (100 r3.2xlarge nodes). I'm really frustrated that Spark doesn't work out of the box for anything bigger than the word count sample. One big problem is that the defaults are not suitable for processing big datasets; the provided EC2 script could do a better job, since it knows which instance type was requested. Second, it takes hours to figure out what went wrong when a Spark job fails after almost finishing processing. Even after raising all the limits as per https://spark.apache.org/docs/latest/tuning.html it still fails (now with: error communicating with MapOutputTracker).
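For context, here is roughly the shape of the job and the kind of settings I've been experimenting with. The input path, record layout, and config values below are illustrative placeholders, not my real ones; I picked the numbers as guesses for r3.2xlarge (8 cores, 61 GB) after reading the tuning guide.

import org.apache.spark.{SparkConf, SparkContext}

object BigAggregation {
  def main(args: Array[String]): Unit = {
    // Illustrative settings, not recommendations; values are guesses for 100 x r3.2xlarge.
    val conf = new SparkConf()
      .setAppName("5tb-map-filter-reduceByKey")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.executor.memory", "20g")
      .set("spark.default.parallelism", "4000")   // several partitions per core across the cluster
      .set("spark.akka.frameSize", "128")         // raised after reading the tuning guide
    val sc = new SparkContext(conf)

    // Placeholder path and field layout; the real data lives elsewhere.
    val result = sc.textFile("s3n://example-bucket/input/")
      .map(line => line.split('\t'))
      .filter(fields => fields.length > 1 && fields(1).nonEmpty)
      .map(fields => (fields(0), 1L))
      .reduceByKey(_ + _, 4000)   // explicit partition count for the shuffle

    result.saveAsTextFile("s3n://example-bucket/output/")
    sc.stop()
  }
}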
In the end I have only one question: how do I get Spark tuned for processing terabytes of data, and is there a way to make this configuration easier and more transparent? Thanks.