Often when this happens to me, the real cause is an exception thrown while parsing a few of the input messages. It's easy to miss, because the error messages aren't always informative. I would be blaming Spark, when in reality the problem was missing fields in a CSV file.
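A minimal sketch of the kind of check that catches this, in plain Python so it can be run on a small sample before touching the cluster. The names parse_line, split_records and EXPECTED_FIELDS are hypothetical, and the three-column layout is only an assumption for illustration; adjust it to the real schema.

```python
import csv

# Assumption for illustration: every record should have exactly 3 columns.
EXPECTED_FIELDS = 3


def parse_line(line, expected=EXPECTED_FIELDS):
    """Return ("ok", fields) for a well-formed line, ("bad", line) otherwise."""
    try:
        # csv.reader handles quoted fields that contain commas.
        fields = next(csv.reader([line]))
    except (csv.Error, StopIteration):
        return ("bad", line)
    if len(fields) != expected:
        # Missing or extra fields: keep the raw line so it can be inspected.
        return ("bad", line)
    return ("ok", fields)


def split_records(lines):
    """Split an iterable of raw lines into parsed records and rejected lines."""
    good, bad = [], []
    for line in lines:
        status, value = parse_line(line)
        (good if status == "ok" else bad).append(value)
    return good, bad


# A tiny sample file's worth of lines; the second one is missing a field.
sample = ["a,1,x", "b,2", "c,3,z"]
good, bad = split_records(sample)
print(len(good), "good records,", len(bad), "bad records")
```

Since parse_line never raises on a malformed line, the same function could presumably be passed to an RDD map in a PySpark job, and the "bad" results counted or saved separately instead of letting one broken record fail a task late in the run.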
As has been said, make a file with a few records and see if your job works.

On Thursday, January 22, 2015, Jörn Franke <jornfra...@gmail.com> wrote:

> Did you try it with a smaller subset of the data first?
>
> On Jan 23, 2015 05:54, "Kane Kim" <kane.ist...@gmail.com> wrote:
>
>> I'm trying to process 5TB of data, not doing anything fancy, just
>> map/filter and reduceByKey. Spent whole day today trying to get it
>> processed, but never succeeded. I've tried to deploy to ec2 with the
>> script provided with spark on pretty beefy machines (100 r3.2xlarge
>> nodes). Really frustrated that spark doesn't work out of the box for
>> anything bigger than word count sample. One big problem is that
>> defaults are not suitable for processing big datasets, provided ec2
>> script could do a better job, knowing instance type requested. Second
>> it takes hours to figure out what is wrong, when spark job fails
>> almost finished processing. Even after raising all limits as per
>> https://spark.apache.org/docs/latest/tuning.html it still fails (now
>> with: error communicating with MapOutputTracker).
>>
>> After all I have only one question - how to get spark tuned up for
>> processing terabytes of data and is there a way to make this
>> configuration easier and more transparent?
>>
>> Thanks.

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com