Re: processing large dataset

2015-01-23 Thread Sean Owen
This is a how-long-is-a-piece-of-string question: there is no single tuning for 'terabytes of data'. You can easily run a Spark job that processes hundreds of terabytes with no problem on defaults if the work is trivial, like counting. You can also create Spark jobs that will never complete -- trying

processing large dataset

2015-01-22 Thread Kane Kim
I'm trying to process 5TB of data, not doing anything fancy, just map/filter and reduceByKey. I spent the whole day today trying to get it processed, but never succeeded. I tried to deploy to EC2 with the script provided with Spark, on pretty beefy machines (100 r3.2xlarge nodes). Really frustrated
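The post names the job's shape (map, filter, reduceByKey) but shows no code. As a point of reference, here is a minimal pure-Python sketch of that shape; the comma-separated input format, the key/value split, and counting as the reduce function are all assumptions, not details from the post:

```python
from collections import defaultdict

def map_filter_reduce_by_key(lines):
    """Pure-Python sketch of a map/filter/reduceByKey pipeline
    (the actual Spark code from the post is not shown)."""
    # map: split each line into a (key, value) pair on the first comma
    pairs = (line.split(",", 1) for line in lines)
    # filter: drop records that did not split into two fields
    pairs = (p for p in pairs if len(p) == 2)
    # reduceByKey: combine values per key (here, by counting occurrences)
    counts = defaultdict(int)
    for key, _value in pairs:
        counts[key] += 1
    return dict(counts)
```

For example, `map_filter_reduce_by_key(["a,1", "b,2", "a,3", "badline"])` keeps the three well-formed pairs and yields `{"a": 2, "b": 1}`.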

Re: processing large dataset

2015-01-22 Thread Jörn Franke
Did you try it with a smaller subset of the data first? On 23 Jan 2015 at 05:54, Kane Kim kane.ist...@gmail.com wrote: I'm trying to process 5TB of data, not doing anything fancy, just map/filter and reduceByKey. Spent whole day today trying to get it processed, but never succeeded. I've

Re: processing large dataset

2015-01-22 Thread Russell Jurney
Often when this happens to me, it is actually an exception thrown while parsing a few messages. It's easy to miss this, as the error messages aren't always informative. I'd be blaming Spark, when in reality the problem was missing fields in a CSV file. As has been said, make a file with a few records and see if your job
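The failure mode described here, a handful of malformed CSV records killing an otherwise fine job, can be guarded against by validating rows and counting the rejects instead of throwing. A minimal sketch; the expected field count of 3 is an assumption for illustration:

```python
import csv

def parse_records(lines, expected_fields=3):
    """Sketch of defensive parsing: keep well-formed rows and
    count malformed ones, rather than letting one bad record
    raise an exception that fails the whole job."""
    good, bad = [], 0
    for row in csv.reader(lines):
        if len(row) == expected_fields:
            good.append(row)
        else:
            bad += 1  # e.g. a line with missing fields
    return good, bad
```

Logging or accumulating the `bad` count (in Spark, an accumulator would do) makes the "silent" data-quality problem visible on the small test file before the full run.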