Re: processing large dataset

Russell Jurney Thu, 22 Jan 2015 23:05:47 -0800

Often when this happens to me, the real cause is an exception thrown while
parsing a few malformed records. This is easy to miss, because the error
messages aren't always informative. I would blame Spark, when in reality the
problem was missing fields in a CSV file.
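
A minimal sketch of how I would surface those bad records instead of letting
them throw deep inside a job. The field count (EXPECTED_FIELDS) and the rule
that an empty field counts as "missing" are assumptions for illustration, not
something from the original job:

```python
import csv

# Hypothetical column count; adjust to the real schema.
EXPECTED_FIELDS = 3


def parse_line(line, expected_fields=EXPECTED_FIELDS):
    """Tag a CSV line as ("ok", fields) or ("bad", original_line).

    Tagging instead of raising keeps one malformed record from
    killing a long-running job, and keeps the bad input inspectable.
    """
    try:
        fields = next(csv.reader([line]))
    except (csv.Error, StopIteration):
        return ("bad", line)
    # Assumption: a wrong field count or an empty field means "missing".
    if len(fields) != expected_fields or any(f == "" for f in fields):
        return ("bad", line)
    return ("ok", fields)
```

In PySpark this could be used roughly as
sc.textFile(path).map(parse_line), followed by
filter(lambda t: t[0] == "bad").take(10) to look at a few offenders before
running the full job.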


As has been said, make a file with a few records and see if your job works.
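
One way to make that small file, as a sketch in plain Python. The function
name write_sample and the default of 1000 lines are my own choices here, and
it assumes one record per line:

```python
from itertools import islice


def write_sample(src_path, dst_path, n=1000):
    """Copy the first n lines of src_path to dst_path.

    Returns the number of lines actually written, which is less
    than n when the source file is shorter.
    """
    written = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # islice stops after n lines without reading the rest of the file.
        for line in islice(src, n):
            dst.write(line)
            written += 1
    return written
```

Point the job at the sample file first; if it fails there too, the problem is
more likely in the parsing than in the cluster configuration.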

On Thursday, January 22, 2015, Jörn Franke <jornfra...@gmail.com> wrote:

> Did you try it with a smaller subset of the data first?
> On 23 Jan 2015 05:54, "Kane Kim" <kane.ist...@gmail.com> wrote:
>
>> I'm trying to process 5 TB of data, not doing anything fancy, just
>> map/filter and reduceByKey. I spent the whole day today trying to get it
>> processed, but never succeeded. I've tried deploying to EC2 with the
>> script provided with Spark on pretty beefy machines (100 r3.2xlarge
>> nodes). I'm really frustrated that Spark doesn't work out of the box for
>> anything bigger than the word count sample. One big problem is that the
>> defaults are not suitable for processing big datasets; the provided EC2
>> script could do a better job, since it knows the instance type requested.
>> Second, it takes hours to figure out what is wrong when a Spark job fails
>> after it has almost finished processing. Even after raising all limits as
>> per https://spark.apache.org/docs/latest/tuning.html it still fails (now
>> with: error communicating with MapOutputTracker).
>>
>> In the end I have only one question: how do I tune Spark for
>> processing terabytes of data, and is there a way to make this
>> configuration easier and more transparent?
>>
>> Thanks.

-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
