RE: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-20 Thread Shuai Zheng
Thanks! Let me update the status. I have copied the DirectOutputCommitter to my local project and set: conf.set("spark.hadoop.mapred.output.committer.class", "org..DirectOutputCommitter") It works perfectly. Thanks everyone :) Regards, Shuai From: Aaron Davidson
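For reference, the same setting can also live in spark-defaults.conf instead of being set in code. The class name below is only a placeholder, since the committer's package is elided in the message above:

```
spark.hadoop.mapred.output.committer.class   com.example.DirectOutputCommitter
```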

Re: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-17 Thread Imran Rashid
I'm not super familiar w/ S3, but I think the issue is that you want to use a different output committer with object stores, which don't have a cheap move/rename operation. There have been a few other threads on S3 output committers. I think the most relevant for you is probably this open JIRA:

Re: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-17 Thread Aaron Davidson
Actually, this is the more relevant JIRA (which is resolved): https://issues.apache.org/jira/browse/SPARK-3595. SPARK-6352 is about saveAsParquetFile, which is not in use here. Here is a DirectOutputCommitter implementation: https://gist.github.com/aarondav/c513916e72101bbe14ec and it can be
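The actual implementation is in Aaron's gist linked above. As a rough, self-contained illustration of the underlying idea (this is not the Hadoop OutputCommitter API, and all names and paths below are invented for the demo), the difference between the default staged commit and a "direct" commit can be sketched like this:

```scala
import java.nio.file.{Files, Path}

// Toy model: the default FileOutputCommitter-style flow stages task output
// under a _temporary directory and then moves it into place; a direct
// committer writes straight to the final location, skipping the move that
// is cheap on HDFS but a full copy+delete on S3.
object CommitDemo {
  // Staged flow: write under _temporary, then move to the final path.
  def stagedCommit(outDir: Path, name: String, data: String): Path = {
    val tmp = outDir.resolve("_temporary")
    Files.createDirectories(tmp)
    val staged = Files.write(tmp.resolve(name), data.getBytes("UTF-8"))
    val finalPath = outDir.resolve(name)
    Files.move(staged, finalPath) // this "rename" is the slow step on S3
    finalPath
  }

  // Direct flow: no staging directory, no move.
  def directCommit(outDir: Path, name: String, data: String): Path = {
    Files.createDirectories(outDir)
    Files.write(outDir.resolve(name), data.getBytes("UTF-8"))
  }
}
```

The trade-off, as discussed in SPARK-3595, is that a direct committer gives up the atomicity that the staging/rename step provides for failed or speculative tasks.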

Re: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-16 Thread Akhil Das
If you use fileStream, there's an option to filter out files. In your case you can easily create a filter to exclude _temporary files. You will then have to move your code inside foreachRDD of the DStream, since the application will become a streaming app. Thanks Best Regards On Sat, Mar
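As a sketch of that filter idea (the helper name below is mine, not from the thread): `fileStream` accepts a `Path => Boolean` filter argument, so in a streaming app the predicate would be wrapped as something like `(p: Path) => isVisibleDataFile(p.getName)`. The name-based logic itself is just:

```scala
// Hypothetical helper: treat files whose names start with "_" or "." as
// staging/hidden artifacts (e.g. _temporary, _SUCCESS, ._copying_) and
// skip them. This mirrors the convention Hadoop's hidden-file filter uses.
def isVisibleDataFile(name: String): Boolean =
  !name.startsWith("_") && !name.startsWith(".")
```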

Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-13 Thread Shuai Zheng
Hi All, I am trying to run a sort on an r3.2xlarge instance on AWS, running it as a single-node cluster for testing. The data I sort is around 4 GB and sits on S3; the output will also go to S3. I just connect spark-shell to the local cluster and run the code in the script (because I just

RE: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-13 Thread Shuai Zheng
And one thing I forgot to mention: even though I get this exception and the result is not well formatted in my target folder (part of the files are there; the rest are under a different folder structure inside the _temporary folder), the step is still marked as successful in the spark-shell web UI. I think this is a bug?