Re: using MultipleOutputFormat to ensure one output file per key

2014-11-25 Thread Rafal Kwasny
Hi, Arpan Ghosh wrote: Hi, How can I implement a custom MultipleOutputFormat and specify it as the output format of my Spark job, so that I can ensure there is a unique output file per key (instead of a unique output file per reducer)? I use something like this: class KeyBasedOutput[T :
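The snippet above is cut off by the archive; a minimal sketch of the usual pattern (overriding Hadoop's MultipleTextOutputFormat) might look like the following. The class body and the usage line are illustrative assumptions, not the poster's actual code:

```scala
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Sketch: write one output file per key by deriving the file name from the key.
class KeyBasedOutput[K, V] extends MultipleTextOutputFormat[K, V] {
  // Route each record to a file named after its key.
  override def generateFileNameForKeyValue(key: K, value: V, name: String): String =
    key.toString

  // Drop the key from the written record so only the value appears in the file.
  override def generateActualKey(key: K, value: V): K =
    null.asInstanceOf[K]
}

// Hypothetical usage with the old Hadoop API (paths and types are assumptions):
// rdd.saveAsHadoopFile("s3n://bucket/out", classOf[String], classOf[String],
//   classOf[KeyBasedOutput[String, String]])
```

One caveat with this approach: every record for a given key must end up in the same file, so keys with many records can produce large, skewed output files.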

Re: Spark output to s3 extremely slow

2014-10-15 Thread Rafal Kwasny
Hi, How large is the dataset you're saving to S3? Saving to S3 actually happens in two steps: 1) writing temporary files 2) committing them to the proper directory Step 2) can be slow because S3 does not have a quick atomic move operation; you have to copy (server side, but it still takes time) and then
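The two-step commit described above comes from Hadoop's default FileOutputCommitter: tasks write under a `_temporary/` prefix and the commit "renames" files into place, which on S3 is a server-side copy plus delete that scales with data size. A hedged, 2014-era workaround was to plug in a committer that writes directly to the final location; the committer class name below is a hypothetical placeholder, not a class shipped with stock Hadoop:

```scala
import org.apache.spark.SparkConf

// Sketch: spark.hadoop.* properties are forwarded into the Hadoop Configuration,
// so the output committer used by the old mapred API can be swapped out here.
val conf = new SparkConf()
  .set("spark.hadoop.mapred.output.committer.class",
       "com.example.DirectOutputCommitter") // hypothetical direct committer
```

Direct committers trade the slow copy for weaker failure semantics (partial output can be visible if a job dies mid-write), so this is only safe when downstream readers can tolerate that.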

Re: S3 Bucket Access

2014-10-14 Thread Rafal Kwasny
Hi, keep in mind that you're going to have a bad time if your secret key contains a / This is due to an old and stupid Hadoop bug: https://issues.apache.org/jira/browse/HADOOP-3733 The best way is to regenerate the key so it does not include a / /Raf Akhil Das wrote: Try the following: 1. Set the
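For context, HADOOP-3733 bites when the credentials are embedded in the URL itself (the `s3n://ACCESS_KEY:SECRET_KEY@bucket/path` form), because a `/` in the secret breaks the URI parsing. Besides regenerating the key, a sketch of the workaround is to pass the keys through the Hadoop configuration instead of the URL; the bucket, path, and key values below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-access"))

// Supply s3n credentials via configuration so they never appear in the URI.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// The path now contains no credentials, so a "/" in the secret cannot break it.
val lines = sc.textFile("s3n://bucket/path")
```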

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-06 Thread Rafal Kwasny
Hi, This will work nicely unless you're using spot instances; in that case the start does not work because slaves are lost on shutdown. I feel like the spark-ec2 script needs a major refactor to cope with new features and more users using it in dynamic environments. Are there any current plans to migrate it to