Re: Spark on AWS
Hi,

I agree with Steve: just start with vanilla Spark on EMR. For dynamic allocation of executors, see point #4 here:

https://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin

Note that with dynamic allocation it takes a bit of time for jobs to start running. Alternatively, you can configure EMR clusters at launch so that they allocate the maximum possible resources to executors from the start, using maximizeResourceAllocation as mentioned here:

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html

In case you are trying to load a lot of data onto the Spark master node for graphing or exploratory analysis using matplotlib, seaborn or bokeh, it's better to increase the driver memory by recreating the Spark context.

Regards,
Gourav Sengupta
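For reference, a minimal sketch of applying maximizeResourceAllocation at cluster launch with the AWS CLI, following the EMR docs linked above; the release label, key pair, instance type and count are placeholder values, and the driver-memory figure is an arbitrary example:

# Hedged sketch: create an EMR cluster with Spark and let EMR size the
# executors to the hardware via maximizeResourceAllocation.
aws emr create-cluster \
  --release-label emr-4.7.1 \
  --applications Name=Spark \
  --ec2-attributes KeyName=my-key-pair \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations '[{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"}}]'

# For heavy exploratory analysis on the driver, it is simpler to set the
# driver memory at launch than to grow it afterwards:
pyspark --driver-memory 8g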
Re: Spark on AWS
Hi, here we made several optimizations for accessing S3 from Spark:

https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando

such as:

https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133

You can deploy our Spark package using our Docker image quite simply:

docker run -d --net=host \
  -e START_MASTER="true" \
  -e START_WORKER="true" \
  -e START_WEBAPP="true" \
  -e START_NOTEBOOK="true" \
  registry.opensource.zalan.do/bi/spark:1.6.2-6

A Jupyter notebook will be running on the exposed port.

Have fun!

Best,
Teng
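A hedged sketch of pointing such a deployment at S3 via the s3a connector; the spark.hadoop.* prefix and fs.s3a.* properties are standard Spark/Hadoop settings, while the bucket, path, and credential variables are placeholders, not anything specific to the Zalando image:

# Hypothetical example: pass S3 credentials through to the Hadoop layer,
# then read with an s3a:// URL from the Spark shell.
spark-shell \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY"
# then, inside the shell:
#   sc.textFile("s3a://my-bucket/some/path").count()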
Re: Spark on AWS
On 28 Apr 2016, at 22:59, Alexander Pivovarov wrote:

> Spark works well with S3 (read and write). However it's recommended to set spark.speculation true (it's expected that some tasks fail if you read a large S3 folder, so speculation should help)

I must disagree.

1. Speculative execution has more than one executor running the query, with whoever finishes first winning.
2. However, "finishes first" is implemented in the output committer by renaming the attempt's output directory to the final output directory: whoever renames first wins.
3. This relies on rename() being implemented in the filesystem client as an atomic transaction.
4. Unfortunately, S3 doesn't do renames. Instead, every file gets copied to one with the new name, then the old file is deleted; an operation that takes time O(data * files).

If you have more than one executor trying to commit the work simultaneously, your output will be a mess of both executions, without anything detecting and reporting it.

Where did you find this recommendation to set speculation=true?

-Steve

See also: https://issues.apache.org/jira/browse/SPARK-10063
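As a concrete illustration of the rename behaviour Steve describes, a hedged sketch with the AWS CLI (bucket and paths are placeholders): S3 exposes no rename, so a "move" is implemented as a copy followed by a delete, object by object:

# "Renaming" a directory on S3 copies every object and then deletes the
# originals, which is why committing output by rename costs O(data * files).
aws s3 mv s3://my-bucket/output/_temporary/attempt_0/ \
          s3://my-bucket/output/ --recursive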
Re: Spark on AWS
Thanks for the responses.

Fatma
Re: Spark on AWS
I have set up a small development cluster using t2.micro machines and an Amazon Linux AMI (CentOS 6.x). The whole setup was done manually, without using the provided scripts, and consists of a total of 5 instances: the first machine has an Elastic IP and is used as a bridge to access the other 4 machines (they don't have Elastic IPs). The second machine runs a standalone single-node Spark cluster (1 master, 1 worker). The other 3 machines are configured as an Apache Cassandra cluster. I have tuned the JVM and lots of parameters. I use neither S3 nor HDFS; I just write data using Spark Streaming (from an Apache Flume sink) to the 3 Cassandra nodes, which I then use for data retrieval. Data is then processed through regular Spark jobs.

The jobs are submitted to the cluster at scheduled intervals using LinkedIn Azkaban, executing custom shell scripts I wrote to wrap the submission process and handle any command line arguments. Results are written directly to other Cassandra tables, or to a specific folder on the filesystem in regular CSV format.

The system is completely autonomous and requires little to no manual administration. I'm quite satisfied with it, considering how small and limited the machines involved are, but it required lots of tuning work, because we are clearly under the recommended requirements. 4 of the 5 machines are switched off during the night; only the bridge machine is alive 24/7.

$12 per month in total.

Renato Perini.
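A hedged sketch of the kind of wrapper script described above; the master URL, main class and jar path are hypothetical placeholders, not details from Renato's actual setup:

#!/usr/bin/env bash
# Hypothetical spark-submit wrapper of the sort Azkaban could invoke
# at scheduled intervals.
set -euo pipefail

SPARK_MASTER="spark://spark-node:7077"
MAIN_CLASS="com.example.AnalyticsJob"
APP_JAR="/opt/jobs/analytics-job.jar"

# Pass any command line arguments from Azkaban straight through to the job.
exec spark-submit \
  --master "$SPARK_MASTER" \
  --class "$MAIN_CLASS" \
  "$APP_JAR" "$@"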
Re: Spark on AWS
Fatma, the easiest way to create a Spark cluster on AWS is to create an EMR cluster and select the Spark application (the latest EMR includes Spark 1.6.1).

Spark works well with S3 (read and write). However, it's recommended to set spark.speculation to true (it's expected that some tasks will fail if you read a large S3 folder, so speculation should help).
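For what it's worth, a hedged sketch of that setting at submit time; note that Steve Loughran's reply earlier in this archive disputes the recommendation for jobs that write to S3. The jar and class names are placeholders:

# Enable speculative execution for a job (disputed above for S3 output):
spark-submit \
  --conf spark.speculation=true \
  --class com.example.MyJob \
  my-job.jar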
Spark on AWS
What is your experience using Spark on AWS? Are you setting up your own Spark cluster, and using HDFS? Or are you using Spark as a service from AWS? In the latter case, what is your experience of using S3 directly, without having HDFS in between?

Thanks,
Fatma
spark.eventLog.enabled not working on spark on AWS EC2
I have been trying to record event logging for a standalone application submitted to Spark on AWS EC2. However, the application keeps failing when trying to write the event logs. I tried various logging directories by setting spark.eventLog.dir, but it does not work. I tried the following directories (I made sure all the directories were created and had the right permissions):

hdfs:///spark_logs
/root/spark_logs
hdfs://:8020/spark_logs

Nothing seems to work. Can you give me some advice on why it is not working?

Zhen
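For reference, a hedged sketch of the two standard Spark properties involved; the namenode host and application details below are placeholders (the host in the message above is elided):

# In conf/spark-defaults.conf on the machine that launches the driver:
#   spark.eventLog.enabled  true
#   spark.eventLog.dir      hdfs://namenode-host:8020/spark_logs
# or per application at submit time:
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://namenode-host:8020/spark_logs \
  --class com.example.MyApp \
  my-app.jar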