Re: Spark on AWS

2016-05-02 Thread Gourav Sengupta
Hi,

I agree with Steve: just start by using vanilla Spark on EMR.

See point #4 here for dynamic allocation of executors:
https://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin

Note that with dynamic allocation of executors, jobs take a bit of time to
start running. Therefore you can pass another setting when starting the EMR
cluster so that it allocates the maximum possible resources to the executors
from the start, using maximizeResourceAllocation as mentioned here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html
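
For concreteness, here is a minimal boto3 sketch of launching such a cluster
with that setting baked in; the instance types, count, region, release label
and cluster name below are placeholder values, not recommendations:

import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="spark-cluster",              # hypothetical cluster name
    ReleaseLabel="emr-4.6.0",
    Applications=[{"Name": "Spark"}],
    # This classification turns on maximizeResourceAllocation so executors
    # get the largest possible allocation as the cluster starts.
    Configurations=[{
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"},
    }],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)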

In case you are trying to load a lot of data onto the Spark master node for
graphing or exploratory analysis using matplotlib, seaborn or bokeh, it is
better to increase the driver memory by recreating the Spark context.
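
A minimal PySpark sketch of that (the 8g figure is an arbitrary example;
note that spark.driver.memory only takes effect when the driver JVM starts,
so some notebook setups need a full interpreter restart instead):

from pyspark import SparkConf, SparkContext

sc.stop()  # 'sc' is the notebook's existing context; stop it first

conf = SparkConf().set("spark.driver.memory", "8g")  # example value
sc = SparkContext(conf=conf)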


Regards
Gourav Sengupta



On Mon, May 2, 2016 at 12:54 AM, Teng Qiu  wrote:

> Hi, here we made several optimizations for accessing S3 from Spark:
>
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>
> such as:
>
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133
>
> you can deploy our Spark package using our Docker image; simply run:
>
> docker run -d --net=host \
>-e START_MASTER="true" \
>-e START_WORKER="true" \
>-e START_WEBAPP="true" \
>-e START_NOTEBOOK="true" \
>registry.opensource.zalan.do/bi/spark:1.6.2-6
>
>
> a Jupyter notebook will be running on port 
>
>
> have fun
>
> Best,
>
> Teng
>
> 2016-04-29 12:37 GMT+02:00 Steve Loughran :
> >
> > On 28 Apr 2016, at 22:59, Alexander Pivovarov wrote:
> >
> > Spark works well with S3 (read and write). However it's recommended to
> > set spark.speculation true (it's expected that some tasks fail if you
> > read a large S3 folder, so speculation should help)
> >
> >
> >
> > I must disagree.
> >
> > Speculative execution has >1 executor running the query, with whoever
> > finishes first winning.
> > However, "finishes first" is implemented in the output committer, by
> > renaming the attempt's output directory to the final output directory:
> > whoever renames first wins.
> > This relies on rename() being implemented in the filesystem client as an
> > atomic transaction.
> > Unfortunately, S3 doesn't do renames. Instead, every file gets copied to
> > the new name, then the old file is deleted; an operation that takes time
> > O(data * files).
> >
> > If you have more than one executor trying to commit the work
> > simultaneously, your output will be a mess of both executions, without
> > anything detecting and reporting it.
> >
> > Where did you find this recommendation to set speculation=true?
> >
> > -Steve
> >
> > see also: https://issues.apache.org/jira/browse/SPARK-10063
>


Re: Spark on AWS

2016-05-01 Thread Teng Qiu
Hi, here we made several optimizations for accessing S3 from Spark:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando

such as:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133

you can deploy our Spark package using our Docker image; simply run:

docker run -d --net=host \
   -e START_MASTER="true" \
   -e START_WORKER="true" \
   -e START_WEBAPP="true" \
   -e START_NOTEBOOK="true" \
   registry.opensource.zalan.do/bi/spark:1.6.2-6


a Jupyter notebook will be running on port 


have fun

Best,

Teng

2016-04-29 12:37 GMT+02:00 Steve Loughran :
>
> On 28 Apr 2016, at 22:59, Alexander Pivovarov  wrote:
>
> Spark works well with S3 (read and write). However it's recommended to set
> spark.speculation true (it's expected that some tasks fail if you read a
> large S3 folder, so speculation should help)
>
>
>
> I must disagree.
>
> Speculative execution has >1 executor running the query, with whoever
> finishes first winning.
> However, "finishes first" is implemented in the output committer, by
> renaming the attempt's output directory to the final output directory:
> whoever renames first wins.
> This relies on rename() being implemented in the filesystem client as an
> atomic transaction.
> Unfortunately, S3 doesn't do renames. Instead, every file gets copied to
> the new name, then the old file is deleted; an operation that takes time
> O(data * files).
>
> If you have more than one executor trying to commit the work simultaneously,
> your output will be a mess of both executions, without anything detecting
> and reporting it.
>
> Where did you find this recommendation to set speculation=true?
>
> -Steve
>
> see also: https://issues.apache.org/jira/browse/SPARK-10063




Re: Spark on AWS

2016-04-29 Thread Steve Loughran

On 28 Apr 2016, at 22:59, Alexander Pivovarov wrote:

Spark works well with S3 (read and write). However it's recommended to set
spark.speculation true (it's expected that some tasks fail if you read a large
S3 folder, so speculation should help)


I must disagree.


  1.  Speculative execution has >1 executor running the query, with whoever
finishes first winning.
  2.  However, "finishes first" is implemented in the output committer, by
renaming the attempt's output directory to the final output directory: whoever
renames first wins.
  3.  This relies on rename() being implemented in the filesystem client as an
atomic transaction.
  4.  Unfortunately, S3 doesn't do renames. Instead, every file gets copied to
the new name, then the old file is deleted; an operation that takes time
O(data * files).

If you have more than one executor trying to commit the work simultaneously,
your output will be a mess of both executions, without anything detecting and
reporting it.

Where did you find this recommendation to set speculation=true?

-Steve

see also: https://issues.apache.org/jira/browse/SPARK-10063
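
To make the failure mode concrete, here is a toy sketch (plain Python, not
Spark internals) of a rename-free commit against a dict standing in for an
object store; the attempt and file names are invented for illustration:

# Toy object store: key -> object bytes.
store = {
    "attempt_0/part-00000": b"a0 rows 0", "attempt_0/part-00001": b"a0 rows 1",
    "attempt_1/part-00000": b"a1 rows 0", "attempt_1/part-00001": b"a1 rows 1",
}

def copy_delete(store, src, dst):
    # One S3 "rename": a copy to the new key plus a delete of the old one,
    # per file -- O(data * files) work, with no atomic winner-takes-all step.
    store[dst] = store[src]
    del store[src]

# Two speculative attempts committing concurrently can interleave per file:
copy_delete(store, "attempt_0/part-00000", "final/part-00000")
copy_delete(store, "attempt_1/part-00001", "final/part-00001")
copy_delete(store, "attempt_1/part-00000", "final/part-00000")  # overwrites
copy_delete(store, "attempt_0/part-00001", "final/part-00001")  # overwrites

print(store)
# {'final/part-00000': b'a1 rows 0', 'final/part-00001': b'a0 rows 1'}
# -> a silent mix of both executions, with nothing reporting it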


Re: Spark on AWS

2016-04-28 Thread Fatma Ozcan
Thanks for the responses.
Fatma
On Apr 28, 2016 3:00 PM, "Renato Perini"  wrote:

> I have set up a small development cluster using t2.micro machines and an
> Amazon Linux AMI (CentOS 6.x).
> The whole setup has been done manually, without using the provided
> scripts. The whole setup is composed of a total of 5 instances: the first
> machine has an elastic IP and it is used as a bridge to access the other 4
> machines (they don't have elastic IPs). The second machine runs a
> standalone single node Spark cluster (1 master, 1 worker). The other 3
> machines are configured as an Apache Cassandra cluster. I have tuned the
> JVM and lots of parameters. I do not use S3 nor HDFS, I just write data
> using Spark Streaming (from an Apache Flume sink) to the 3 Cassandra nodes,
> that I use for data retrieval. Data is then processed through regular Spark
> jobs.
> The jobs are submitted to the cluster using LinkedIn Azkaban, executing
> custom shell scripts written by me for wrapping the submission process and
> handling any command line arguments, at scheduled intervals. Results
> are written directly to other Cassandra tables or in a specific folder on
> the filesystem using the regular CSV format.
> The system is completely autonomous and requires little to no manual
> administration.
>
> I'm quite satisfied with it, considering how small and limited the
> machines involved are. But it required lots of tuning work, because we are
> clearly under the recommended requirements. 4 of the 5 machines are
> switched off during the night, only the bridge machine is alive 24/7.
>
> $12 per month in total.
>
> Renato Perini.
>
>
> Il 28/04/2016 23:39, Fatma Ozcan ha scritto:
>
>> What is your experience using Spark on AWS? Are you setting up your own
>> Spark cluster, and using HDFS? Or are you using Spark as a service from
>> AWS? In the latter case, what is your experience of using S3 directly,
>> without having HDFS in between?
>>
>> Thanks,
>> Fatma
>>
>
>


Re: Spark on AWS

2016-04-28 Thread Renato Perini
I have set up a small development cluster using t2.micro machines and an
Amazon Linux AMI (CentOS 6.x).
The whole setup has been done manually, without using the provided 
scripts. The whole setup is composed of a total of 5 instances: the 
first machine has an elastic IP and it is used as a bridge to access the 
other 4 machines (they don't have elastic IPs). The second machine runs 
a standalone single node Spark cluster (1 master, 1 worker). The other 3 
machines are configured as an Apache Cassandra cluster. I have tuned the 
JVM and lots of parameters. I do not use S3 nor HDFS, I just write data 
using Spark Streaming (from an Apache Flume sink) to the 3 Cassandra 
nodes, that I use for data retrieval. Data is then processed through 
regular Spark jobs.
The jobs are submitted to the cluster using LinkedIn Azkaban, executing
custom shell scripts written by me for wrapping the submission process
and handling any command line arguments, at scheduled intervals.
Results are written directly to other Cassandra tables or in a specific 
folder on the filesystem using the regular CSV format.
The system is completely autonomous and requires little to no manual 
administration.
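
For readers wanting to see the shape of such a pipeline, a minimal PySpark
sketch follows; the Flume host/port, the "telemetry" keyspace and "events"
table are hypothetical, and it assumes the spark-cassandra-connector package
is on the classpath:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

sc = SparkContext(appName="flume-to-cassandra")
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

# Each Flume event arrives as a (headers, body) pair.
events = FlumeUtils.createStream(ssc, "localhost", 9999)

def save_batch(time, rdd):
    if not rdd.isEmpty():
        df = sqlContext.createDataFrame(rdd.map(lambda e: Row(body=e[1])))
        (df.write.format("org.apache.spark.sql.cassandra")
           .options(table="events", keyspace="telemetry")
           .mode("append").save())

events.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()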


I'm quite satisfied with it, considering how small and limited the 
machines involved are. But it required lots of tuning work, because we 
are clearly under the recommended requirements. 4 of the 5 machines are 
switched off during the night, only the bridge machine is alive 24/7.


$12 per month in total.

Renato Perini.


Il 28/04/2016 23:39, Fatma Ozcan ha scritto:
What is your experience using Spark on AWS? Are you setting up your 
own Spark cluster, and using HDFS? Or are you using Spark as a service 
from AWS? In the latter case, what is your experience of using S3 
directly, without having HDFS in between?


Thanks,
Fatma






Re: Spark on AWS

2016-04-28 Thread Alexander Pivovarov
Fatma, the easiest way to create a Spark cluster on AWS is to create an EMR
cluster and select the Spark application. (The latest EMR includes Spark 1.6.1.)

Spark works well with S3 (read and write). However it's recommended to
set spark.speculation true (it's expected that some tasks fail if you
read a large S3 folder, so speculation should help).



On Thu, Apr 28, 2016 at 2:39 PM, Fatma Ozcan  wrote:

> What is your experience using Spark on AWS? Are you setting up your own
> Spark cluster, and using HDFS? Or are you using Spark as a service from
> AWS? In the latter case, what is your experience of using S3 directly,
> without having HDFS in between?
>
> Thanks,
> Fatma
>


Spark on AWS

2016-04-28 Thread Fatma Ozcan
What is your experience using Spark on AWS? Are you setting up your own
Spark cluster, and using HDFS? Or are you using Spark as a service from
AWS? In the latter case, what is your experience of using S3 directly,
without having HDFS in between?

Thanks,
Fatma


spark.eventLog.enabled not working on spark on AWS EC2

2014-06-13 Thread zhen
I have been trying to enable event logging for a standalone application
submitted to Spark on AWS EC2. However, the application keeps failing when
trying to write to the event logs. I tried various logging directories by
setting spark.eventLog.dir, but it does not work. I tried the following
directories (I made sure all the directories were created and had the right
permissions):

hdfs:///spark_logs
/root/spark_logs
hdfs://:8020/spark_logs

Nothing seems to work.
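
For reference, a minimal sketch of how these settings are typically passed
(shown here in PySpark; a Scala app would set the same keys on its SparkConf,
and the directory is one of those listed above):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("event-log-test")                 # hypothetical app name
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "hdfs:///spark_logs"))
sc = SparkContext(conf=conf)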

Can you give me some advice on why it is not working?

Zhen


