Spark on EMR

2015-06-16 Thread kamatsuoka
Spark is now officially supported on Amazon Elastic Map Reduce:
http://aws.amazon.com/elasticmapreduce/details/spark/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EMR-tp23343.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Zeppelin + Spark on EMR

2015-09-07 Thread shahab
Hi,

I am trying to use Zeppelin to work with Spark on Amazon EMR. I used the
script provided by Anders
(https://gist.github.com/andershammar/224e1077021d0ea376dd) to set up
Zeppelin. Zeppelin can connect to Spark, but when I run the tutorials I get
the following error:

...FileNotFoundException: File
file:/home/hadoop/zeppelin/interpreter/spark/dep/zeppelin-spark-dependencies-0.6.0-incubating-SNAPSHOT.jar
does not exist

However, the above file does exist at that path on the master node.

I would appreciate it if anyone with experience setting up Zeppelin on EMR
could share how they did it.

best,
/Shahab


Yarn Spark on EMR

2015-11-15 Thread SURAJ SHETH
Hi,
The YARN UI on port 18080 stops receiving updates on Spark jobs/tasks immediately after
a job starts. We see only one task completed in the UI while the others appear not to
have received any resources, when in reality more than 5 tasks have completed.
Hadoop - Amazon 2.6
Spark - 1.5

Thanks and Regards,
Suraj Sheth


Running Spark on EMR

2017-01-15 Thread Marco Mistroni
hi all
 could anyone assist here?
I am trying to run Spark 2.0.0 on an EMR cluster, but I am having issues
connecting to the master node.
Below is a snippet of what I am doing:


sc = SparkSession.builder.master(sparkHost).appName("DataProcess").getOrCreate()

sparkHost is passed as an input parameter; the idea was that I could run the
script locally on my local Spark instance as well as submit it to any cluster
I want.


Now I have:
1 - set up a cluster on EMR
2 - connected to the master node
3 - launched the command: spark-submit myscripts.py spark://master:7077

But that results in a connection refused exception.
Then I tried to remove the .master call above and launch the script with the
following command:

spark-submit --master spark://master:7077 myscript.py

but I am still getting a connection refused exception.


I am using Spark 2.0.0. Could anyone advise on how I should build the Spark
session and how I can submit a Python script to the cluster?

kr
 marco


Re: Spark on EMR

2015-06-16 Thread ayan guha
That's great news. Can I assume spark on EMR supports kinesis to hbase
pipeline?
On 17 Jun 2015 05:29, "kamatsuoka"  wrote:

> Spark is now officially supported on Amazon Elastic Map Reduce:
> http://aws.amazon.com/elasticmapreduce/details/spark/
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EMR-tp23343.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark on EMR

2015-06-17 Thread Hideyoshi Maeda
Any ideas what version of Spark is underneath?

i.e. is it 1.4? and is SparkR supported on Amazon EMR?

On Wed, Jun 17, 2015 at 12:06 AM, ayan guha  wrote:

> That's great news. Can I assume spark on EMR supports kinesis to hbase
> pipeline?
> On 17 Jun 2015 05:29, "kamatsuoka"  wrote:
>
>> Spark is now officially supported on Amazon Elastic Map Reduce:
>> http://aws.amazon.com/elasticmapreduce/details/spark/
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EMR-tp23343.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Spark on EMR

2015-06-17 Thread Eugen Cepoi
It looks like it is a wrapper around
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
So basically adding an option -v,1.4.0.a should work.

https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html

2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda :

> Any ideas what version of Spark is underneath?
>
> i.e. is it 1.4? and is SparkR supported on Amazon EMR?
>
> On Wed, Jun 17, 2015 at 12:06 AM, ayan guha  wrote:
>
>> That's great news. Can I assume spark on EMR supports kinesis to hbase
>> pipeline?
>> On 17 Jun 2015 05:29, "kamatsuoka"  wrote:
>>
>>> Spark is now officially supported on Amazon Elastic Map Reduce:
>>> http://aws.amazon.com/elasticmapreduce/details/spark/
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EMR-tp23343.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


Re: Spark on EMR

2015-06-17 Thread Kelly, Jonathan
Yes, for now it is a wrapper around the old install-spark BA, but that will 
change soon. The currently supported version in AMI 3.8.0 is 1.3.1, as 1.4.0 
was released too late to include it in AMI 3.8.0. Spark 1.4.0 support is coming 
soon though, of course. Unfortunately, though install-spark is currently being 
used under the hood, passing "-v,1.4.0" in the options is not supported.

Sent from Nine<http://www.9folders.com/>

From: Eugen Cepoi 
Sent: Jun 17, 2015 6:37 AM
To: Hideyoshi Maeda
Cc: ayan guha;kamatsuoka;user
Subject: Re: Spark on EMR

It looks like it is a wrapper around 
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
So basically adding an option -v,1.4.0.a should work.

https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html

2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda 
mailto:hideyoshi.ma...@gmail.com>>:
Any ideas what version of Spark is underneath?

i.e. is it 1.4? and is SparkR supported on Amazon EMR?

On Wed, Jun 17, 2015 at 12:06 AM, ayan guha 
mailto:guha.a...@gmail.com>> wrote:

That's great news. Can I assume spark on EMR supports kinesis to hbase pipeline?

On 17 Jun 2015 05:29, "kamatsuoka" mailto:ken...@gmail.com>> 
wrote:
Spark is now officially supported on Amazon Elastic Map Reduce:
http://aws.amazon.com/elasticmapreduce/details/spark/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EMR-tp23343.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>
For additional commands, e-mail: 
user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>





Re: Spark on EMR

2015-06-19 Thread Bozeman, Christopher
You can use Spark 1.4 on EMR AMI 3.8.0 if you install Spark as a 3rd-party 
application using the bootstrap action directly, instead of the native Spark 1.3.1 
inclusion.  See 
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark

Refer to 
https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md
 to determine the version build to install (for example, build 1.4.0.b includes 
sparkR and all builds include Kinesis).

-Christopher


From: Kelly, Jonathan
Sent: Wednesday, June 17, 2015 6:56 AM
To: Eugen Cepoi; Hideyoshi Maeda
Cc: ayan guha; kamatsuoka; user
Subject: Re: Spark on EMR

Yes, for now it is a wrapper around the old install-spark BA, but that will 
change soon. The currently supported version in AMI 3.8.0 is 1.3.1, as 1.4.0 
was released too late to include it in AMI 3.8.0. Spark 1.4.0 support is coming 
soon though, of course. Unfortunately, though install-spark is currently being 
used under the hood, passing "-v,1.4.0" in the options is not supported.

Sent from Nine<http://www.9folders.com/>

From: Eugen Cepoi mailto:cepoi.eu...@gmail.com>>
Sent: Jun 17, 2015 6:37 AM
To: Hideyoshi Maeda
Cc: ayan guha;kamatsuoka;user
Subject: Re: Spark on EMR

It looks like it is a wrapper around 
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
So basically adding an option -v,1.4.0.a should work.

https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html

2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda 
mailto:hideyoshi.ma...@gmail.com>>:
Any ideas what version of Spark is underneath?

i.e. is it 1.4? and is SparkR supported on Amazon EMR?

On Wed, Jun 17, 2015 at 12:06 AM, ayan guha 
mailto:guha.a...@gmail.com>> wrote:

That's great news. Can I assume spark on EMR supports kinesis to hbase pipeline?
On 17 Jun 2015 05:29, "kamatsuoka" mailto:ken...@gmail.com>> 
wrote:
Spark is now officially supported on Amazon Elastic Map Reduce:
http://aws.amazon.com/elasticmapreduce/details/spark/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EMR-tp23343.html
Sent from the Apache Spark User List mailing list archive at 
Nabble.com<http://Nabble.com>.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>
For additional commands, e-mail: 
user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>





Question around spark on EMR

2016-04-05 Thread Natu Lauchande
Hi,

I am setting up a Scala Spark Streaming app on EMR. I wonder if anyone on the
list can help me with the following questions:

1. What approach have you been using to pass environment variables needed by
the Spark application to an EMR job step?

2. Can I have multiple streaming apps in EMR?

3. Is there any tool recommended for configuration management (something
like Consul)?


Thanks,
Natu
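
For question 1, one approach I have seen (a sketch only, not something verified on
your setup; the cluster id, class name, jar path, region and variable names below are
hypothetical) is to pass the variables through Spark's YARN properties when the step
is added, for example with boto3:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical step: pass environment variables to the driver (application master)
# and to the executors via spark.yarn.appMasterEnv.* and spark.executorEnv.*.
step = {
    "Name": "streaming-app",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "--conf", "spark.yarn.appMasterEnv.MY_ENV_VAR=some-value",
            "--conf", "spark.executorEnv.MY_ENV_VAR=some-value",
            "--class", "com.example.StreamingApp",
            "s3://my-bucket/jobs/streaming-app.jar",
        ],
    },
}

emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX", Steps=[step])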


Run Apache Spark on EMR

2016-04-22 Thread Jinan Alhajjaj
Hi All,

I would like to ask two things, and I would really appreciate an answer ASAP:

1. How do I implement parallelism in an Apache Spark Java application?
2. How do I run a Spark application on Amazon EMR?

RE: Yarn Spark on EMR

2015-11-20 Thread Bozeman, Christopher
Suraj,

The Spark History Server is running on port 18080 
(http://spark.apache.org/docs/latest/monitoring.html), which is not going to 
give you a real-time update on a running Spark application.   Given this is 
Spark on YARN, you will need to view the Spark UI via the Application Master 
URL, which can be found in the YARN ResourceManager UI (master node:8088), and 
it is best to use a SOCKS proxy in order to resolve the URLs nicely.

Best regards,
Christopher


From: SURAJ SHETH [mailto:shet...@gmail.com]
Sent: Sunday, November 15, 2015 8:19 AM
To: user@spark.apache.org
Subject: Yarn Spark on EMR

Hi,
Yarn UI on 18080 stops receiving updates Spark jobs/tasks immediately after it 
starts. We see only one task completed in the UI while the other hasn't got any 
resources while in reality, more than 5 tasks would have completed.
Hadoop - Amazon 2.6
Spark - 1.5

Thanks and Regards,
Suraj Sheth


Re: Running Spark on EMR

2017-01-15 Thread Neil Jonkers
Hello,

Can you drop the url:

 spark://master:7077

The url is used when running Spark in standalone mode.

Regards

 Original message From: Marco Mistroni 
 Date:15/01/2017  16:34  (GMT+02:00) 
To: User  Subject: Running Spark 
on EMR 
hi all
 could anyone assist here?
i am trying to run spark 2.0.0 on an EMR cluster,but i am having issues 
connecting to the master node
So, below is a snippet of what i am doing


sc = SparkSession.builder.master(sparkHost).appName("DataProcess").getOrCreate()

sparkHost is passed as input parameter. that was thought so that i can run the 
script locally
on my spark local instance as well as submitting scripts on any cluster i want


Now i have 
1 - setup a cluster on EMR. 
2 - connected to masternode
3  - launch the command spark-submit myscripts.py spark://master:7077

But that results in an connection refused exception
Then i have tried to remove the .master call above and launch the script with 
the following command

spark-submit --master spark://master:7077   myscript.py  but still i am getting
connectionREfused exception


I am using Spark 2.0.0 , could anyone advise on how shall i build the spark 
session and how can i submit a pythjon script to the cluster?

kr
 marco  

Re: Running Spark on EMR

2017-01-15 Thread Marco Mistroni
thanks Neil. I followed the original suggestion from Andrew and everything is
working fine now
kr

On Sun, Jan 15, 2017 at 4:27 PM, Neil Jonkers  wrote:

> Hello,
>
> Can you drop the url:
>
>  spark://master:7077
>
> The url is used when running Spark in standalone mode.
>
> Regards
>
>
>  Original message 
> From: Marco Mistroni
> Date:15/01/2017 16:34 (GMT+02:00)
> To: User
> Subject: Running Spark on EMR
>
> hi all
>  could anyone assist here?
> i am trying to run spark 2.0.0 on an EMR cluster,but i am having issues
> connecting to the master node
> So, below is a snippet of what i am doing
>
>
> sc = SparkSession.builder.master(sparkHost).appName("
> DataProcess").getOrCreate()
>
> sparkHost is passed as input parameter. that was thought so that i can run
> the script locally
> on my spark local instance as well as submitting scripts on any cluster i
> want
>
>
> Now i have
> 1 - setup a cluster on EMR.
> 2 - connected to masternode
> 3  - launch the command spark-submit myscripts.py spark://master:7077
>
> But that results in an connection refused exception
> Then i have tried to remove the .master call above and launch the script
> with the following command
>
> spark-submit --master spark://master:7077   myscript.py  but still i am
> getting
> connectionREfused exception
>
>
> I am using Spark 2.0.0 , could anyone advise on how shall i build the
> spark session and how can i submit a pythjon script to the cluster?
>
> kr
>  marco
>


Re: Running Spark on EMR

2017-01-15 Thread Andrew Holway
Darn. I didn't respond to the list. Sorry.



On Sun, Jan 15, 2017 at 5:29 PM, Marco Mistroni  wrote:

> thanks Neil. I followed original suggestion from Andrw and everything is
> working fine now
> kr
>
> On Sun, Jan 15, 2017 at 4:27 PM, Neil Jonkers  wrote:
>
>> Hello,
>>
>> Can you drop the url:
>>
>>  spark://master:7077
>>
>> The url is used when running Spark in standalone mode.
>>
>> Regards
>>
>>
>>  Original message ----
>> From: Marco Mistroni
>> Date:15/01/2017 16:34 (GMT+02:00)
>> To: User
>> Subject: Running Spark on EMR
>>
>> hi all
>>  could anyone assist here?
>> i am trying to run spark 2.0.0 on an EMR cluster,but i am having issues
>> connecting to the master node
>> So, below is a snippet of what i am doing
>>
>>
>> sc = SparkSession.builder.master(sparkHost).appName("DataProcess"
>> ).getOrCreate()
>>
>> sparkHost is passed as input parameter. that was thought so that i can
>> run the script locally
>> on my spark local instance as well as submitting scripts on any cluster i
>> want
>>
>>
>> Now i have
>> 1 - setup a cluster on EMR.
>> 2 - connected to masternode
>> 3  - launch the command spark-submit myscripts.py spark://master:7077
>>
>> But that results in an connection refused exception
>> Then i have tried to remove the .master call above and launch the script
>> with the following command
>>
>> spark-submit --master spark://master:7077   myscript.py  but still i am
>> getting
>> connectionREfused exception
>>
>>
>> I am using Spark 2.0.0 , could anyone advise on how shall i build the
>> spark session and how can i submit a pythjon script to the cluster?
>>
>> kr
>>  marco
>>
>
>


-- 
Otter Networks UG
http://otternetworks.de
Gotenstraße 17
10829 Berlin


Re: Running Spark on EMR

2017-01-15 Thread Darren Govoni
So what was the answer?


Sent from my Verizon, Samsung Galaxy smartphone
 Original message From: Andrew Holway 
 Date: 1/15/17  11:37 AM  (GMT-05:00) To: Marco 
Mistroni  Cc: Neil Jonkers , User 
 Subject: Re: Running Spark on EMR 
Darn. I didn't respond to the list. Sorry.


On Sun, Jan 15, 2017 at 5:29 PM, Marco Mistroni  wrote:
thanks Neil. I followed original suggestion from Andrw and everything is
working fine now
kr

On Sun, Jan 15, 2017 at 4:27 PM, Neil Jonkers  wrote:
Hello,
Can you drop the url:
 spark://master:7077
The url is used when running Spark in standalone mode.
Regards

 Original message From: Marco Mistroni  Date:15/01/2017 16:34
(GMT+02:00) To: User  Subject: Running Spark on EMR
hi all
 could anyone assist here?
i am trying to run spark 2.0.0 on an EMR cluster,but i am having issues
connecting to the master node
So, below is a snippet of what i am doing

sc = SparkSession.builder.master(sparkHost).appName("DataProcess").getOrCreate()

sparkHost is passed as input parameter. that was thought so that i can run the
script locally
on my spark local instance as well as submitting scripts on any cluster i want

Now i have
1 - setup a cluster on EMR.
2 - connected to masternode
3 - launch the command spark-submit myscripts.py spark://master:7077

But that results in an connection refused exception
Then i have tried to remove the .master call above and launch the script with
the following command

spark-submit --master spark://master:7077   myscript.py  but still i am getting
connectionREfused exception

I am using Spark 2.0.0 , could anyone advise on how shall i build the spark
session and how can i submit a pythjon script to the cluster?

kr
 marco





-- 
Otter Networks UG
http://otternetworks.de
Gotenstraße 17
10829 Berlin



Re: Running Spark on EMR

2017-01-15 Thread Andrew Holway
use yarn :)

"spark-submit --master yarn"

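As a minimal sketch of that pattern (assuming the script is submitted from the
EMR master node, where the YARN configuration is already present): leave the
master out of the code and let spark-submit supply it.

from pyspark.sql import SparkSession

# No .master(...) here: the master is supplied by spark-submit, e.g.
#   spark-submit --master yarn --deploy-mode cluster my_script.py
# or "local[*]" when running the same script locally.
spark = SparkSession.builder.appName("DataProcess").getOrCreate()

df = spark.range(10)      # trivial sanity check that the session works
print(df.count())

spark.stop()
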
On Sun, Jan 15, 2017 at 7:55 PM, Darren Govoni  wrote:

> So what was the answer?
>
>
>
> Sent from my Verizon, Samsung Galaxy smartphone
>
>  Original message 
> From: Andrew Holway 
> Date: 1/15/17 11:37 AM (GMT-05:00)
> To: Marco Mistroni 
> Cc: Neil Jonkers , User 
> Subject: Re: Running Spark on EMR
>
> Darn. I didn't respond to the list. Sorry.
>
>
>
> On Sun, Jan 15, 2017 at 5:29 PM, Marco Mistroni 
> wrote:
>
>> thanks Neil. I followed original suggestion from Andrw and everything is
>> working fine now
>> kr
>>
>> On Sun, Jan 15, 2017 at 4:27 PM, Neil Jonkers 
>> wrote:
>>
>>> Hello,
>>>
>>> Can you drop the url:
>>>
>>>  spark://master:7077
>>>
>>> The url is used when running Spark in standalone mode.
>>>
>>> Regards
>>>
>>>
>>>  Original message 
>>> From: Marco Mistroni
>>> Date:15/01/2017 16:34 (GMT+02:00)
>>> To: User
>>> Subject: Running Spark on EMR
>>>
>>> hi all
>>>  could anyone assist here?
>>> i am trying to run spark 2.0.0 on an EMR cluster,but i am having issues
>>> connecting to the master node
>>> So, below is a snippet of what i am doing
>>>
>>>
>>> sc = SparkSession.builder.master(sparkHost).appName("DataProcess"
>>> ).getOrCreate()
>>>
>>> sparkHost is passed as input parameter. that was thought so that i can
>>> run the script locally
>>> on my spark local instance as well as submitting scripts on any cluster
>>> i want
>>>
>>>
>>> Now i have
>>> 1 - setup a cluster on EMR.
>>> 2 - connected to masternode
>>> 3  - launch the command spark-submit myscripts.py spark://master:7077
>>>
>>> But that results in an connection refused exception
>>> Then i have tried to remove the .master call above and launch the script
>>> with the following command
>>>
>>> spark-submit --master spark://master:7077   myscript.py  but still i am
>>> getting
>>> connectionREfused exception
>>>
>>>
>>> I am using Spark 2.0.0 , could anyone advise on how shall i build the
>>> spark session and how can i submit a pythjon script to the cluster?
>>>
>>> kr
>>>  marco
>>>
>>
>>
>
>
> --
> Otter Networks UG
> http://otternetworks.de
> Gotenstraße 17
> 10829 Berlin
>



-- 
Otter Networks UG
http://otternetworks.de
Gotenstraße 17
10829 Berlin


Re: Running Spark on EMR

2017-01-16 Thread Everett Anderson
On Sun, Jan 15, 2017 at 11:09 AM, Andrew Holway <
andrew.hol...@otternetworks.de> wrote:

> use yarn :)
>
> "spark-submit --master yarn"
>

Doesn't this require first copying out various Hadoop configuration XML
files from the EMR master node to the machine running the spark-submit? Or
is there a well-known minimal set of host/port options to avoid that?

I'm currently copying out several XML files and using them on a client
running spark-submit, but I feel uneasy about this as it seems like the
local values override values on the cluster at runtime -- they're copied up
with the job.




>
>
> On Sun, Jan 15, 2017 at 7:55 PM, Darren Govoni 
> wrote:
>
>> So what was the answer?
>>
>>
>>
>> Sent from my Verizon, Samsung Galaxy smartphone
>>
>>  Original message 
>> From: Andrew Holway 
>> Date: 1/15/17 11:37 AM (GMT-05:00)
>> To: Marco Mistroni 
>> Cc: Neil Jonkers , User 
>> Subject: Re: Running Spark on EMR
>>
>> Darn. I didn't respond to the list. Sorry.
>>
>>
>>
>> On Sun, Jan 15, 2017 at 5:29 PM, Marco Mistroni 
>> wrote:
>>
>>> thanks Neil. I followed original suggestion from Andrw and everything is
>>> working fine now
>>> kr
>>>
>>> On Sun, Jan 15, 2017 at 4:27 PM, Neil Jonkers 
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> Can you drop the url:
>>>>
>>>>  spark://master:7077
>>>>
>>>> The url is used when running Spark in standalone mode.
>>>>
>>>> Regards
>>>>
>>>>
>>>>  Original message 
>>>> From: Marco Mistroni
>>>> Date:15/01/2017 16:34 (GMT+02:00)
>>>> To: User
>>>> Subject: Running Spark on EMR
>>>>
>>>> hi all
>>>>  could anyone assist here?
>>>> i am trying to run spark 2.0.0 on an EMR cluster,but i am having issues
>>>> connecting to the master node
>>>> So, below is a snippet of what i am doing
>>>>
>>>>
>>>> sc = SparkSession.builder.master(sparkHost).appName("DataProcess"
>>>> ).getOrCreate()
>>>>
>>>> sparkHost is passed as input parameter. that was thought so that i can
>>>> run the script locally
>>>> on my spark local instance as well as submitting scripts on any cluster
>>>> i want
>>>>
>>>>
>>>> Now i have
>>>> 1 - setup a cluster on EMR.
>>>> 2 - connected to masternode
>>>> 3  - launch the command spark-submit myscripts.py spark://master:7077
>>>>
>>>> But that results in an connection refused exception
>>>> Then i have tried to remove the .master call above and launch the
>>>> script with the following command
>>>>
>>>> spark-submit --master spark://master:7077   myscript.py  but still i
>>>> am getting
>>>> connectionREfused exception
>>>>
>>>>
>>>> I am using Spark 2.0.0 , could anyone advise on how shall i build the
>>>> spark session and how can i submit a pythjon script to the cluster?
>>>>
>>>> kr
>>>>  marco
>>>>
>>>
>>>
>>
>>
>> --
>> Otter Networks UG
>> http://otternetworks.de
>> Gotenstraße 17
>> 10829 Berlin
>>
>
>
>
> --
> Otter Networks UG
> http://otternetworks.de
> Gotenstraße 17
> 10829 Berlin
>


Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
Dear Sparkers,

Once again in times of desperation, I leave what remains of my mental sanity to 
this wise and knowledgeable community.

I have a Spark job (on EMR 5.8.0) which had been running daily for months, if 
not the whole year, with absolutely no supervision. This changed all of a sudden 
for reasons I do not understand.

The volume of data processed daily has been slowly increasing over the past 
year but has been stable in the last couple of months. Since I'm only processing 
the past 8 days' worth of data, I do not think that increased data volume is to 
blame here. Yes, I did check the volume of data for the past few days.

Here is a short description of the issue.

- The Spark job starts normally and proceeds successfully with the first few 
stages.
- Once we reach the dreaded stage, all tasks are performed successfully (they 
typically take not more than 1 minute each), except for the /very/ first one 
(task 0.0) which never finishes.

Here is what the log looks like (simplified for readability):


INFO TaskSetManager: Finished task 243.0 in stage 4.0 (TID 929) in 49412 ms on 
... (executor 12) (254/256)
INFO TaskSetManager: Finished task 255.0 in stage 4.0 (TID 941) in 48394 ms on 
... (executor 7) (255/256)
INFO ExecutorAllocationManager: Request to remove executorIds: 14
INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 14
INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 14
INFO YarnAllocator: Driver requested a total number of 0 executor(s).


Why is that? There is still a task waiting to be completed right? Isn't an 
executor needed for that?

Afterwards, all executors are getting killed (dynamic allocation is turned on):


INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
INFO ExecutorAllocationManager: Removing executor 14 because it has been idle 
for 60 seconds (new desired total will be 5)
.
.
.
INFO ExecutorAllocationManager: Request to remove executorIds: 7
INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 7
INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 7
INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 7.
INFO ExecutorAllocationManager: Removing executor 7 because it has been idle 
for 60 seconds (new desired total will be 1)
INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
INFO DAGScheduler: Executor lost: 7 (epoch 4)
INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from 
BlockManagerMaster.
INFO YarnClusterScheduler: Executor 7 on ... killed by driver.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7, ..., 
44289, None)
INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
INFO ExecutorAllocationManager: Existing executor 7 has been removed (new total 
is 1)


Then, there's nothing more in the driver's log. Nothing. The cluster then runs 
for hours, with no progress being made, and no executors allocated.

Here is what I tried:

- More memory per executor: from 13 GB to 24 GB by increments.
- Explicit repartition() on the RDD: from 128 to 256 partitions.

The offending stage used to be a rather innocent looking keyBy(). After adding 
some repartition() the offending stage was then a mapToPair(). During my last 
experiments, it turned out the repartition(256) itself is now the culprit.

I like Spark, but its mysteries will manage to send me to a mental hospital one 
of these days.

Can anyone shed light on what is going on here, or maybe offer some suggestions 
or pointers to relevant sources of information?

I am completely clueless.

Seasons greetings,

Jeroen


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 17:41, Richard Qiao  wrote:
> Are you able to specify which path of data filled up?

I can narrow it down to a bunch of files but it's not so straightforward.

> Any logs not rolled over?

I have to manually terminate the cluster, but there is nothing more in the 
driver's log when I check it from the AWS console while the cluster is still 
running. 

JM


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on EMR suddenly stalling

2017-12-28 Thread Patrick Alwell
Jeroen,

Anytime there is a shuffle over the network, Spark moves to a new stage. It seems 
like you are having issues either pre- or post-shuffle. Have you looked at a 
resource management tool like Ganglia to determine if this is a memory- or 
thread-related issue? The Spark UI?

You are using groupByKey(); have you thought of an alternative like 
aggregateByKey() or combineByKey() to reduce shuffling?
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/avoid_groupbykey_when_performing_an_associative_re/avoid-groupbykey-when-performing-a-group-of-multiple-items-by-key.html
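
For illustration, a small sketch of the idea (made-up RDD of (key, value) pairs):
computing a per-key sum with reduceByKey/aggregateByKey combines values map-side
before the shuffle, whereas groupByKey ships every value across the network.

from pyspark import SparkContext

sc = SparkContext(appName="groupByKey-vs-aggregateByKey")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Shuffles every value, then sums on the reduce side.
sums_grouped = pairs.groupByKey().mapValues(sum)

# Combines partial sums on each partition before shuffling.
sums_reduced = pairs.reduceByKey(lambda x, y: x + y)

# aggregateByKey: here computing (sum, count) per key to derive an average.
sum_count = pairs.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),     # merge a value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))     # merge accumulators across partitions
averages = sum_count.mapValues(lambda t: t[0] / t[1])

print(sums_reduced.collect(), averages.collect())
sc.stop()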

Dynamic allocation is great; but sometimes I’ve found explicitly setting the 
num executors, cores per executor, and memory per executor to be a better 
alternative.

Take a look at the YARN logs as well for the particular executor in question. 
Executors can have multiple tasks, and will often fail if they have more tasks 
than available threads.

As for partitioning the data: you could also look into your level of 
parallelism, which is correlated to the splittability (blocks) of your data. This 
will be based on your largest RDD.
https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism

Spark is like C/C++: you need to manage the memory buffer or the compiler will 
throw you out ;)
https://spark.apache.org/docs/latest/hardware-provisioning.html

Hang in there, this is the more complicated stage of placing a spark 
application into production. The Yarn logs should point you in the right 
direction.

It’s tough to debug over email, so hopefully this information is helpful.

-Pat


On 12/28/17, 9:57 AM, "Jeroen Miller"  wrote:

On 28 Dec 2017, at 17:41, Richard Qiao  wrote:
> Are you able to specify which path of data filled up?

I can narrow it down to a bunch of files but it's not so straightforward.

> Any logs not rolled over?

I have to manually terminate the cluster but there is nothing more in the 
driver's log when I check it from the AWS console when the cluster is still 
running. 

JM


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on EMR suddenly stalling

2017-12-28 Thread Maximiliano Felice
Hi Jeroen,

I experienced a similar issue a few weeks ago. The situation was a result
of a mix of speculative execution and OOM issues in the container.

First of all, when an executor takes too much time in Spark, it is handled
by the YARN speculative execution, which will launch a new executor and
allocate it in a new container. In our case, some tasks were throwing OOM
exceptions while executing, but not on the executor itself, *but on the
YARN container.*

It turns out that YARN will try several times to run an application when
something fails in it. Specifically, it will try
*yarn.resourcemanager.am.max-attempts* times to run the application before
failing, which has a default value of 2 and is not modified in EMR
configurations (check here).

We've managed to check that when we have speculative execution enabled and
some YARN containers which were running speculative tasks died, they did
consume an attempt from the *max-attempts* count. This wouldn't represent any
issue in normal behavior, but it seems that if all the retries were
consumed in a task that has started speculative execution, the application
itself doesn't fail, but it hangs the task expecting to reschedule it
sometime. As the attempts are zero, it never reschedules it and the
application itself fails to finish.
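
If that is the failure mode, the knobs involved look roughly like the sketch below
(property names are the standard Spark-on-YARN ones; whether these values are
appropriate depends on the job):

from pyspark.sql import SparkSession

# Sketch: disable speculative execution and allow more application attempts.
# The effective attempt count is still capped by yarn.resourcemanager.am.max-attempts
# on the cluster side, and spark.yarn.maxAppAttempts is normally passed at submission
# time (spark-submit --conf), since it applies to the application master launch;
# it is shown here in builder form just to name the properties.
spark = (SparkSession.builder
         .appName("retry-tuning-sketch")
         .config("spark.speculation", "false")
         .config("spark.yarn.maxAppAttempts", "4")
         .getOrCreate())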

I checked this theory repeatedly, always getting the expected results.
Several times I changed the named YARN configuration and it always starts
speculative retries on this task and hangs when reaching max-attempts
number of broken YARN containers.

I personally think that this issue should be possible to reproduce without
the speculative execution configured.

So, what would I do if I were you?

1. Check the number of tasks scheduled. If you see one (or more) tasks
missing when you do the final sum, then you might be encountering this
issue.
2. Check the *container* logs to see if anything broke. OOM is what failed
to me.
3. Contact AWS EMR support, although in my experience they were of no help
at all.


Hope this helps you a bit!



2017-12-28 14:57 GMT-03:00 Jeroen Miller :

> On 28 Dec 2017, at 17:41, Richard Qiao  wrote:
> > Are you able to specify which path of data filled up?
>
> I can narrow it down to a bunch of files but it's not so straightforward.
>
> > Any logs not rolled over?
>
> I have to manually terminate the cluster but there is nothing more in the
> driver's log when I check it from the AWS console when the cluster is still
> running.
>
> JM
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark on EMR suddenly stalling

2017-12-28 Thread Gourav Sengupta
HI Jeroen,

Can I get a few pieces of additional information please?

In the EMR cluster what are the other applications that you have enabled
(like HIVE, FLUME, Livy, etc).
Are you using SPARK Session? If yes is your application using cluster mode
or client mode?
Have you read the EC2 service level agreement?
Is your cluster on auto scaling group?
Are you scheduling your job by adding another new step into the EMR
cluster? Or is it the same job running always triggered by some background
process?
Since EMR are supposed to be ephemeral, have you tried creating a new
cluster and trying your job in that?


Regards,
Gourav Sengupta

On Thu, Dec 28, 2017 at 4:06 PM, Jeroen Miller 
wrote:

> Dear Sparkers,
>
> Once again in times of desperation, I leave what remains of my mental
> sanity to this wise and knowledgeable community.
>
> I have a Spark job (on EMR 5.8.0) which had been running daily for months,
> if not the whole year, with absolutely no supervision. This changed all of
> sudden for reasons I do not understand.
>
> The volume of data processed daily has been slowly increasing over the
> past year but has been stable in the last couple months. Since I'm only
> processing the past 8 days's worth of data I do not think that increased
> data volume is to blame here. Yes, I did check the volume of data for the
> past few days.
>
> Here is a short description of the issue.
>
> - The Spark job starts normally and proceeds successfully with the first
> few stages.
> - Once we reach the dreaded stage, all tasks are performed successfully
> (they typically take not more than 1 minute each), except for the /very/
> first one (task 0.0) which never finishes.
>
> Here is what the log looks like (simplified for readability):
>
> 
> INFO TaskSetManager: Finished task 243.0 in stage 4.0 (TID 929) in 49412
> ms on ... (executor 12) (254/256)
> INFO TaskSetManager: Finished task 255.0 in stage 4.0 (TID 941) in 48394
> ms on ... (executor 7) (255/256)
> INFO ExecutorAllocationManager: Request to remove executorIds: 14
> INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 14
> INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed
> is 14
> INFO YarnAllocator: Driver requested a total number of 0 executor(s).
> 
>
> Why is that? There is still a task waiting to be completed right? Isn't an
> executor needed for that?
>
> Afterwards, all executors are getting killed (dynamic allocation is turned
> on):
>
> 
> INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
> INFO ExecutorAllocationManager: Removing executor 14 because it has been
> idle for 60 seconds (new desired total will be 5)
> .
> .
> .
> INFO ExecutorAllocationManager: Request to remove executorIds: 7
> INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 7
> INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed
> is 7
> INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 7.
> INFO ExecutorAllocationManager: Removing executor 7 because it has been
> idle for 60 seconds (new desired total will be 1)
> INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
> INFO DAGScheduler: Executor lost: 7 (epoch 4)
> INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from
> BlockManagerMaster.
> INFO YarnClusterScheduler: Executor 7 on ... killed by driver.
> INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7,
> ..., 44289, None)
> INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
> INFO ExecutorAllocationManager: Existing executor 7 has been removed (new
> total is 1)
> 
>
> Then, there's nothing more in the driver's log. Nothing. The cluster then
> run for hours, with no progress being made, and no executors allocated.
>
> Here is what I tried:
>
> - More memory per executor: from 13 GB to 24 GB by increments.
> - Explicit repartition() on the RDD: from 128 to 256 partitions.
>
> The offending stage used to be a rather innocent looking keyBy(). After
> adding some repartition() the offending stage was then a mapToPair().
> During my last experiments, it turned out the repartition(256) itself is
> now the culprit.
>
> I like Spark, but its mysteries will manage to send me in a mental
> hospital one of those days.
>
> Can anyone shed light on what is going on here, or maybe offer some
> suggestions or pointers to relevant source of information?
>
> I am completely clueless.
>
> Seasons greetings,
>
> Jeroen
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Fwd: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:25, Patrick Alwell  wrote:
> You are using groupByKey() have you thought of an alternative like 
> aggregateByKey() or combineByKey() to reduce shuffling?

I am aware of this indeed. I do have a groupByKey() that is difficult to avoid, 
but the problem occurs afterwards.

> Dynamic allocation is great; but sometimes I’ve found explicitly setting the 
> num executors, cores per executor, and memory per executor to be a better 
> alternative.

I will try with dynamic allocation off.

> Take a look at the yarn logs as well for the particular executor in question. 
> Executors can have multiple tasks; and will often fail if they have more 
> tasks than available threads.

The trouble is there is nothing significant in the logs (read: that is clear 
enough for me to understand!). Any special message I could grep for?

> [...] https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
> [...] https://spark.apache.org/docs/latest/hardware-provisioning.html

Thanks for the pointers -- will have a look!

Jeroen



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:40, Maximiliano Felice  
wrote:
> I experienced a similar issue a few weeks ago. The situation was a result of 
> a mix of speculative execution and OOM issues in the container.

Interesting! However I don't have any OOM exception in the logs. Does that rule 
out your hypothesis?

> We've managed to check that when we have speculative execution enabled and 
> some YARN containers which were running speculative tasks died, they did take 
> a chance from the max-attempts number. This wouldn't represent any issue in 
> normal behavior, but it seems that if all the retries were consumed in a task 
> that has started speculative execution, the application itself doesn't fail, 
> but it hangs the task expecting to reschedule it sometime. As the attempts 
> are zero, it never reschedules it and the application itself fails to finish.

Hmm, this sounds like a huge design fail to me, but I'm sure there are very 
complicated issues that go way over my head.

> 1. Check the number of tasks scheduled. If you see one (or more) tasks 
> missing when you do the final sum, then you might be encountering this issue.
> 2. Check the container logs to see if anything broke. OOM is what failed to 
> me.

I can't find anything in the logs from EMR. Should I expect to find explicit 
OOM exception messages? 

JM


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:42, Gourav Sengupta  wrote:
> In the EMR cluster what are the other applications that you have enabled 
> (like HIVE, FLUME, Livy, etc).

Nothing that I can think of, just a Spark step (unless EMR is doing fancy stuff 
behind my back).

> Are you using SPARK Session?

Yes.

> If yes is your application using cluster mode or client mode?

Cluster mode.

> Have you read the EC2 service level agreement?

I did not -- I doubt it has the answer to my problem though! :-)

> Is your cluster on auto scaling group?

Nope.

> Are you scheduling your job by adding another new step into the EMR cluster? 
> Or is it the same job running always triggered by some background process?
> Since EMR are supposed to be ephemeral, have you tried creating a new cluster 
> and trying your job in that?

I'm creating a new cluster on demand, specifically for that job. No other 
application runs on it.

JM


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on EMR suddenly stalling

2017-12-28 Thread Gourav Sengupta
Hi Jeroen,

can you then try to use EMR version 5.10 or EMR version 5.11 instead?
can you please try selecting a subnet which is in a different availability
zone?
if possible just try to increase the number of task instances and see the
difference?
also, in case you are using caching, check the total amount of space being
used; you may also want to persist intermediate data to S3 in the default
Parquet format as a worst-case fallback and then work through the steps that
you think are failing using a Jupyter or Spark notebook.
Also can you please report the number of containers that your job is
creating by looking at the metrics in the EMR console?

Also if you see the spark UI then you can easily see which particular step
is taking the longest period of time - you just have to drill in a bit in
order to see that. Generally in case shuffling is an issue then it
definitely appears in the SPARK UI as I drill into the steps and see which
particular one is taking the longest.


Since you do not have a long-running cluster (which I mistakenly inferred from
your statement about a long-running job), things should be fine.


Regards,
Gourav Sengupta


On Thu, Dec 28, 2017 at 7:43 PM, Jeroen Miller 
wrote:

> On 28 Dec 2017, at 19:42, Gourav Sengupta 
> wrote:
> > In the EMR cluster what are the other applications that you have enabled
> (like HIVE, FLUME, Livy, etc).
>
> Nothing that I can think of, just a Spark step (unless EMR is doing fancy
> stuff behind my back).
>
> > Are you using SPARK Session?
>
> Yes.
>
> > If yes is your application using cluster mode or client mode?
>
> Cluster mode.
>
> > Have you read the EC2 service level agreement?
>
> I did not -- I doubt it has the answer to my problem though! :-)
>
> > Is your cluster on auto scaling group?
>
> Nope.
>
> > Are you scheduling your job by adding another new step into the EMR
> cluster? Or is it the same job running always triggered by some background
> process?
> > Since EMR are supposed to be ephemeral, have you tried creating a new
> cluster and trying your job in that?
>
> I'm creating a new cluster on demand, specifically for that job. No other
> application runs on it.
>
> JM
>
>


Fwd: Spark on EMR suddenly stalling

2017-12-29 Thread Jeroen Miller
Hello,

Just a quick update, as I have not made much progress yet.

On 28 Dec 2017, at 21:09, Gourav Sengupta  wrote:
> can you try to then use the EMR version 5.10 instead or EMR version 5.11 
> instead? 

Same issue with EMR 5.11.0. Task 0 in one stage never finishes.

> can you please try selecting a subnet which is in a different availability 
> zone?

I did not try this yet. But why should that make a difference?

> if possible just try to increase the number of task instances and see the 
> difference?

I tried with 512 partitions -- no difference.

> also in case you are using caching,

No caching used.

> Also can you please report the number of containers that your job is creating 
> by looking at the metrics in the EMR console?

8 containers if I trust the directories in j-xxx/containers/application_xxx/.

> Also if you see the spark UI then you can easily see which particular step is 
> taking the longest period of time - you just have to drill in a bit in order 
> to see that. Generally in case shuffling is an issue then it definitely 
> appears in the SPARK UI as I drill into the steps and see which particular 
> one is taking the longest.

I always have issues with the Spark UI on EC2 -- it never seems to be up to 
date.

JM



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on EMR suddenly stalling

2017-12-29 Thread Jeroen Miller
On 28 Dec 2017, at 19:25, Patrick Alwell  wrote:
> Dynamic allocation is great; but sometimes I’ve found explicitly setting the 
> num executors, cores per executor, and memory per executor to be a better 
> alternative.

No difference with spark.dynamicAllocation.enabled set to false.

JM


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on EMR suddenly stalling

2017-12-29 Thread Shushant Arora
you may have to recreate your cluster with the below configuration at EMR
creation:
"Configurations": [
{
"Properties": {
"maximizeResourceAllocation": "false"
},
"Classification": "spark"
}
]
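
If the cluster is created programmatically, the same classification can be passed
at creation time; a rough boto3 sketch (instance types, counts, roles, release label
and region below are placeholders for whatever you normally use):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster: the Configurations entry mirrors the JSON above.
response = emr.run_job_flow(
    Name="spark-cluster",
    ReleaseLabel="emr-5.11.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "spark",
            "Properties": {"maximizeResourceAllocation": "false"},
        }
    ],
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])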

On Fri, Dec 29, 2017 at 11:57 PM, Jeroen Miller 
wrote:

> On 28 Dec 2017, at 19:25, Patrick Alwell  wrote:
> > Dynamic allocation is great; but sometimes I’ve found explicitly setting
> the num executors, cores per executor, and memory per executor to be a
> better alternative.
>
> No difference with spark.dynamicAllocation.enabled set to false.
>
> JM
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark on EMR suddenly stalling

2017-12-30 Thread Gourav Sengupta
Hi,

Please try to use the SPARK UI in the way that AWS EMR recommends; it
should be available from the resource manager. I never ever had any problem
working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF
DEBUGGING.

Sadly, I cannot be of much help unless we go for a screen share session
over google chat or skype.

Also, I ALWAYS prefer the maximize Resource Allocation setting in EMR to be
set to true.

Besides that, there is a metric in the EMR console which shows the number
of containers generated by your job on graphs.



Regards,
Gourav Sengupta

On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller 
wrote:

> Hello,
>
> Just a quick update as I did not made much progress yet.
>
> On 28 Dec 2017, at 21:09, Gourav Sengupta 
> wrote:
> > can you try to then use the EMR version 5.10 instead or EMR version 5.11
> instead?
>
> Same issue with EMR 5.11.0. Task 0 in one stage never finishes.
>
> > can you please try selecting a subnet which is in a different
> availability zone?
>
> I did not try this yet. But why should that make a difference?
>
> > if possible just try to increase the number of task instances and see
> the difference?
>
> I tried with 512 partitions -- no difference.
>
> > also in case you are using caching,
>
> No caching used.
>
> > Also can you please report the number of containers that your job is
> creating by looking at the metrics in the EMR console?
>
> 8 containers if I trust the directories in j-xxx/containers/application_
> xxx/.
>
> > Also if you see the spark UI then you can easily see which particular
> step is taking the longest period of time - you just have to drill in a bit
> in order to see that. Generally in case shuffling is an issue then it
> definitely appears in the SPARK UI as I drill into the steps and see which
> particular one is taking the longest.
>
> I always have issues with the Spark UI on EC2 -- it never seems to be up
> to date.
>
> JM
>
>


Re: Spark on EMR suddenly stalling

2018-01-01 Thread Rohit Karlupia
Here is the list that I will probably try to fill:

   1. Check GC on the offending executor when the task is running. Maybe
    you need even more memory.
   2. Go back to some previous successful run of the job and check the
   spark ui for the offending stage and check max task time/max input/max
   shuffle in/out for the largest task. Will help you understand the degree of
   skew in this stage.
   3. Take a thread dump of the executor from the Spark UI and verify whether
    the task is really doing any work or is stuck in some deadlock. Some of the
    Hive SerDes are not really usable from multi-threaded/multi-use Spark
    executors.
   4. Take a thread dump of the executor from the Spark UI and verify if
   the task is spilling to disk. Playing with storage and memory fraction or
   generally increasing the memory will help.
   5. Check the disk utilisation on the machine running the executor.
   6. Look for event loss messages in the logs due to event queue full.
   Loss of events can send some of the spark components into really bad
   states.
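
For item 1 above, a minimal way to surface GC activity in the executor logs (a
sketch only; in practice these flags are usually passed to spark-submit via --conf,
since they must be in place before the executors launch):

from pyspark.sql import SparkSession

# Print GC details into each executor's stderr (visible in the YARN logs).
spark = (SparkSession.builder
         .appName("gc-logging-sketch")
         .config("spark.executor.extraJavaOptions",
                 "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
         .getOrCreate())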


thanks,
rohitk



On Sun, Dec 31, 2017 at 12:50 AM, Gourav Sengupta  wrote:

> Hi,
>
> Please try to use the SPARK UI from the way that AWS EMR recommends, it
> should be available from the resource manager. I never ever had any problem
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF
> DEBUGGING.
>
> Sadly, I cannot be of much help unless we go for a screen share session
> over google chat or skype.
>
> Also, I ALWAYS prefer the maximize Resource Allocation setting in EMR to
> be set to true.
>
> Besides that, there is a metrics in the EMR console which shows the number
> of containers getting generated by your job on graphs.
>
>
>
> Regards,
> Gourav Sengupta
>
> On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller 
> wrote:
>
>> Hello,
>>
>> Just a quick update as I did not made much progress yet.
>>
>> On 28 Dec 2017, at 21:09, Gourav Sengupta 
>> wrote:
>> > can you try to then use the EMR version 5.10 instead or EMR version
>> 5.11 instead?
>>
>> Same issue with EMR 5.11.0. Task 0 in one stage never finishes.
>>
>> > can you please try selecting a subnet which is in a different
>> availability zone?
>>
>> I did not try this yet. But why should that make a difference?
>>
>> > if possible just try to increase the number of task instances and see
>> the difference?
>>
>> I tried with 512 partitions -- no difference.
>>
>> > also in case you are using caching,
>>
>> No caching used.
>>
>> > Also can you please report the number of containers that your job is
>> creating by looking at the metrics in the EMR console?
>>
>> 8 containers if I trust the directories in j-xxx/containers/application_x
>> xx/.
>>
>> > Also if you see the spark UI then you can easily see which particular
>> step is taking the longest period of time - you just have to drill in a bit
>> in order to see that. Generally in case shuffling is an issue then it
>> definitely appears in the SPARK UI as I drill into the steps and see which
>> particular one is taking the longest.
>>
>> I always have issues with the Spark UI on EC2 -- it never seems to be up
>> to date.
>>
>> JM
>>
>>
>


Re: Spark on EMR suddenly stalling

2018-01-01 Thread M Singh
Hi Jeroen:
I am not sure if I missed it - but can you let us know what is your input 
source and output sink ?  
In some cases, I found that saving to S3 was a problem. In this case I started 
saving the output to the EMR HDFS and later copied to S3 using s3-dist-cp which 
solved our issue.
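
For reference, the pattern looks roughly like this (bucket, paths, cluster id and
region are made up): write to HDFS from the Spark job, then add an s3-dist-cp step
to copy the output to S3.

import boto3

# Step 1 (inside the Spark job): write to HDFS instead of S3, e.g.
#   df.write.parquet("hdfs:///tmp/daily_output")
#
# Step 2: add an s3-dist-cp step to copy the HDFS output to S3.
emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",          # hypothetical cluster id
    Steps=[{
        "Name": "copy-output-to-s3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "hdfs:///tmp/daily_output",
                "--dest", "s3://my-bucket/daily_output",
            ],
        },
    }],
)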

Mans 

On Monday, January 1, 2018 7:41 AM, Rohit Karlupia  
wrote:
 

 Here is the list that I will probably try to fill:   
   - Check GC on the offending executor when the task is running. May be you 
need even more memory.  
   - Go back to some previous successful run of the job and check the spark ui 
for the offending stage and check max task time/max input/max shuffle in/out 
for the largest task. Will help you understand the degree of skew in this 
stage. 
   - Take a thread dump of the executor from the Spark UI and verify if the 
task is really doing any work or it stuck in some deadlock. Some of the hive 
serde are not really usable from multi-threaded/multi-use spark executors. 
   - Take a thread dump of the executor from the Spark UI and verify if the 
task is spilling to disk. Playing with storage and memory fraction or generally 
increasing the memory will help. 
   - Check the disk utilisation on the machine running the executor. 
   - Look for event loss messages in the logs due to event queue full. Loss of 
events can send some of the spark components into really bad states.  

thanks,
rohitk


On Sun, Dec 31, 2017 at 12:50 AM, Gourav Sengupta  
wrote:

Hi,
Please try to use the SPARK UI from the way that AWS EMR recommends, it should 
be available from the resource manager. I never ever had any problem working 
with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF DEBUGGING.
Sadly, I cannot be of much help unless we go for a screen share session over 
google chat or skype. 
Also, I ALWAYS prefer the maximize Resource Allocation setting in EMR to be set 
to true. 
Besides that, there is a metrics in the EMR console which shows the number of 
containers getting generated by your job on graphs.


Regards,
Gourav Sengupta
On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller  wrote:

Hello,

Just a quick update as I did not made much progress yet.

On 28 Dec 2017, at 21:09, Gourav Sengupta  wrote:
> can you try to then use the EMR version 5.10 instead or EMR version 5.11 
> instead?

Same issue with EMR 5.11.0. Task 0 in one stage never finishes.

> can you please try selecting a subnet which is in a different availability 
> zone?

I did not try this yet. But why should that make a difference?

> if possible just try to increase the number of task instances and see the 
> difference?

I tried with 512 partitions -- no difference.

> also in case you are using caching,

No caching used.

> Also can you please report the number of containers that your job is creating 
> by looking at the metrics in the EMR console?

8 containers if I trust the directories in j-xxx/containers/application_xxx/.

> Also if you see the spark UI then you can easily see which particular step is 
> taking the longest period of time - you just have to drill in a bit in order 
> to see that. Generally in case shuffling is an issue then it definitely 
> appears in the SPARK UI as I drill into the steps and see which particular 
> one is taking the longest.

I always have issues with the Spark UI on EC2 -- it never seems to be up to 
date.

JM







   

Re: Spark on EMR suddenly stalling

2018-01-01 Thread Jeroen Miller
Hello Gourav,

On 30 Dec 2017, at 20:20, Gourav Sengupta  wrote:
> Please try to use the SPARK UI from the way that AWS EMR recommends, it 
> should be available from the resource manager. I never ever had any problem 
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF DEBUGGING.

For some reason, sometimes there is absolutely nothing showing up in the Spark 
UI, or the UI is not refreshed, e.g. the UI reports the current stage as #x while 
the logs show that stage #y (with y > x) is currently under way.

It may very well be that the source of this problem lies between the keyboard 
and the chair, but if this is the case, I do not know how to solve this.

> Also, I ALWAYS prefer the maximize Resource Allocation setting in EMR to be 
> set to true. 

Thanks for the tip -- will try this setting in my next batch of experiments!

JM


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on EMR suddenly stalling

2018-01-01 Thread Jeroen Miller
Hello Mans,

On 1 Jan 2018, at 17:12, M Singh  wrote:
> I am not sure if I missed it - but can you let us know what is your input 
> source and output sink ?

Reading from S3 and writing to S3.

However the never-ending task 0.0 happens in a stage way before outputting 
anything to S3.

Regards,

Jeroen


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on EMR suddenly stalling

2018-01-02 Thread Gourav Sengupta
Hi Jeroen,

in case you are using HIVE partitions how many partitions do you have?

Also is there any chance that you might post the code?

Regards,
Gourav Sengupta

On Tue, Jan 2, 2018 at 7:50 AM, Jeroen Miller 
wrote:

> Hello Gourav,
>
> On 30 Dec 2017, at 20:20, Gourav Sengupta 
> wrote:
> > Please try to use the SPARK UI from the way that AWS EMR recommends, it
> should be available from the resource manager. I never ever had any problem
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF
> DEBUGGING.
>
> For some reason sometimes there is absolutely nothing showing up in the
> Spark UI or the UI is not refreshed, e.g. for the current stage is #x while
> the logs shows stage #y (with y > x) is currently under way.
>
> It may very well be that the source of this problem lies between the
> keyboard and the chair, but if this is the case, I do not know how to solve
> this.
>
> > Also, I ALWAYS prefer the maximize Resource Allocation setting in EMR to
> be set to true.
>
> Thanks for the tip -- will try this setting in my next batch of
> experiments!
>
> JM
>
>


Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Is there an example about how to load data from a public S3 bucket in Python? I 
haven't found any.

Thank you,



Re: Spark on EMR with S3 example (Python)

2015-07-14 Thread Sujit Pal
Hi Roberto,

I have written PySpark code that reads from private S3 buckets, it should
be similar for public S3 buckets as well. You need to set the AWS access
and secret keys into the SparkContext, then you can access the S3 folders
and files with their s3n:// paths. Something like this:

from pyspark import SparkContext

sc = SparkContext()
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_key)

mydata = sc.textFile("s3n://mybucket/my_input_folder") \
    .map(lambda x: do_something(x)) \
    .saveAsTextFile("s3://mybucket/my_output_folder")
...

You can read and write sequence files as well - these are the only 2
formats I have tried, but I'm sure the other ones like JSON would work
also. Another approach is to embed the AWS access key and secret key into
the s3n:// path.

I wasn't able to use the s3 protocol, but s3n is equivalent (I believe its
an older version but not sure) but it works for access.

Hope this helps,
Sujit


On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto  wrote:

> Is there an example about how to load data from a public S3 bucket in
> Python? I haven’t found any.
>
>
>
> Thank you,
>
>
>


RE: Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Hi Sujit,
I just wanted to access public datasets on Amazon. Do I still need to provide 
the keys?

Thank you,


From: Sujit Pal [mailto:sujitatgt...@gmail.com]
Sent: Tuesday, July 14, 2015 3:14 PM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: Spark on EMR with S3 example (Python)

Hi Roberto,

I have written PySpark code that reads from private S3 buckets, it should be 
similar for public S3 buckets as well. You need to set the AWS access and 
secret keys into the SparkContext, then you can access the S3 folders and files 
with their s3n:// paths. Something like this:

sc = SparkContext()
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_key)

mydata = sc.textFile("s3n://mybucket/my_input_folder") \
.map(lambda x: do_something(x)) \
.saveAsTextFile("s3://mybucket/my_output_folder")
...

You can read and write sequence files as well - these are the only 2 formats I 
have tried, but I'm sure the other ones like JSON would work also. Another 
approach is to embed the AWS access key and secret key into the s3n:// path.

I wasn't able to use the s3 protocol, but s3n is equivalent (I believe its an 
older version but not sure) but it works for access.

Hope this helps,
Sujit


On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto 
mailto:rpagli...@appcomsci.com>> wrote:
Is there an example about how to load data from a public S3 bucket in Python? I 
haven’t found any.

Thank you,




Re: Spark on EMR with S3 example (Python)

2015-07-14 Thread Akhil Das
I think any requests going to s3*:// requires the credentials. If they have
made it public (via http) then you won't require the keys.

Thanks
Best Regards

On Wed, Jul 15, 2015 at 2:26 AM, Pagliari, Roberto 
wrote:

> Hi Sujit,
>
> I just wanted to access public datasets on Amazon. Do I still need to
> provide the keys?
>
>
>
> Thank you,
>
>
>
>
>
> *From:* Sujit Pal [mailto:sujitatgt...@gmail.com]
> *Sent:* Tuesday, July 14, 2015 3:14 PM
> *To:* Pagliari, Roberto
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on EMR with S3 example (Python)
>
>
>
> Hi Roberto,
>
>
>
> I have written PySpark code that reads from private S3 buckets, it should
> be similar for public S3 buckets as well. You need to set the AWS access
> and secret keys into the SparkContext, then you can access the S3 folders
> and files with their s3n:// paths. Something like this:
>
>
>
> sc = SparkContext()
>
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
>
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey",
> aws_secret_key)
>
>
>
> mydata = sc.textFile("s3n://mybucket/my_input_folder") \
>
> .map(lambda x: do_something(x)) \
>
> .saveAsTextFile("s3://mybucket/my_output_folder")
>
> ...
>
>
>
> You can read and write sequence files as well - these are the only 2
> formats I have tried, but I'm sure the other ones like JSON would work
> also. Another approach is to embed the AWS access key and secret key into
> the s3n:// path.
>
>
>
> I wasn't able to use the s3 protocol, but s3n is equivalent (I believe its
> an older version but not sure) but it works for access.
>
>
>
> Hope this helps,
>
> Sujit
>
>
>
>
>
> On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto <
> rpagli...@appcomsci.com> wrote:
>
> Is there an example about how to load data from a public S3 bucket in
> Python? I haven’t found any.
>
>
>
> Thank you,
>
>
>
>
>


Re: Spark on EMR with S3 example (Python)

2015-07-15 Thread Sujit Pal
Hi Roberto,

I think you would need to as Akhil said. Just checked from this page:

http://aws.amazon.com/public-data-sets/

and clicking through to a few dataset links, all of them are available on
s3 (some are available via http and ftp, but I think the point of these
datasets are that they are usually very large so having it on s3 ensures
that its easier to take your code to it than bring the datasets to your
code.

-sujit


On Tue, Jul 14, 2015 at 1:56 PM, Pagliari, Roberto 
wrote:

> Hi Sujit,
>
> I just wanted to access public datasets on Amazon. Do I still need to
> provide the keys?
>
>
>
> Thank you,
>
>
>
>
>
> *From:* Sujit Pal [mailto:sujitatgt...@gmail.com]
> *Sent:* Tuesday, July 14, 2015 3:14 PM
> *To:* Pagliari, Roberto
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on EMR with S3 example (Python)
>
>
>
> Hi Roberto,
>
>
>
> I have written PySpark code that reads from private S3 buckets, it should
> be similar for public S3 buckets as well. You need to set the AWS access
> and secret keys into the SparkContext, then you can access the S3 folders
> and files with their s3n:// paths. Something like this:
>
>
>
> sc = SparkContext()
>
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
>
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey",
> aws_secret_key)
>
>
>
> mydata = sc.textFile("s3n://mybucket/my_input_folder") \
>
> .map(lambda x: do_something(x)) \
>
> .saveAsTextFile("s3://mybucket/my_output_folder")
>
> ...
>
>
>
> You can read and write sequence files as well - these are the only 2
> formats I have tried, but I'm sure the other ones like JSON would work
> also. Another approach is to embed the AWS access key and secret key into
> the s3n:// path.
>
>
>
> I wasn't able to use the s3 protocol, but s3n is equivalent (I believe its
> an older version but not sure) but it works for access.
>
>
>
> Hope this helps,
>
> Sujit
>
>
>
>
>
> On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto <
> rpagli...@appcomsci.com> wrote:
>
> Is there an example about how to load data from a public S3 bucket in
> Python? I haven’t found any.
>
>
>
> Thank you,
>
>
>
>
>


Spark on EMR: out-of-the-box solution for real-time application logs monitoring?

2015-12-10 Thread Roberto Coluccio
Hello,

I'm investigating on a solution to real-time monitor Spark logs produced by
my EMR cluster in order to collect statistics and trigger alarms. Being on
EMR, I found the CloudWatch Logs + Lambda pretty straightforward and, since
I'm on AWS, those service are pretty well integrated together..but I could
just find examples about it using on standalone EC2 instances.

In my use case, EMR 3.9 and Spark 1.4.1 drivers running on YARN (cluster
mode), I would like to be able to real-time monitor Spark logs, so not just
about when the processing ends and they are copied to S3. Is there any
out-of-the-box solution or best-practice for accomplish this goal when
running on EMR that I'm not aware of?

Spark logs are written on the Data Nodes (Core Instances) local file
systems as YARN containers logs, so probably installing the awslogs agent
on them and pointing to those logfiles would help pushing such logs on
CloudWatch, but I was wondering how the community real-time monitors
application logs when running Spark on YARN on EMR.

Or maybe I'm looking at a wrong solution. Maybe the correct way would be
using something like a CloudwatchSink so to make Spark (log4j) pushing logs
directly to the sink and the sink pushing them to CloudWatch (I do like the
out-of-the-box EMR logging experience and I want to keep the usual eventual
logs archiving on S3 when the EMR cluster is terminated).

Any ideas or experience about this problem?

Thank you.

Roberto


Re: Spark on EMR: out-of-the-box solution for real-time application logs monitoring?

2015-12-10 Thread Steve Loughran

> On 10 Dec 2015, at 14:52, Roberto Coluccio  wrote:
> 
> Hello,
> 
> I'm investigating on a solution to real-time monitor Spark logs produced by 
> my EMR cluster in order to collect statistics and trigger alarms. Being on 
> EMR, I found the CloudWatch Logs + Lambda pretty straightforward and, since 
> I'm on AWS, those service are pretty well integrated together..but I could 
> just find examples about it using on standalone EC2 instances.
> 
> In my use case, EMR 3.9 and Spark 1.4.1 drivers running on YARN (cluster 
> mode), I would like to be able to real-time monitor Spark logs, so not just 
> about when the processing ends and they are copied to S3. Is there any 
> out-of-the-box solution or best-practice for accomplish this goal when 
> running on EMR that I'm not aware of?
> 
> Spark logs are written on the Data Nodes (Core Instances) local file systems 
> as YARN containers logs, so probably installing the awslogs agent on them and 
> pointing to those logfiles would help pushing such logs on CloudWatch, but I 
> was wondering how the community real-time monitors application logs when 
> running Spark on YARN on EMR.
> 
> Or maybe I'm looking at a wrong solution. Maybe the correct way would be 
> using something like a CloudwatchSink so to make Spark (log4j) pushing logs 
> directly to the sink and the sink pushing them to CloudWatch (I do like the 
> out-of-the-box EMR logging experience and I want to keep the usual eventual 
> logs archiving on S3 when the EMR cluster is terminated).
> 
> Any ideas or experience about this problem?
> 
> Thank you.
> 
> Roberto


are you talking about event logs as used by the history server, or application 
logs?

the current spark log server writes events to a file, but as the hadoop s3 fs 
client doesn't write except in close(), they won't be pushed out while thing 
are running. Someone (you?) could have a go at implementing a new event 
listener; some stuff that will come out in Spark 2.0 will make it easier to 
wire this up (SPARK-11314), which is coming as part of some work on spark-YARN 
timelineserver itnegration.

In Hadoop 2.7.1 The log4j logs can be regularly captured by the Yarn 
Nodemanagers and automatically copied out, look at 
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds . For that to 
work you need to set up your log wildcard patterns to for the NM to locate 
(i.e. have rolling logs with the right extensions)...the details escape me 
right now

In earlier versions, you can use "yarn logs' to grab them and pull them down.

I don't know anything about cloudwatch integration, sorry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark on EMR: out-of-the-box solution for real-time application logs monitoring?

2015-12-11 Thread Roberto Coluccio
Thanks for your advice, Steve.

I'm mainly talking about application logs. To be more clear, just for
instance think about the
"//hadoop/userlogs/application_blablabla/container_blablabla/stderr_or_stdout".
So YARN's applications containers logs, stored (at least for EMR's hadoop
2.4) on DataNodes and aggregated/pushed only once the application completes.

"yarn logs" issued from the cluster Master doesn't allow you to on-demand
aggregate logs for applications the are in running/active state.

For now I managed to install the awslogs agent (
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/CWL_GettingStarted.html)
on
DataNodes so to push containers logs in real-time to CloudWatch logs, but
that's kinda of a workaround too, this is why I was wondering what the
community (in general, not only on EMR) uses to real-time monitor
application logs (in an automated fashion) for long-running processes like
streaming driver and if are there out-of-the-box solutions.

Thanks,

Roberto





On Thu, Dec 10, 2015 at 3:06 PM, Steve Loughran 
wrote:

>
> > On 10 Dec 2015, at 14:52, Roberto Coluccio 
> wrote:
> >
> > Hello,
> >
> > I'm investigating on a solution to real-time monitor Spark logs produced
> by my EMR cluster in order to collect statistics and trigger alarms. Being
> on EMR, I found the CloudWatch Logs + Lambda pretty straightforward and,
> since I'm on AWS, those service are pretty well integrated together..but I
> could just find examples about it using on standalone EC2 instances.
> >
> > In my use case, EMR 3.9 and Spark 1.4.1 drivers running on YARN (cluster
> mode), I would like to be able to real-time monitor Spark logs, so not just
> about when the processing ends and they are copied to S3. Is there any
> out-of-the-box solution or best-practice for accomplish this goal when
> running on EMR that I'm not aware of?
> >
> > Spark logs are written on the Data Nodes (Core Instances) local file
> systems as YARN containers logs, so probably installing the awslogs agent
> on them and pointing to those logfiles would help pushing such logs on
> CloudWatch, but I was wondering how the community real-time monitors
> application logs when running Spark on YARN on EMR.
> >
> > Or maybe I'm looking at a wrong solution. Maybe the correct way would be
> using something like a CloudwatchSink so to make Spark (log4j) pushing logs
> directly to the sink and the sink pushing them to CloudWatch (I do like the
> out-of-the-box EMR logging experience and I want to keep the usual eventual
> logs archiving on S3 when the EMR cluster is terminated).
> >
> > Any ideas or experience about this problem?
> >
> > Thank you.
> >
> > Roberto
>
>
> are you talking about event logs as used by the history server, or
> application logs?
>
> the current spark log server writes events to a file, but as the hadoop s3
> fs client doesn't write except in close(), they won't be pushed out while
> thing are running. Someone (you?) could have a go at implementing a new
> event listener; some stuff that will come out in Spark 2.0 will make it
> easier to wire this up (SPARK-11314), which is coming as part of some work
> on spark-YARN timelineserver itnegration.
>
> In Hadoop 2.7.1 The log4j logs can be regularly captured by the Yarn
> Nodemanagers and automatically copied out, look at
> yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds . For
> that to work you need to set up your log wildcard patterns to for the NM to
> locate (i.e. have rolling logs with the right extensions)...the details
> escape me right now
>
> In earlier versions, you can use "yarn logs' to grab them and pull them
> down.
>
> I don't know anything about cloudwatch integration, sorry
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>