Re: Spark on EMR suddenly stalling

2018-01-01 Thread Jeroen Miller
Hello Mans,

On 1 Jan 2018, at 17:12, M Singh  wrote:
> I am not sure if I missed it, but can you let us know what your input
> source and output sink are?

Reading from S3 and writing to S3.

However, the never-ending task 0.0 occurs in a stage well before anything is
written to S3.

Regards,

Jeroen





Re: Spark on EMR suddenly stalling

2018-01-01 Thread Jeroen Miller
Hello Gourav,

On 30 Dec 2017, at 20:20, Gourav Sengupta  wrote:
> Please try to use the SPARK UI in the way that AWS EMR recommends; it
> should be available from the resource manager. I never ever had any problem
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF DEBUGGING.

For some reason, sometimes absolutely nothing shows up in the Spark UI, or
the UI is not refreshed: e.g. the UI reports the current stage as #x while
the logs show that stage #y (with y > x) is already under way.

It may very well be that the source of this problem lies between the keyboard
and the chair, but if that is the case, I do not know how to fix it.

> Also, I ALWAYS prefer the maximizeResourceAllocation setting in EMR to be
> set to true.

Thanks for the tip -- will try this setting in my next batch of experiments!
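
For reference, as far as I know maximizeResourceAllocation is applied at
cluster-creation time through an EMR configuration classification; the snippet
below is my sketch of that JSON based on the EMR documentation, not something
I have verified on this cluster yet:

    [
      {
        "Classification": "spark",
        "Properties": { "maximizeResourceAllocation": "true" }
      }
    ]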

JM





Re: Custom line/record delimiter

2018-01-01 Thread sk skk
Thanks for the update Kwon.

Regards,


On Mon, Jan 1, 2018 at 7:54 PM Hyukjin Kwon  wrote:

> Hi,
>
>
> There's a PR (https://github.com/apache/spark/pull/18581) and a JIRA
> (SPARK-21289) for this.
>
> Alternatively, you could check out the multiLine option for CSV and see if
> it is applicable.
>
>
> Thanks.
>
>
> 2017-12-30 2:19 GMT+09:00 sk skk :
>
>> Hi,
>>
>> Do we have an option to write a CSV or text file with a custom
>> record/line separator through Spark?
>>
>> I could not find any reference in the API. I have an issue while loading
>> data into a warehouse, as one of the columns in the CSV contains a newline
>> character and the warehouse does not let me escape it.
>>
>> Thank you ,
>> Sk
>>
>
>


Re: Custom line/record delimiter

2018-01-01 Thread Hyukjin Kwon
Hi,


There's a PR (https://github.com/apache/spark/pull/18581) and a JIRA
(SPARK-21289) for this.

Alternatively, you could check out the multiLine option for CSV and see if
it is applicable.
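
A minimal sketch of that option, assuming an existing SparkSession named
spark and a hypothetical input path; it lets the CSV reader parse records
whose quoted fields span several lines:

    // Read a CSV whose quoted fields may contain embedded newlines.
    val df = spark.read
      .option("multiLine", "true")      // accept multi-line quoted fields
      .option("header", "true")
      .csv("s3://my-bucket/input.csv")  // hypothetical path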


Thanks.


2017-12-30 2:19 GMT+09:00 sk skk :

> Hi,
>
> Do we have an option to write a CSV or text file with a custom record/line
> separator through Spark?
>
> I could not find any reference in the API. I have an issue while loading
> data into a warehouse, as one of the columns in the CSV contains a newline
> character and the warehouse does not let me escape it.
>
> Thank you ,
> Sk
>


mesos cluster dispatcher

2018-01-01 Thread puneetloya
Hi,

I would like an opinion on using the *mesos cluster dispatcher*.
It worked for me on a two-machine Vagrant setup (i.e. a Mesos master and a slave).
Is it better to start the Spark driver using Marathon instead of the dispatcher?
The --supervise option can become a pain, as you cannot stop the driver.
Please share your experience if you use the dispatcher in production.

P.S.: I did see other discussions on this topic, but they are slightly older.

Thanks







Re: Spark on EMR suddenly stalling

2018-01-01 Thread M Singh
Hi Jeroen:
I am not sure if I missed it, but can you let us know what your input
source and output sink are?

In some cases, I found that saving to S3 was a problem. In those cases I
started saving the output to the EMR HDFS and later copied it to S3 using
s3-dist-cp, which solved our issue.
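
A minimal sketch of that workaround, with hypothetical paths and a
hypothetical DataFrame df (the output format is also just an example):

    // Write the job output to the cluster's HDFS first, keeping the slow
    // S3 writes out of the Spark job itself.
    df.write.parquet("hdfs:///tmp/job-output")

    // Then copy the result to S3 outside Spark with the EMR-provided tool:
    //   s3-dist-cp --src hdfs:///tmp/job-output --dest s3://my-bucket/job-output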

Mans 

On Monday, January 1, 2018 7:41 AM, Rohit Karlupia  
wrote:
 

Here is the list that I would probably try to work through:

   1. Check GC on the offending executor when the task is running. Maybe you
      need even more memory.
   2. Go back to some previous successful run of the job and check the Spark
      UI for the offending stage: look at max task time/max input/max shuffle
      in/out for the largest task. This will help you understand the degree of
      skew in this stage.
   3. Take a thread dump of the executor from the Spark UI and verify whether
      the task is really doing any work or is stuck in some deadlock. Some of
      the Hive SerDes are not really usable from multi-threaded/multi-use
      Spark executors.
   4. Take a thread dump of the executor from the Spark UI and verify whether
      the task is spilling to disk. Playing with the storage and memory
      fractions or generally increasing the memory will help.
   5. Check the disk utilisation on the machine running the executor.
   6. Look for event-loss messages in the logs due to the event queue being
      full. Loss of events can send some of the Spark components into really
      bad states.

thanks,
rohitk


On Sun, Dec 31, 2017 at 12:50 AM, Gourav Sengupta  
wrote:

Hi,

Please try to use the SPARK UI in the way that AWS EMR recommends; it should
be available from the resource manager. I never ever had any problem working
with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF DEBUGGING.

Sadly, I cannot be of much help unless we go for a screen-share session over
Google Chat or Skype.

Also, I ALWAYS prefer the maximizeResourceAllocation setting in EMR to be set
to true.

Besides that, there is a metric in the EMR console which graphs the number of
containers generated by your job.

Regards,
Gourav Sengupta
On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller  wrote:

Hello,

Just a quick update, as I have not made much progress yet.

On 28 Dec 2017, at 21:09, Gourav Sengupta  wrote:
> can you then try to use EMR version 5.10 or EMR version 5.11 instead?

Same issue with EMR 5.11.0. Task 0 in one stage never finishes.

> can you please try selecting a subnet which is in a different availability 
> zone?

I have not tried this yet. But why should that make a difference?

> if possible just try to increase the number of task instances and see the 
> difference?

I tried with 512 partitions -- no difference.

> also in case you are using caching,

No caching used.

> Also can you please report the number of containers that your job is creating 
> by looking at the metrics in the EMR console?

8 containers if I trust the directories in j-xxx/containers/application_xxx/.

> Also if you see the spark UI then you can easily see which particular step is 
> taking the longest period of time - you just have to drill in a bit in order 
> to see that. Generally in case shuffling is an issue then it definitely 
> appears in the SPARK UI as I drill into the steps and see which particular 
> one is taking the longest.

I always have issues with the Spark UI on EC2 -- it never seems to be up to 
date.

JM







   

Re: Spark on EMR suddenly stalling

2018-01-01 Thread Rohit Karlupia
Here is the list that I would probably try to work through:

   1. Check GC on the offending executor when the task is running. Maybe
   you need even more memory.
   2. Go back to some previous successful run of the job and check the
   Spark UI for the offending stage: look at max task time/max input/max
   shuffle in/out for the largest task. This will help you understand the
   degree of skew in this stage.
   3. Take a thread dump of the executor from the Spark UI and verify
   whether the task is really doing any work or is stuck in some deadlock.
   Some of the Hive SerDes are not really usable from multi-threaded/multi-use
   Spark executors.
   4. Take a thread dump of the executor from the Spark UI and verify
   whether the task is spilling to disk. Playing with the storage and memory
   fractions or generally increasing the memory will help (see the sketch
   after this list).
   5. Check the disk utilisation on the machine running the executor.
   6. Look for event-loss messages in the logs due to the event queue being
   full. Loss of events can send some of the Spark components into really
   bad states.
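
A small sketch of the knobs items 4 and 6 refer to, with illustrative values
that are assumptions rather than recommendations (note the event-queue key
was renamed from ...eventqueue.size to ...eventqueue.capacity around Spark 2.3):

    import org.apache.spark.sql.SparkSession

    // Item 4: nudge the unified-memory split; item 6: enlarge the listener
    // event queue so scheduler events are not dropped when it fills up.
    val spark = SparkSession.builder()
      .appName("stall-debugging")                      // hypothetical app name
      .config("spark.memory.fraction", "0.7")          // execution + storage pool
      .config("spark.memory.storageFraction", "0.4")   // share protected for storage
      .config("spark.scheduler.listenerbus.eventqueue.size", "20000") // Spark <= 2.2 key
      .getOrCreate()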


thanks,
rohitk



On Sun, Dec 31, 2017 at 12:50 AM, Gourav Sengupta  wrote:

> Hi,
>
> Please try to use the SPARK UI in the way that AWS EMR recommends; it
> should be available from the resource manager. I never ever had any problem
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF
> DEBUGGING.
>
> Sadly, I cannot be of much help unless we go for a screen-share session
> over Google Chat or Skype.
>
> Also, I ALWAYS prefer the maximizeResourceAllocation setting in EMR to be
> set to true.
>
> Besides that, there is a metric in the EMR console which graphs the number
> of containers generated by your job.
>
>
>
> Regards,
> Gourav Sengupta
>
> On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller 
> wrote:
>
>> Hello,
>>
>> Just a quick update, as I have not made much progress yet.
>>
>> On 28 Dec 2017, at 21:09, Gourav Sengupta 
>> wrote:
>> > can you then try to use EMR version 5.10 or EMR version 5.11 instead?
>>
>> Same issue with EMR 5.11.0. Task 0 in one stage never finishes.
>>
>> > can you please try selecting a subnet which is in a different
>> availability zone?
>>
>> I have not tried this yet. But why should that make a difference?
>>
>> > if possible just try to increase the number of task instances and see
>> the difference?
>>
>> I tried with 512 partitions -- no difference.
>>
>> > also in case you are using caching,
>>
>> No caching used.
>>
>> > Also can you please report the number of containers that your job is
>> creating by looking at the metrics in the EMR console?
>>
>> 8 containers if I trust the directories in j-xxx/containers/application_xxx/.
>>
>> > Also if you see the spark UI then you can easily see which particular
>> step is taking the longest period of time - you just have to drill in a bit
>> in order to see that. Generally in case shuffling is an issue then it
>> definitely appears in the SPARK UI as I drill into the steps and see which
>> particular one is taking the longest.
>>
>> I always have issues with the Spark UI on EC2 -- it never seems to be up
>> to date.
>>
>> JM
>>
>>
>