Re: How to clean up logs-dirs and local-dirs of running spark streaming in yarn cluster mode

2018-12-25 Thread Fawze Abujaber
http://shzhangji.com/blog/2015/05/31/spark-streaming-logging-configuration/
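
The idea there, roughly, is to ship your own log4j.properties with a rolling appender so the per-container logs stop growing unbounded while the streaming app keeps running. A minimal sketch (file names, sizes and counts are placeholders, adjust to your setup):

spark-submit --master yarn --deploy-mode cluster \
  --files /path/to/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  ...

# log4j.properties (rolls the container log instead of letting it grow)
log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log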

On Wed, Dec 26, 2018 at 1:05 AM shyla deshpande 
wrote:

> Please point me to any documentation if available. Thanks
>
> On Tue, Dec 18, 2018 at 11:10 AM shyla deshpande 
> wrote:
>
>> Is there a way to do this without stopping the streaming application in
>> yarn cluster mode?
>>
>> On Mon, Dec 17, 2018 at 4:42 PM shyla deshpande 
>> wrote:
>>
>>> I get the ERROR
>>> 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad:
>>> /var/log/hadoop-yarn/containers
>>>
>>> Is there a way to clean up these directories while the spark streaming
>>> application is running?
>>>
>>> Thanks
>>>
>>

-- 
Take Care
Fawze Abujaber


Re: Unable to see completed application in Spark 2 history web UI

2018-08-17 Thread Fawze Abujaber
Thanks, Manu, for your response.

I already checked the logs and didn't see anything that helps me understand
the issue.

The stranger thing is that I have a small CI cluster running on a single
NameNode, and there I do see the Spark2 jobs in the UI. I'm still not sure
whether this is related to NameNode HA. I tried replacing the log dir in the
spark2 default conf, pointing it from the HA NameNode to the active NameNode
like this http://server:8020/user/spark/spark2historyapplication, but the UI
still shows the path with the HA NameNode even after a restart of Spark2.

The issue is getting more interesting :)

On Fri, Aug 17, 2018 at 2:01 AM Manu Zhang  wrote:

> Hi Fawze,
>
> Sorry but I'm not familiar with CM. Maybe you can look into the logs (or
> turn on DEBUG log).
>
> On Thu, Aug 16, 2018 at 3:05 PM Fawze Abujaber  wrote:
>
>> Hi Manu,
>>
>> I'm using cloudera manager with single user mode and every process is
>> running with cloudera-scm user, the cloudera-scm is a super user and this
>> is why i was confused how it worked in spark 1.6 and not in spark 2.3
>>
>>
>> On Thu, Aug 16, 2018 at 5:34 AM Manu Zhang 
>> wrote:
>>
>>> If you are able to log onto the node where UI has been launched, then
>>> try `ps -aux | grep HistoryServer` and the first column of output should be
>>> the user.
>>>
>>> On Wed, Aug 15, 2018 at 10:26 PM Fawze Abujaber 
>>> wrote:
>>>
>>>> Thanks Manu, Do you know how i can see which user the UI is running,
>>>> because i'm using cloudera manager and i created a user for cloudera
>>>> manager and called it spark but this didn't solve me issue and here i'm
>>>> trying to find out the user for the spark hisotry UI.
>>>>
>>>> On Wed, Aug 15, 2018 at 5:11 PM Manu Zhang 
>>>> wrote:
>>>>
>>>>> Hi Fawze,
>>>>>
>>>>> A) The file permission is currently hard coded to 770 (
>>>>> https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L287
>>>>> ).
>>>>> B) I think add all users (including UI) to the group like Spark will
>>>>> do.
>>>>>
>>>>>
>>>>> On Wed, Aug 15, 2018 at 6:38 PM Fawze Abujaber 
>>>>> wrote:
>>>>>
>>>>>> Hi Manu,
>>>>>>
>>>>>> Thanks for your response.
>>>>>>
>>>>>> Yes, i see but still interesting to know how i can see these
>>>>>> applications from the spark history UI.
>>>>>>
>>>>>> How i can know with which user i'm  logged in when i'm navigating the
>>>>>> spark history UI.
>>>>>>
>>>>>> The Spark process is running with cloudera-scm and the events written
>>>>>> in the spark2history folder at the HDFS written with the user name who is
>>>>>> running the application and group spark (770 permissions).
>>>>>>
>>>>>> I'm interesting to see if i can force these logs to be written with
>>>>>> 774 or 775 permission or finding another solutions that enable Rnd or
>>>>>> anyone to be able to investigate his application logs using the UI.
>>>>>>
>>>>>> for example : can i use such spark conf :
>>>>>> spark.eventLog.permissions=755
>>>>>>
>>>>>> The 2 options i see here:
>>>>>>
>>>>>> A) find a way to enforce these logs to be written with other
>>>>>> permissions.
>>>>>>
>>>>>> B) Find the user that the UI running with as creating LDAP groups and
>>>>>> user that can handle this.
>>>>>>
>>>>>> for example creating group called Spark and create the user that the
>>>>>> UI running with and add this user to the spark group.
>>>>>> not sure if this option will work as i don't know if these steps
>>>>>> authenticate against the LDAP.
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Take Care
>>>> Fawze Abujaber
>>>>
>>>
>>
>> --
>> Take Care
>> Fawze Abujaber
>>
>

-- 
Take Care
Fawze Abujaber


Re: Unable to see completed application in Spark 2 history web UI

2018-08-16 Thread Fawze Abujaber
Hi Manu,

I'm using Cloudera Manager in single user mode, and every process runs as the
cloudera-scm user. cloudera-scm is a superuser, which is why I was confused
that this worked in Spark 1.6 but not in Spark 2.3.


On Thu, Aug 16, 2018 at 5:34 AM Manu Zhang  wrote:

> If you are able to log onto the node where UI has been launched, then try
> `ps -aux | grep HistoryServer` and the first column of output should be the
> user.
>
> On Wed, Aug 15, 2018 at 10:26 PM Fawze Abujaber  wrote:
>
>> Thanks Manu, Do you know how i can see which user the UI is running,
>> because i'm using cloudera manager and i created a user for cloudera
>> manager and called it spark but this didn't solve me issue and here i'm
>> trying to find out the user for the spark hisotry UI.
>>
>> On Wed, Aug 15, 2018 at 5:11 PM Manu Zhang 
>> wrote:
>>
>>> Hi Fawze,
>>>
>>> A) The file permission is currently hard coded to 770 (
>>> https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L287
>>> ).
>>> B) I think add all users (including UI) to the group like Spark will do.
>>>
>>>
>>> On Wed, Aug 15, 2018 at 6:38 PM Fawze Abujaber 
>>> wrote:
>>>
>>>> Hi Manu,
>>>>
>>>> Thanks for your response.
>>>>
>>>> Yes, i see but still interesting to know how i can see these
>>>> applications from the spark history UI.
>>>>
>>>> How i can know with which user i'm  logged in when i'm navigating the
>>>> spark history UI.
>>>>
>>>> The Spark process is running with cloudera-scm and the events written
>>>> in the spark2history folder at the HDFS written with the user name who is
>>>> running the application and group spark (770 permissions).
>>>>
>>>> I'm interesting to see if i can force these logs to be written with 774
>>>> or 775 permission or finding another solutions that enable Rnd or anyone to
>>>> be able to investigate his application logs using the UI.
>>>>
>>>> for example : can i use such spark conf :
>>>> spark.eventLog.permissions=755
>>>>
>>>> The 2 options i see here:
>>>>
>>>> A) find a way to enforce these logs to be written with other
>>>> permissions.
>>>>
>>>> B) Find the user that the UI running with as creating LDAP groups and
>>>> user that can handle this.
>>>>
>>>> for example creating group called Spark and create the user that the UI
>>>> running with and add this user to the spark group.
>>>> not sure if this option will work as i don't know if these steps
>>>> authenticate against the LDAP.
>>>>
>>>
>>
>> --
>> Take Care
>> Fawze Abujaber
>>
>

-- 
Take Care
Fawze Abujaber


Re: Unable to see completed application in Spark 2 history web UI

2018-08-15 Thread Fawze Abujaber
Thanks, Manu. Do you know how I can see which user the UI is running as?
I'm using Cloudera Manager, and I created a user for Cloudera Manager called
spark, but this didn't solve my issue, so now I'm trying to find out which
user the Spark history UI runs as.

On Wed, Aug 15, 2018 at 5:11 PM Manu Zhang  wrote:

> Hi Fawze,
>
> A) The file permission is currently hard coded to 770 (
> https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L287
> ).
> B) I think add all users (including UI) to the group like Spark will do.
>
>
> On Wed, Aug 15, 2018 at 6:38 PM Fawze Abujaber  wrote:
>
>> Hi Manu,
>>
>> Thanks for your response.
>>
>> Yes, i see but still interesting to know how i can see these applications
>> from the spark history UI.
>>
>> How i can know with which user i'm  logged in when i'm navigating the
>> spark history UI.
>>
>> The Spark process is running with cloudera-scm and the events written in
>> the spark2history folder at the HDFS written with the user name who is
>> running the application and group spark (770 permissions).
>>
>> I'm interesting to see if i can force these logs to be written with 774
>> or 775 permission or finding another solutions that enable Rnd or anyone to
>> be able to investigate his application logs using the UI.
>>
>> for example : can i use such spark conf : spark.eventLog.permissions=755
>>
>> The 2 options i see here:
>>
>> A) find a way to enforce these logs to be written with other permissions.
>>
>> B) Find the user that the UI running with as creating LDAP groups and
>> user that can handle this.
>>
>> for example creating group called Spark and create the user that the UI
>> running with and add this user to the spark group.
>> not sure if this option will work as i don't know if these steps
>> authenticate against the LDAP.
>>
>

-- 
Take Care
Fawze Abujaber


Re: Unable to see completed application in Spark 2 history web UI

2018-08-15 Thread Fawze Abujaber
Hi Manu,

Thanks for your response.

Yes, I see, but I'm still interested in knowing how I can see these
applications from the Spark history UI.

How can I know which user I'm logged in as when I'm navigating the Spark
history UI?

The Spark process runs as cloudera-scm, and the event logs written to the
spark2history folder on HDFS are owned by the user who ran the application,
with group spark (770 permissions).

I'm interested to see whether I can force these logs to be written with 774 or
775 permissions, or find another solution that lets R&D or anyone else
investigate their application logs through the UI.

For example, can I use a Spark conf such as: spark.eventLog.permissions=755

The two options I see here:

A) Find a way to force these logs to be written with other permissions.

B) Find the user that the UI runs as, and handle this with LDAP groups and
users.

For example, create a group called spark, create the user that the UI runs
as, and add this user to the spark group.
I'm not sure this option will work, as I don't know whether these steps
authenticate against LDAP.
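
A quick way to check option B (a sketch; the paths are the ones from my
cluster, adjust as needed):

# which OS user the Spark2 History Server process runs as
ps -eo user,args | grep [H]istoryServer

# owner, group and permissions of the event logs it needs to read
hdfs dfs -ls /user/spark/spark2ApplicationHistory | tail -5

If the logs stay at 770 with group spark, the History Server user would need
to be a member of the spark group (locally or via an LDAP group mapping) to
read them.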


Unable to see completed application in Spark 2 history web UI

2018-08-07 Thread Fawze Abujaber
Hello Community,

I'm using Spark 2.3 and Spark 1.6.0 in my cluster with Cloudera
distribution 5.13.0.

Both are configured to run on YARN, but I'm unable to see completed
applications in the Spark2 history server, while in Spark 1.6.0 I can.

1) I checked the HDFS permissions for both folders and both have the same
permissions.

drwxrwxrwt   - cloudera-scm spark  0 2018-08-08 00:46
/user/spark/applicationHistory
drwxrwxrwt   - cloudera-scm spark  0 2018-08-08 00:46
/user/spark/spark2ApplicationHistory

The application files themselves have 770 permissions in both folders.

-rwxrwx---   3  fawzea spark 4743751 2018-08-07 23:32
/user/spark/spark2ApplicationHistory/application_1527404701551_672816_1
-rwxrwx---   3  fawzea spark   134315 2018-08-08 00:41
/user/spark/applicationHistory/application_1527404701551_673359_1

2) No error in the Spark2 history server log.

3) I compared the configurations between Spark 1.6 and Spark 2.3 (system
user, event log settings, etc.) and they all look the same.

4) Once I changed the permissions of the above Spark2 application logs to
777, I was able to see the application in the Spark2 history server UI.

I tried to figure out whether the two Spark UIs run as different users, but
was unable to determine it.

Has anyone run into this issue and solved it?

Thanks in advance.


-- 
Take Care
Fawze Abujaber


Re: native-lzo library not available

2018-05-03 Thread Fawze Abujaber
Hi Yulia,

Thanks for your response.

I see an LZO .so only for Impala:

 [root@xxx ~]# locate *lzo*.so*
/opt/cloudera/parcels/GPLEXTRAS-5.13.0-1.cdh5.13.0.p0.29/lib/impala/lib/libimpalalzo.so
/usr/lib64/liblzo2.so.2
/usr/lib64/liblzo2.so.2.0.0

the 
/opt/cloudera/parcels/GPLEXTRAS-5.13.0-1.cdh5.13.0.p0.29/lib/hadoop/lib/native
has :

-rwxr-xr-x 1 cloudera-scm cloudera-scm 22918 Oct  4  2017
libgplcompression.a
-rwxr-xr-x 1 cloudera-scm cloudera-scm  1204 Oct  4  2017
libgplcompression.la
-rwxr-xr-x 1 cloudera-scm cloudera-scm  1205 Oct  4  2017
libgplcompression.lai
-rwxr-xr-x 1 cloudera-scm cloudera-scm 15760 Oct  4  2017
libgplcompression.so
-rwxr-xr-x 1 cloudera-scm cloudera-scm 15768 Oct  4  2017
libgplcompression.so.0
-rwxr-xr-x 1 cloudera-scm cloudera-scm 15768 Oct  4  2017
libgplcompression.so.0.0.0


and 
/opt/cloudera/parcels/GPLEXTRAS-5.13.0-1.cdh5.13.0.p0.29/lib/spark-netlib/lib
has:

-rw-r--r-- 1 cloudera-scm cloudera-scm8673 Oct  4  2017
jniloader-1.1.jar
-rw-r--r-- 1 cloudera-scm cloudera-scm   53249 Oct  4  2017
native_ref-java-1.1.jar
-rw-r--r-- 1 cloudera-scm cloudera-scm   53295 Oct  4  2017
native_system-java-1.1.jar
-rw-r--r-- 1 cloudera-scm cloudera-scm 1732268 Oct  4  2017
netlib-native_ref-linux-x86_64-1.1-natives.jar
-rw-r--r-- 1 cloudera-scm cloudera-scm  446694 Oct  4  2017
netlib-native_system-linux-x86_64-1.1-natives.jar


Note: The issue occurs only with the Spark job; the MapReduce job works
fine.
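
Following your pointer about java.library.path, one thing I can try (a
sketch, using the parcel paths listed above; the exact location of
hadoop-lzo.jar may differ on your install) is pointing the Spark driver and
executors explicitly at that native directory and the hadoop-lzo jar:

spark-submit \
  --conf spark.driver.extraLibraryPath=/opt/cloudera/parcels/GPLEXTRAS-5.13.0-1.cdh5.13.0.p0.29/lib/hadoop/lib/native \
  --conf spark.executor.extraLibraryPath=/opt/cloudera/parcels/GPLEXTRAS-5.13.0-1.cdh5.13.0.p0.29/lib/hadoop/lib/native \
  --conf spark.driver.extraClassPath=/opt/cloudera/parcels/GPLEXTRAS-5.13.0-1.cdh5.13.0.p0.29/lib/hadoop/lib/hadoop-lzo.jar \
  --conf spark.executor.extraClassPath=/opt/cloudera/parcels/GPLEXTRAS-5.13.0-1.cdh5.13.0.p0.29/lib/hadoop/lib/hadoop-lzo.jar \
  ...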

On Thu, May 3, 2018 at 9:17 PM, yuliya Feldman <yufeld...@yahoo.com> wrote:

> Jar is not enough, you need native library (*.so) - see if your "native"
> directory contains it
>
> drwxr-xr-x 2 cloudera-scm cloudera-scm  4096 Oct  4  2017 native
>
> and whether  java.library.path or LD_LIBRARY_PATH points/includes
> directory where your *.so library resides
>
> On Thursday, May 3, 2018, 5:06:35 AM PDT, Fawze Abujaber <
> fawz...@gmail.com> wrote:
>
>
> Hi Guys,
>
> I'm running into issue where my spark jobs are failing on the below error,
> I'm using Spark 1.6.0 with CDH 5.13.0.
>
> I tried to figure it out with no success.
>
> Will appreciate any help or a direction how to attack this issue.
>
> User class threw exception: org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent
> failure: Lost task 0.3 in stage 1.0 (TID 3, xx, executor 1): 
> *java.lang.RuntimeException:
> native-lzo library not available*
> *at
> com.hadoop.compression.lzo.LzoCodec.getDecompressorType(LzoCodec.java:193)*
> at org.apache.hadoop.io.compress.CodecPool.getDecompressor(
> CodecPool.java:181)
> at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1995)
> at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:
> 1881)
> at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1830)
> at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1844)
> at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.
> initialize(SequenceFileRecordReader.java:54)
> at com.liveperson.dallas.lp.utils.incremental.
> DallasGenericTextFileRecordReader.initialize(
> DallasGenericTextFileRecordReader.java:64)
> at com.liveperson.hadoop.fs.inputs.LPCombineFileRecordReaderWrapp
> er.initialize(LPCombineFileRecordReaderWrapper.java:38)
> at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.
> initialize(CombineFileRecordReader.java:63)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(
> NewHadoopRDD.scala:168)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:133)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
> at java.util.co

native-lzo library not available

2018-05-03 Thread Fawze Abujaber
Hi Guys,

I'm running into an issue where my Spark jobs are failing with the error
below. I'm using Spark 1.6.0 with CDH 5.13.0.

I tried to figure it out with no success.

I would appreciate any help or a direction on how to attack this issue.

User class threw exception: org.apache.spark.SparkException: Job aborted
due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent
failure: Lost task 0.3 in stage 1.0 (TID 3, xx, executor 1):
*java.lang.RuntimeException:
native-lzo library not available*
*at
com.hadoop.compression.lzo.LzoCodec.getDecompressorType(LzoCodec.java:193)*
at
org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:181)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1995)
at
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1881)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1830)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1844)
at
org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
at
com.liveperson.dallas.lp.utils.incremental.DallasGenericTextFileRecordReader.initialize(DallasGenericTextFileRecordReader.java:64)
at
com.liveperson.hadoop.fs.inputs.LPCombineFileRecordReaderWrapper.initialize(LPCombineFileRecordReaderWrapper.java:38)
at
org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initialize(CombineFileRecordReader.java:63)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:168)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:



I see the LZO jar under the GPLEXTRAS parcel:

ll
total 104
-rw-r--r-- 1 cloudera-scm cloudera-scm 35308 Oct  4  2017 COPYING.hadoop-lzo
-rw-r--r-- 1 cloudera-scm cloudera-scm 62268 Oct  4  2017
hadoop-lzo-0.4.15-cdh5.13.0.jar
lrwxrwxrwx 1 cloudera-scm cloudera-scm31 May  3 07:23 hadoop-lzo.jar ->
hadoop-lzo-0.4.15-cdh5.13.0.jar
drwxr-xr-x 2 cloudera-scm cloudera-scm  4096 Oct  4  2017 native




-- 
Take Care
Fawze Abujaber


Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-26 Thread Fawze Abujaber
Thanks for the update.

What about cores per executor?

On Tue, 27 Mar 2018 at 6:45 Rohit Karlupia <roh...@qubole.com> wrote:

> Thanks Fawze!
>
> On the memory front, I am currently working on GC and CPU aware task
> scheduling. I see wonderful results based on my tests so far.  Once the
> feature is complete and available, spark will work with whatever memory is
> provided (at least enough for the largest possible task). It will also
> allow you to run say 64 concurrent tasks on 8 core machine, if the nature
> of tasks doesn't leads to memory or CPU contention. Essentially why worry
> about tuning memory when you can let spark take care of it automatically
> based on memory pressure. Will post details when we are ready.  So yes we
> are working on memory, but it will not be a tool but a transparent feature.
>
> thanks,
> rohitk
>
>
>
>
> On Tue, Mar 27, 2018 at 7:53 AM, Fawze Abujaber <fawz...@gmail.com> wrote:
>
>> Hi Rohit,
>>
>> I would like to thank you for the unlimited patience and support that you
>> are providing here and behind the scene for all of us.
>>
>> The tool is amazing and easy to use and understand most of the metrics ...
>>
>> Thinking if we need to run it in cluster mode and all the time, i think
>> we can skip it as one or few runs can give you the large picture of how the
>> job is running with different configuration and it's not too much
>> complicated to run it using spark-submit.
>>
>> I think it will be so helpful if the sparklens can also include how the
>> job is running with different configuration of cores and memory, Spark job
>> with 1 exec and 1 core will run different from spark job with 1  exec and 3
>> cores and for sure the same compare with different exec memory.
>>
>> Overall, it is so good starting point, but it will be a GAME CHANGER
>> getting these metrics on the tool.
>>
>> @Rohit , Huge THANY YOU
>>
>> On Mon, Mar 26, 2018 at 1:35 PM, Rohit Karlupia <roh...@qubole.com>
>> wrote:
>>
>>> Hi Shmuel,
>>>
>>> In general it is hard to pin point to exact code which is responsible
>>> for a specific stage. For example when using spark sql, depending upon the
>>> kind of joins, aggregations used in the the single line of query, we will
>>> have multiple stages in the spark application. I usually try to split the
>>> code into smaller chunks and also use the spark UI which has special
>>> section for SQL. It can also show specific backtraces, but as I explained
>>> earlier they might not be very helpful. Sparklens does help you ask the
>>> right questions, but is not mature enough to answer all of them.
>>>
>>> Understanding the report:
>>>
>>> *1) The first part of total aggregate metrics for the application.*
>>>
>>> Printing application meterics.
>>>
>>>  AggregateMetrics (Application Metrics) total measurements 1869
>>> NAMESUMMIN  
>>>  MAXMEAN
>>>  diskBytesSpilled0.0 KB 0.0 KB 
>>> 0.0 KB  0.0 KB
>>>  executorRuntime15.1 hh 3.0 ms 
>>> 4.0 mm 29.1 ss
>>>  inputBytesRead 26.1 GB 0.0 KB
>>> 43.8 MB 14.3 MB
>>>  jvmGCTime  11.0 mm 0.0 ms 
>>> 2.1 ss354.0 ms
>>>  memoryBytesSpilled314.2 GB 0.0 KB 
>>> 1.1 GB172.1 MB
>>>  outputBytesWritten  0.0 KB 0.0 KB 
>>> 0.0 KB  0.0 KB
>>>  peakExecutionMemory 0.0 KB 0.0 KB 
>>> 0.0 KB  0.0 KB
>>>  resultSize 12.9 MB 2.0 KB
>>> 40.9 KB  7.1 KB
>>>  shuffleReadBytesRead  107.7 GB 0.0 KB   
>>> 276.0 MB 59.0 MB
>>>  shuffleReadFetchWaitTime2.0 ms 0.0 ms 
>>> 0.0 ms  0.0 ms
>>>  shuffleReadLocalBlocks   2,318  0  
>>>68   1
>>>  shuffleReadRecordsRead   3,413,511,099  0  
>>> 8,251,926   1,826,383
>>>  shuffleReadRemoteBlocks291,126  0  
>>>   824   

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-26 Thread Fawze Abujaber
  14.9   85.10.0 KB
>   15   34.00   48.02   2000.00.0 KB0.0 KB   53.2 GB
> 0.0 KB04m 03s   00m 34s50h 42m   14.3   85.70.0 KB
>
>
>  Stage-ID WallClock  OneCore   Task   PRatio-Task--   OIRatio 
>  |* ShuffleWrite% ReadFetch%   GC%  *|
>   Stage% ComputeHours  CountSkew   StageSkew
>   00.32 00h 00m   10.00 1.00 0.37 0.00
>  |*   0.00   0.0015.10  *|
>   10.35 00h 00m   10.00 1.00 0.38 0.00
>  |*   0.00   0.0015.56  *|
>   20.43 00h 00m   10.00 1.00 0.45 0.00
>  |*   0.00   0.00 8.88  *|
>   30.70 00h 00m   10.00 1.00 0.63 0.17
>  |*   4.51   0.00 6.74  *|
>   40.31 00h 00m 2000.2737.67 0.10 0.00
>  |*   0.00   0.0423.79  *|
>   56.38 00h 10m  230.03 1.42 0.83 3.18
>  |*   1.08   0.00 2.72  *|
>   6   17.68 03h 00m 6010.80 2.07 0.29 0.10
>  |*   0.60   0.00 1.90  *|
>   96.58 00h 06m 2000.27 5.20 0.16 0.38
>  |*   4.74  13.24 4.04  *|
>  10   13.40 00h 20m  390.05 1.67 0.52 3.17
>  |*   1.10   0.00 1.96  *|
>  110.07 00h 00m   20.00 1.00 0.58 1.91
>  |*  13.59   0.00 0.00  *|
>  121.77 00h 19m 2000.27 1.99 0.92 0.50
>  |*   1.85  19.63 3.09  *|
>  133.57 00h 53m 2000.27 1.59 1.0031.42
>  |*   6.06  12.25 1.33  *|
>  14   13.74 02h 59m 2000.27 1.65 0.89 1.00
>  |*   1.84   2.38 0.83  *|
>  15   34.69 07h 15m 2000.27 1.88 0.98 0.00
>  |*   0.00   4.21 0.88  *|
>
> PRatio:Number of tasks in stage divided by number of cores. 
> Represents degree of
>parallelism in the stage
> TaskSkew:  Duration of largest task in stage divided by duration of 
> median task.
>Represents degree of skew in the stage
> TaskStageSkew: Duration of largest task in stage divided by total duration of 
> the stage.
>Represents the impact of the largest task on stage time.
> OIRatio:   Output to input ration. Total output of the stage (results + 
> shuffle write)
>divided by total input (input data + shuffle read)
>
> These metrics below represent distribution of time within the stage
>
> ShuffleWrite:  Amount of time spent in shuffle writes across all tasks in the 
> given
>stage as a percentage
> ReadFetch: Amount of time spent in shuffle read across all tasks in the 
> given
>stage as a percentage
> GC:Amount of time spent in GC across all tasks in the given stage 
> as a
>percentage
>
> If the stage contributes large percentage to overall application time, we 
> could look into
> these metrics to check which part (Shuffle write, read fetch or GC is 
> responsible)
>
> thanks,
>
> rohitk
>
>
>
> On Mon, Mar 26, 2018 at 1:38 AM, Shmuel Blitz <shmuel.bl...@similarweb.com
> > wrote:
>
>> Hi Rohit,
>>
>> Thanks for the analysis.
>>
>> I can use repartition on the slow task. But how can I tell what part of
>> the code is in charge of the slow tasks?
>>
>> It would be great if you could further explain the rest of the output.
>>
>> Thanks in advance,
>> Shmuel
>>
>> On Sun, Mar 25, 2018 at 12:46 PM, Rohit Karlupia <roh...@qubole.com>
>> wrote:
>>
>>> Thanks Shamuel for trying out sparklens!
>>>
>>> Couple of things that I noticed:
>>> 1) 250 executors is probably overkill for this job. It would run in same
>>> time with around 100.
>>> 2) Many of stages that take long time have only 200 tasks where as we
>>> have 750 cores available for the job. 200 is the default value for
>>> spark.sql.shuffle.partitions.  Alternatively you could try increasing
>>> the value of spark.sql.shuffle.partitions to latest 750.
>>>
>>> thanks,
>>> rohitk
>>>
>>> On Sun, Mar 25, 2018 at 1:25 PM, Shmuel Blitz <
>>> shmuel.bl...@similarweb.com> wrote:
>>>
>>>> I ran it on a single job.
>>>> SparkLens has an overhead on the jo

Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
Hi Marcelo,

Weird. I just ran spark-shell and its log is compressed, but my Spark jobs
that are scheduled using Oozie are not getting compressed.
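
For reference, this is how I'm checking it (a quick sketch against my history
dir; newly written application logs should get a .snappy suffix once the
setting is picked up):

hdfs dfs -ls /user/spark/applicationHistory | tail -5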

On Mon, Mar 26, 2018 at 8:56 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> You're either doing something wrong, or talking about different logs.
> I just added that to my config and ran spark-shell.
>
> $ hdfs dfs -ls /user/spark/applicationHistory | grep
> application_1522085988298_0002
> -rwxrwx---   3 blah blah   9844 2018-03-26 10:54
> /user/spark/applicationHistory/application_1522085988298_0002.snappy
>
>
>
> On Mon, Mar 26, 2018 at 10:48 AM, Fawze Abujaber <fawz...@gmail.com>
> wrote:
> > I distributed this config to all the nodes cross the cluster and with no
> > success, new spark logs still uncompressed.
> >
> > On Mon, Mar 26, 2018 at 8:12 PM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
> >>
> >> Spark should be using the gateway's configuration. Unless you're
> >> launching the application from a different node, if the setting is
> >> there, Spark should be using it.
> >>
> >> You can also look in the UI's environment page to see the
> >> configuration that the app is using.
> >>
> >> On Mon, Mar 26, 2018 at 10:10 AM, Fawze Abujaber <fawz...@gmail.com>
> >> wrote:
> >> > I see this configuration only on the spark gateway server, and my
> spark
> >> > is
> >> > running on Yarn, so I think I missing something ...
> >> >
> >> > I’m using cloudera manager to set this parameter, maybe I need to add
> >> > this
> >> > parameter in other configuration
> >> >
> >> > On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin <van...@cloudera.com>
> wrote:
> >> >>
> >> >> If the spark-defaults.conf file in the machine where you're starting
> >> >> the Spark app has that config, then that's all that should be needed.
> >> >>
> >> >> On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber <fawz...@gmail.com>
> >> >> wrote:
> >> >> > Thanks Marcelo,
> >> >> >
> >> >> > Yes I was was expecting to see the new apps compressed but I don’t
> ,
> >> >> > do
> >> >> > I
> >> >> > need to perform restart to spark or Yarn?
> >> >> >
> >> >> > On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin <van...@cloudera.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Log compression is a client setting. Doing that will make new apps
> >> >> >> write event logs in compressed format.
> >> >> >>
> >> >> >> The SHS doesn't compress existing logs.
> >> >> >>
> >> >> >> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber <
> fawz...@gmail.com>
> >> >> >> wrote:
> >> >> >> > Hi All,
> >> >> >> >
> >> >> >> > I'm trying to compress the logs at SPark history server, i added
> >> >> >> > spark.eventLog.compress=true to spark-defaults.conf to spark
> Spark
> >> >> >> > Client
> >> >> >> > Advanced Configuration Snippet (Safety Valve) for
> >> >> >> > spark-conf/spark-defaults.conf
> >> >> >> >
> >> >> >> > which i see applied only to the spark gateway servers spark
> conf.
> >> >> >> >
> >> >> >> > What i missing to get this working ?
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Marcelo
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Marcelo
> >>
> >>
> >>
> >> --
> >> Marcelo
> >
> >
>
>
>
> --
> Marcelo
>


Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
I distributed this config to all the nodes across the cluster, but with no
success; new Spark logs are still uncompressed.

On Mon, Mar 26, 2018 at 8:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Spark should be using the gateway's configuration. Unless you're
> launching the application from a different node, if the setting is
> there, Spark should be using it.
>
> You can also look in the UI's environment page to see the
> configuration that the app is using.
>
> On Mon, Mar 26, 2018 at 10:10 AM, Fawze Abujaber <fawz...@gmail.com>
> wrote:
> > I see this configuration only on the spark gateway server, and my spark
> is
> > running on Yarn, so I think I missing something ...
> >
> > I’m using cloudera manager to set this parameter, maybe I need to add
> this
> > parameter in other configuration
> >
> > On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin <van...@cloudera.com> wrote:
> >>
> >> If the spark-defaults.conf file in the machine where you're starting
> >> the Spark app has that config, then that's all that should be needed.
> >>
> >> On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber <fawz...@gmail.com>
> >> wrote:
> >> > Thanks Marcelo,
> >> >
> >> > Yes I was was expecting to see the new apps compressed but I don’t ,
> do
> >> > I
> >> > need to perform restart to spark or Yarn?
> >> >
> >> > On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin <van...@cloudera.com>
> wrote:
> >> >>
> >> >> Log compression is a client setting. Doing that will make new apps
> >> >> write event logs in compressed format.
> >> >>
> >> >> The SHS doesn't compress existing logs.
> >> >>
> >> >> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber <fawz...@gmail.com>
> >> >> wrote:
> >> >> > Hi All,
> >> >> >
> >> >> > I'm trying to compress the logs at SPark history server, i added
> >> >> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark
> >> >> > Client
> >> >> > Advanced Configuration Snippet (Safety Valve) for
> >> >> > spark-conf/spark-defaults.conf
> >> >> >
> >> >> > which i see applied only to the spark gateway servers spark conf.
> >> >> >
> >> >> > What i missing to get this working ?
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Marcelo
> >>
> >>
> >>
> >> --
> >> Marcelo
>
>
>
> --
> Marcelo
>


Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
I see this configuration only on the Spark gateway server, and my Spark is
running on YARN, so I think I'm missing something ...

I'm using Cloudera Manager to set this parameter; maybe I need to add this
parameter in another configuration.

On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin <van...@cloudera.com> wrote:

> If the spark-defaults.conf file in the machine where you're starting
> the Spark app has that config, then that's all that should be needed.
>
> On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber <fawz...@gmail.com>
> wrote:
> > Thanks Marcelo,
> >
> > Yes I was was expecting to see the new apps compressed but I don’t , do I
> > need to perform restart to spark or Yarn?
> >
> > On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin <van...@cloudera.com> wrote:
> >>
> >> Log compression is a client setting. Doing that will make new apps
> >> write event logs in compressed format.
> >>
> >> The SHS doesn't compress existing logs.
> >>
> >> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber <fawz...@gmail.com>
> wrote:
> >> > Hi All,
> >> >
> >> > I'm trying to compress the logs at SPark history server, i added
> >> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark
> >> > Client
> >> > Advanced Configuration Snippet (Safety Valve) for
> >> > spark-conf/spark-defaults.conf
> >> >
> >> > which i see applied only to the spark gateway servers spark conf.
> >> >
> >> > What i missing to get this working ?
> >>
> >>
> >>
> >> --
> >> Marcelo
>
>
>
> --
> Marcelo
>


Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
Thanks, Marcelo.

Yes, I was expecting to see the new apps compressed, but I don't; do I need
to restart Spark or YARN?

On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin <van...@cloudera.com> wrote:

> Log compression is a client setting. Doing that will make new apps
> write event logs in compressed format.
>
> The SHS doesn't compress existing logs.
>
> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber <fawz...@gmail.com> wrote:
> > Hi All,
> >
> > I'm trying to compress the logs at SPark history server, i added
> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark Client
> > Advanced Configuration Snippet (Safety Valve) for
> > spark-conf/spark-defaults.conf
> >
> > which i see applied only to the spark gateway servers spark conf.
> >
> > What i missing to get this working ?
>
>
>
> --
> Marcelo
>


Spark logs compression

2018-03-26 Thread Fawze Abujaber
Hi All,

I'm trying to compress the logs at the Spark history server. I added
spark.eventLog.compress=true to spark-defaults.conf via the Spark Client
Advanced Configuration Snippet (Safety Valve) for
spark-conf/spark-defaults.conf,

which I see applied only to the Spark gateway servers' spark conf.

What am I missing to get this working?


Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-25 Thread Fawze Abujaber
Nice!

Shmuel, were you able to run it at the cluster level or for a specific job?

Did you configure it in spark-defaults.conf?

On Sun, 25 Mar 2018 at 10:34 Shmuel Blitz <shmuel.bl...@similarweb.com>
wrote:

> Just to let you know, I have managed to run SparkLens on our cluster.
>
> I switched to the spark_1.6 branch, and also compiled against the specific
> image of Spark we are using (cdh5.7.6).
>
> Now I need to figure out what the output means... :P
>
> Shmuel
>
> On Fri, Mar 23, 2018 at 7:24 PM, Fawze Abujaber <fawz...@gmail.com> wrote:
>
>> Quick question:
>>
>> how to add the  --jars /path/to/sparklens_2.11-0.1.0.jar to the
>> spark-default conf, should it be using:
>>
>> spark.driver.extraClassPath /path/to/sparklens_2.11-0.1.0.jar or i
>> should use spark.jars option? anyone who could give an example how it
>> should be, and if i the path for the jar should be an hdfs path as i'm
>> using it in cluster mode.
>>
>>
>>
>>
>> On Fri, Mar 23, 2018 at 6:33 AM, Fawze Abujaber <fawz...@gmail.com>
>> wrote:
>>
>>> Hi Shmuel,
>>>
>>> Did you compile the code against the right branch for Spark 1.6.
>>>
>>> I tested it and it looks working and now i'm testing the branch for a
>>> wide tests, Please use the branch for Spark 1.6
>>>
>>> On Fri, Mar 23, 2018 at 12:43 AM, Shmuel Blitz <
>>> shmuel.bl...@similarweb.com> wrote:
>>>
>>>> Hi Rohit,
>>>>
>>>> Thanks for sharing this great tool.
>>>> I tried running a spark job with the tool, but it failed with an 
>>>> *IncompatibleClassChangeError
>>>> *Exception.
>>>>
>>>> I have opened an issue on Github.(https://github.com/
>>>> qubole/sparklens/issues/1)
>>>>
>>>> Shmuel
>>>>
>>>> On Thu, Mar 22, 2018 at 5:05 PM, Shmuel Blitz <
>>>> shmuel.bl...@similarweb.com> wrote:
>>>>
>>>>> Thanks.
>>>>>
>>>>> We will give this a try and report back.
>>>>>
>>>>> Shmuel
>>>>>
>>>>> On Thu, Mar 22, 2018 at 4:22 PM, Rohit Karlupia <roh...@qubole.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks everyone!
>>>>>> Please share how it works and how it doesn't. Both help.
>>>>>>
>>>>>> Fawaze, just made few changes to make this work with spark 1.6. Can
>>>>>> you please try building from branch *spark_1.6*
>>>>>>
>>>>>> thanks,
>>>>>> rohitk
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 22, 2018 at 10:18 AM, Fawze Abujaber <fawz...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> It's super amazing  i see it was tested on spark 2.0.0 and
>>>>>>> above, what about Spark 1.6 which is still part of Cloudera's main 
>>>>>>> versions?
>>>>>>>
>>>>>>> We have a vast Spark applications with version 1.6.0
>>>>>>>
>>>>>>> On Thu, Mar 22, 2018 at 6:38 AM, Holden Karau <hol...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Super exciting! I look forward to digging through it this weekend.
>>>>>>>>
>>>>>>>> On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
>>>>>>>> ravishankar.n...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Excellent. You filled a missing link.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Passion
>>>>>>>>>
>>>>>>>>> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia <
>>>>>>>>> roh...@qubole.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Happy to announce the availability of Sparklens as open source
>>>>>>>>>> project. It helps in understanding the  scalability limits of spark
>>>>>>>>>> applications and can be a useful guide on the path towards tuning
>>>>>>>>>> applications for lower runtime or cost.
>>>>>>>>>>
>>>>>>>>>> Please clone from here: https://github.com/qubole/sparklens
>>>>>>>>>> Old blogpost: https://www.qubole.com/blog/introducing-quboles-
>>>>>>>>>> spark-tuning-tool/
>>>>>>>>>>
>>>>>>>>>> thanks,
>>>>>>>>>> rohitk
>>>>>>>>>>
>>>>>>>>>> PS: Thanks for the patience. It took couple of months to get back
>>>>>>>>>> on this.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Shmuel Blitz
>>>>> Big Data Developer
>>>>> Email: shmuel.bl...@similarweb.com
>>>>> www.similarweb.com
>>>>> <https://www.facebook.com/SimilarWeb/>
>>>>> <https://www.linkedin.com/company/429838/>
>>>>> <https://twitter.com/similarweb>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Shmuel Blitz
>>>> Big Data Developer
>>>> Email: shmuel.bl...@similarweb.com
>>>> www.similarweb.com
>>>> <https://www.facebook.com/SimilarWeb/>
>>>> <https://www.linkedin.com/company/429838/>
>>>> <https://twitter.com/similarweb>
>>>>
>>>
>>>
>>
>
>
> --
> Shmuel Blitz
> Big Data Developer
> Email: shmuel.bl...@similarweb.com
> www.similarweb.com
> <https://www.facebook.com/SimilarWeb/>
> <https://www.linkedin.com/company/429838/>
> <https://twitter.com/similarweb>
>


Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-23 Thread Fawze Abujaber
Quick question:

How do I add --jars /path/to/sparklens_2.11-0.1.0.jar to the spark-defaults
conf? Should it be:

spark.driver.extraClassPath /path/to/sparklens_2.11-0.1.0.jar, or should I use
the spark.jars option? Could anyone give an example of how it should look, and
whether the path to the jar should be an HDFS path, since I'm using it in
cluster mode?
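
To make the question concrete, what I have in mind is something like the
following in spark-defaults.conf (the HDFS path is a placeholder, and the
listener class is what I understood from the Sparklens README, so treat both
as assumptions):

spark.jars  hdfs:///user/spark/jars/sparklens_2.11-0.1.0.jar
spark.extraListeners  com.qubole.sparklens.QuboleJobListener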




On Fri, Mar 23, 2018 at 6:33 AM, Fawze Abujaber <fawz...@gmail.com> wrote:

> Hi Shmuel,
>
> Did you compile the code against the right branch for Spark 1.6.
>
> I tested it and it looks working and now i'm testing the branch for a wide
> tests, Please use the branch for Spark 1.6
>
> On Fri, Mar 23, 2018 at 12:43 AM, Shmuel Blitz <
> shmuel.bl...@similarweb.com> wrote:
>
>> Hi Rohit,
>>
>> Thanks for sharing this great tool.
>> I tried running a spark job with the tool, but it failed with an 
>> *IncompatibleClassChangeError
>> *Exception.
>>
>> I have opened an issue on Github.(https://github.com/qub
>> ole/sparklens/issues/1)
>>
>> Shmuel
>>
>> On Thu, Mar 22, 2018 at 5:05 PM, Shmuel Blitz <
>> shmuel.bl...@similarweb.com> wrote:
>>
>>> Thanks.
>>>
>>> We will give this a try and report back.
>>>
>>> Shmuel
>>>
>>> On Thu, Mar 22, 2018 at 4:22 PM, Rohit Karlupia <roh...@qubole.com>
>>> wrote:
>>>
>>>> Thanks everyone!
>>>> Please share how it works and how it doesn't. Both help.
>>>>
>>>> Fawaze, just made few changes to make this work with spark 1.6. Can you
>>>> please try building from branch *spark_1.6*
>>>>
>>>> thanks,
>>>> rohitk
>>>>
>>>>
>>>>
>>>> On Thu, Mar 22, 2018 at 10:18 AM, Fawze Abujaber <fawz...@gmail.com>
>>>> wrote:
>>>>
>>>>> It's super amazing  i see it was tested on spark 2.0.0 and above,
>>>>> what about Spark 1.6 which is still part of Cloudera's main versions?
>>>>>
>>>>> We have a vast Spark applications with version 1.6.0
>>>>>
>>>>> On Thu, Mar 22, 2018 at 6:38 AM, Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Super exciting! I look forward to digging through it this weekend.
>>>>>>
>>>>>> On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
>>>>>> ravishankar.n...@gmail.com> wrote:
>>>>>>
>>>>>>> Excellent. You filled a missing link.
>>>>>>>
>>>>>>> Best,
>>>>>>> Passion
>>>>>>>
>>>>>>> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia <roh...@qubole.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Happy to announce the availability of Sparklens as open source
>>>>>>>> project. It helps in understanding the  scalability limits of spark
>>>>>>>> applications and can be a useful guide on the path towards tuning
>>>>>>>> applications for lower runtime or cost.
>>>>>>>>
>>>>>>>> Please clone from here: https://github.com/qubole/sparklens
>>>>>>>> Old blogpost: https://www.qubole.com/blog/introducing-quboles-sp
>>>>>>>> ark-tuning-tool/
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> rohitk
>>>>>>>>
>>>>>>>> PS: Thanks for the patience. It took couple of months to get back
>>>>>>>> on this.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Shmuel Blitz
>>> Big Data Developer
>>> Email: shmuel.bl...@similarweb.com
>>> www.similarweb.com
>>> <https://www.facebook.com/SimilarWeb/>
>>> <https://www.linkedin.com/company/429838/>
>>> <https://twitter.com/similarweb>
>>>
>>
>>
>>
>> --
>> Shmuel Blitz
>> Big Data Developer
>> Email: shmuel.bl...@similarweb.com
>> www.similarweb.com
>> <https://www.facebook.com/SimilarWeb/>
>> <https://www.linkedin.com/company/429838/>
>> <https://twitter.com/similarweb>
>>
>
>


Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-22 Thread Fawze Abujaber
Hi Shmuel,

Did you compile the code against the right branch for Spark 1.6?

I tested it and it looks like it's working; I'm now running wider tests on
the branch. Please use the branch for Spark 1.6.

On Fri, Mar 23, 2018 at 12:43 AM, Shmuel Blitz <shmuel.bl...@similarweb.com>
wrote:

> Hi Rohit,
>
> Thanks for sharing this great tool.
> I tried running a spark job with the tool, but it failed with an 
> *IncompatibleClassChangeError
> *Exception.
>
> I have opened an issue on Github.(https://github.com/
> qubole/sparklens/issues/1)
>
> Shmuel
>
> On Thu, Mar 22, 2018 at 5:05 PM, Shmuel Blitz <shmuel.bl...@similarweb.com
> > wrote:
>
>> Thanks.
>>
>> We will give this a try and report back.
>>
>> Shmuel
>>
>> On Thu, Mar 22, 2018 at 4:22 PM, Rohit Karlupia <roh...@qubole.com>
>> wrote:
>>
>>> Thanks everyone!
>>> Please share how it works and how it doesn't. Both help.
>>>
>>> Fawaze, just made few changes to make this work with spark 1.6. Can you
>>> please try building from branch *spark_1.6*
>>>
>>> thanks,
>>> rohitk
>>>
>>>
>>>
>>> On Thu, Mar 22, 2018 at 10:18 AM, Fawze Abujaber <fawz...@gmail.com>
>>> wrote:
>>>
>>>> It's super amazing  i see it was tested on spark 2.0.0 and above,
>>>> what about Spark 1.6 which is still part of Cloudera's main versions?
>>>>
>>>> We have a vast Spark applications with version 1.6.0
>>>>
>>>> On Thu, Mar 22, 2018 at 6:38 AM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Super exciting! I look forward to digging through it this weekend.
>>>>>
>>>>> On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
>>>>> ravishankar.n...@gmail.com> wrote:
>>>>>
>>>>>> Excellent. You filled a missing link.
>>>>>>
>>>>>> Best,
>>>>>> Passion
>>>>>>
>>>>>> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia <roh...@qubole.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Happy to announce the availability of Sparklens as open source
>>>>>>> project. It helps in understanding the  scalability limits of spark
>>>>>>> applications and can be a useful guide on the path towards tuning
>>>>>>> applications for lower runtime or cost.
>>>>>>>
>>>>>>> Please clone from here: https://github.com/qubole/sparklens
>>>>>>> Old blogpost: https://www.qubole.com/blog/introducing-quboles-sp
>>>>>>> ark-tuning-tool/
>>>>>>>
>>>>>>> thanks,
>>>>>>> rohitk
>>>>>>>
>>>>>>> PS: Thanks for the patience. It took couple of months to get back on
>>>>>>> this.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Shmuel Blitz
>> Big Data Developer
>> Email: shmuel.bl...@similarweb.com
>> www.similarweb.com
>> <https://www.facebook.com/SimilarWeb/>
>> <https://www.linkedin.com/company/429838/>
>> <https://twitter.com/similarweb>
>>
>
>
>
> --
> Shmuel Blitz
> Big Data Developer
> Email: shmuel.bl...@similarweb.com
> www.similarweb.com
> <https://www.facebook.com/SimilarWeb/>
> <https://www.linkedin.com/company/429838/>
> <https://twitter.com/similarweb>
>


Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-21 Thread Fawze Abujaber
It's super amazing. I see it was tested on Spark 2.0.0 and above; what
about Spark 1.6, which is still part of Cloudera's main versions?

We have a vast number of Spark applications on version 1.6.0.

On Thu, Mar 22, 2018 at 6:38 AM, Holden Karau  wrote:

> Super exciting! I look forward to digging through it this weekend.
>
> On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
> ravishankar.n...@gmail.com> wrote:
>
>> Excellent. You filled a missing link.
>>
>> Best,
>> Passion
>>
>> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia 
>> wrote:
>>
>>> Hi,
>>>
>>> Happy to announce the availability of Sparklens as open source project.
>>> It helps in understanding the  scalability limits of spark applications and
>>> can be a useful guide on the path towards tuning applications for lower
>>> runtime or cost.
>>>
>>> Please clone from here: https://github.com/qubole/sparklens
>>> Old blogpost: https://www.qubole.com/blog/introducing-quboles-
>>> spark-tuning-tool/
>>>
>>> thanks,
>>> rohitk
>>>
>>> PS: Thanks for the patience. It took couple of months to get back on
>>> this.
>>>
>>>
>>>
>>>
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
>


Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread Fawze Abujaber
It's recommended to use an executor-cores value of 5.

Each executor here will utilize 20 GB, which means the Spark job requests 50
CPU cores and 100 GB of memory in total.

You cannot run more than 4 executors because your cluster doesn't have enough
memory.

You see 5 containers because 4 are the job's executors and one is the
application master.

Compare the used memory with the total memory.
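
A rough back-of-the-envelope from the metrics above (assuming YARN's default
~10% executor memory overhead): each executor asks YARN for 20 GB + ~2 GB
overhead = 22 GB. With 90.50 GB of total YARN memory, 4 x 22 GB = 88 GB of
executors plus a small application-master container matches the 88.88 GB
shown as used, and the 22 GB shown as reserved is likely the pending 5th
executor that no longer fits. The 5 VCores Used is likely one vcore counted
per running container (4 executors + 1 AM), which is how YARN's default
resource calculator accounts for cores.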

On Mon, Feb 26, 2018 at 12:20 PM, Selvam Raman  wrote:

> Hi,
>
> spark version - 2.0.0
> spark distribution - EMR 5.0.0
>
> Spark Cluster - one master, 5 slaves
>
> Master node - m3.xlarge - 8 vCore, 15 GiB memory, 80 SSD GB storage
> Slave node - m3.2xlarge - 16 vCore, 30 GiB memory, 160 SSD GB storage
>
>
> Cluster Metrics
> Apps SubmittedApps PendingApps RunningApps CompletedContainers RunningMemory
> UsedMemory TotalMemory ReservedVCores UsedVCores TotalVCores ReservedActive
> NodesDecommissioning NodesDecommissioned NodesLost NodesUnhealthy 
> NodesRebooted
> Nodes
> 16 0 1 15 5 88.88 GB 90.50 GB 22 GB 5 79 1 5
>  0
>  0
>  5
>  0
>  0
> 
> I have submitted job with below configuration
> --num-executors 5 --executor-cores 10 --executor-memory 20g
>
>
>
> spark.task.cpus - be default 1
>
>
> My understanding is there will be 5 executore each can run 10 task at a
> time and task can share total memory of 20g. Here, i could see only 5
> vcores used which means 1 executor instance use 20g+10%overhead ram(22gb),
> 10 core(number of threads), 1 Vcore(cpu).
>
> please correct me if my understand is wrong.
>
> how can i utilize number of vcore in EMR effectively. Will Vcore boost
> performance?
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Scala version changed in spark job

2018-01-24 Thread Fawze Abujaber
Hi all,

I upgraded my Hadoop cluster, which includes Spark 1.6.0, and I noticed that
sometimes a job runs with Scala version 2.10.5 and sometimes with 2.10.4.
Any idea why this is happening?


Re: Spark application on yarn cluster clarification

2018-01-18 Thread Fawze Abujaber
Hi Soheil,

The ResourceManager and NodeManagers are enough; of course, you also need the
DataNode and NameNode roles to be able to access the data.
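
In other words, no Spark standalone master/worker daemons need to be running
for YARN; you just submit against the ResourceManager, for example (a sketch,
the class and jar names are placeholders):

spark-submit --master yarn --deploy-mode cluster \
  --num-executors 2 --executor-memory 2g --executor-cores 2 \
  --class com.example.MyApp /path/to/my-app.jar

YARN then launches the application master and executor containers on the
NodeManagers.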

On Thu, 18 Jan 2018 at 10:12 Soheil Pourbafrani 
wrote:

> I am setting up a Yarn cluster to run Spark applications on that, but I'm
> confused a bit!
>
> Consider I have a 4-node yarn cluster including one resource manager and 3
> node manager and spark are installed in all 4 nodes.
>
> Now my question is when I want to submit spark application to yarn
> cluster, is it needed spark daemons (both master and slaves) to be running,
> or not, running just resource and node managers suffice?
>
> Thanks
>