Re: IDE suitable for Spark : Monitoring & Debugging Spark Jobs

2020-04-07 Thread Som Lima
The definitive guide
Chapter 18:
Monitoring and Debugging

"This chapter covers the key details you need to monitor and debug your
Spark Applications.  To do this , we will walk through the spark UI with an
example query designed to help you understand how to trace your  own jobs
through the executions life cycle. The example we'll look at will also help
you to understand  how to debug your jobs and where errors are likely to
occur."








On Tue, 7 Apr 2020, 18:28 Pat Ferrel,  wrote:

> IntelliJ Scala works well when debugging master=local. Has anyone used it
> for remote/cluster debugging? I’ve heard it is possible...
>
>
> From: Luiz Camargo  
> Reply: Luiz Camargo  
> Date: April 7, 2020 at 10:26:35 AM
> To: Dennis Suhari 
> 
> Cc: yeikel valdes  ,
> zahidr1...@gmail.com  ,
> user@spark.apache.org  
> Subject:  Re: IDE suitable for Spark
>
> I have used IntelliJ Spark/Scala with the sbt tool
>
> On Tue, Apr 7, 2020 at 1:18 PM Dennis Suhari 
> wrote:
>
>> We are using Pycharm resp. R Studio with Spark libraries to submit Spark
>> Jobs.
>>
>> Sent from my iPhone
>>
>> On 07.04.2020 at 18:10, yeikel valdes wrote:
>>
>> 
>>
>> Zeppelin is not an IDE but a notebook.  It is helpful to experiment but
>> it is missing a lot of the features that we expect from an IDE.
>>
>> Thanks for sharing though.
>>
>>  On Tue, 07 Apr 2020 04:45:33 -0400 * zahidr1...@gmail.com
>>  * wrote 
>>
>> When I first logged on I asked if there was a suitable IDE for Spark.
>> I did get a couple of responses.
>> *Thanks.*
>>
>> I did actually find one which is suitable IDE for spark.
>> That is  *Apache Zeppelin.*
>>
>> One of many reasons it is suitable for Apache Spark is.
>> The  *up and running Stage* which involves typing *bin/zeppelin-daemon.sh
>> start*
>> Go to browser and type *http://localhost:8080 *
>> That's it!
>>
>> Then to
>> * Hit the ground running*
>> There are also ready to go Apache Spark examples
>> showing off the type of functionality one will be using in real life
>> production.
>>
>> Zeppelin comes with  embedded Apache Spark  and scala as default
>> interpreter with 20 + interpreters.
>> I have gone on to discover there are a number of other advantages for
>> real time production
>> environment with Zeppelin offered up by other Apache Products.
>>
>> Backbutton.co.uk
>> ¯\_(ツ)_/¯
>> ♡۶Java♡۶RMI ♡۶
>> Make Use Method {MUM}
>> makeuse.org
>> 
>>
>>
>>
>
> --
>
>
> Prof. Luiz Camargo
> Educador - Computação
>
>
>


Re: Scala version compatibility

2020-04-07 Thread Koert Kuipers
I think it will work then, assuming the call site hasn't changed between
Scala versions.
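
For reference, a sketch of the kind of call site being discussed (written in
Scala here purely for illustration; the actual code in question is Java, and
the session setup is a placeholder):

import scala.collection.JavaConverters

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructField

object SeqConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("seq-conversion").getOrCreate()
    val df = spark.range(5).toDF("id")

    // A Spark API that hands back a Scala Seq (StructType is a Seq[StructField])...
    val fields: Seq[StructField] = df.schema

    // ...converted to a java.util.List at the single call site that touches
    // the Scala standard library. Whether this survives a Scala version
    // mismatch depends on that call site staying binary compatible and on
    // Scala/Spark being provided (not bundled) dependencies.
    val javaFields: java.util.List[StructField] =
      JavaConverters.seqAsJavaListConverter(fields).asJava

    println(javaFields)
    spark.stop()
  }
}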

On Mon, Apr 6, 2020 at 5:09 PM Andrew Melo  wrote:

> Hello,
>
> On Mon, Apr 6, 2020 at 3:31 PM Koert Kuipers  wrote:
>
>> actually i might be wrong about this. did you declare scala to be a
>> provided dependency? so scala is not in your fat/uber jar? if so then maybe
>> it will work.
>>
>
> I declare spark to be a provided dependency, so Scala's not included in my
> artifact except for this single callsite.
>
> Thanks
> Andrew
>
>
>> On Mon, Apr 6, 2020 at 4:16 PM Andrew Melo  wrote:
>>
>>>
>>>
>>> On Mon, Apr 6, 2020 at 3:08 PM Koert Kuipers  wrote:
>>>
 yes it will


>>> Ooof, I was hoping that wasn't the case. I guess I need to figure out
>>> how to get Maven to compile/publish jars with different
>>> dependencies/artifactIDs like how sbt does? (or re-implement the
>>> functionality in java)
>>>
>>> Thanks for your help,
>>> Andrew
>>>
>>>
 On Mon, Apr 6, 2020 at 3:50 PM Andrew Melo 
 wrote:

> Hello all,
>
> I'm aware that Scala is not binary compatible between revisions. I
> have some Java code whose only Scala dependency is the transitive
> dependency through Spark. This code calls a Spark API which returns a
> Seq, which I then convert into a List with
> JavaConverters.seqAsJavaListConverter. Will this usage cause binary
> incompatibility if the jar is compiled in one Scala version and executed 
> in
> another?
>
> I tried grokking
> https://docs.scala-lang.org/overviews/core/binary-compatibility-of-scala-releases.html,
> and wasn't quite able to make heads or tails of this particular case.
>
> Thanks!
> Andrew
>
>
>


Re: Serialization or internal functions?

2020-04-07 Thread Som Lima
While the SparkSession is running, go to localhost:4040.

Select Stages from the menu.

Select the job you are interested in.

You can select additional metrics, including DAG visualisation.
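
If 4040 is already taken, the UI moves to 4041, 4042, and so on; a small
sketch (assuming a SparkSession named spark is in scope, e.g. in spark-shell)
to print the actual address:

// uiWebUrl is empty when the UI has been disabled via spark.ui.enabled=false
println(spark.sparkContext.uiWebUrl.getOrElse("Spark UI is disabled"))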





On Tue, 7 Apr 2020, 17:14 yeikel valdes,  wrote:

> Thanks for your input Soma , but I am actually looking to understand the
> differences and not only on the performance.
>
>  On Sun, 05 Apr 2020 02:21:07 -0400 * somplastic...@gmail.com
>  * wrote 
>
> If you want to  measure optimisation in terms of time taken , then here is
> an idea  :)
>
>
> public class MyClass {
> public static void main(String args[])
> throws InterruptedException
> {
>   long start  =  System.currentTimeMillis();
>
> // replace with your add column code
> // enough data to measure
>Thread.sleep(5000);
>
>  long end  = System.currentTimeMillis();
>
>int timeTaken = 0;
>   timeTaken = (int) (end  - start );
>
>   System.out.println("Time taken  " + timeTaken) ;
> }
> }
>
> On Sat, 4 Apr 2020, 19:07 ,  wrote:
>
> Dear Community,
>
>
>
> Recently, I had to solve the following problem “for every entry of a
> Dataset[String], concat a constant value” , and to solve it, I used
> built-in functions :
>
>
>
> val data = Seq("A","b","c").toDS
>
>
>
> scala> data.withColumn("valueconcat",concat(col(data.columns.head),lit("
> "),lit("concat"))).select("valueconcat").explain()
>
> == Physical Plan ==
>
> LocalTableScan [valueconcat#161]
>
>
>
> As an alternative , a much simpler version of the program is to use map,
> but it adds a serialization step that does not seem to be present for the
> version above :
>
>
>
> scala> data.map(e=> s"$e concat").explain
>
> == Physical Plan ==
>
> *(1) SerializeFromObject [staticinvoke(class
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0,
> java.lang.String, true], true, false) AS value#92]
>
> +- *(1) MapElements , obj#91: java.lang.String
>
>+- *(1) DeserializeToObject value#12.toString, obj#90: java.lang.String
>
>   +- LocalTableScan [value#12]
>
>
>
> Is this over-optimization or is this the right way to go?
>
>
>
> As a follow up , is there any better API to get the one and only column
> available in a DataSet[String] when using built-in functions?
> “col(data.columns.head)” works but it is not ideal.
>
>
>
> Thanks!
>
>
>


Re: IDE suitable for Spark

2020-04-07 Thread Nikaash Puri
I think as long as you set the master in the code to the correct cluster URL,
everything should work as expected. So you should be able to place breakpoints
in IntelliJ just as if you were running in local mode.
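
A sketch of that setup (the cluster URL and jar path below are placeholders,
not a definitive recipe):

import org.apache.spark.sql.SparkSession

object RemoteDebugSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ide-remote-debug")
      .master("spark://my-cluster-host:7077")             // instead of "local[*]"
      .config("spark.jars", "target/my-app-assembly.jar") // ship your classes to the executors
      .getOrCreate()

    // Driver-side code like this runs inside the IDE process, so IntelliJ
    // breakpoints here behave as they do in local mode. Code inside closures
    // runs on the remote executors; stepping through that usually requires
    // attaching a remote (JDWP) debugger to the executor JVMs instead.
    val total = spark.range(0, 1000000).selectExpr("sum(id)").first().getLong(0)
    println(s"sum = $total")
    spark.stop()
  }
}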

Get Outlook for iOS

From: Pat Ferrel 
Sent: Tuesday, April 7, 2020 10:58:39 PM
To: Luiz Camargo ; Dennis Suhari 

Cc: zahidr1...@gmail.com ; user@spark.apache.org 
; yeikel valdes 
Subject: Re: IDE suitable for Spark

IntelliJ Scala works well when debugging master=local. Has anyone used it for 
remote/cluster debugging? I’ve heard it is possible...


From: Luiz Camargo 
Reply: Luiz Camargo 
Date: April 7, 2020 at 10:26:35 AM
To: Dennis Suhari 

Cc: yeikel valdes , 
zahidr1...@gmail.com , 
user@spark.apache.org
Subject:  Re: IDE suitable for Spark

I have used IntelliJ Spark/Scala with the sbt tool

On Tue, Apr 7, 2020 at 1:18 PM Dennis Suhari  
wrote:
We are using Pycharm resp. R Studio with Spark libraries to submit Spark Jobs.

Sent from my iPhone

On 07.04.2020 at 18:10, yeikel valdes wrote:



Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it is 
missing a lot of the features that we expect from an IDE.

Thanks for sharing though.

 On Tue, 07 Apr 2020 04:45:33 -0400 zahidr1...@gmail.com wrote 

When I first logged on I asked if there was a suitable IDE for Spark.
I did get a couple of responses.
Thanks.

I did actually find one which is suitable IDE for spark.
That is  Apache Zeppelin.

One of many reasons it is suitable for Apache Spark is.
The  up and running Stage which involves typing bin/zeppelin-daemon.sh start
Go to browser and type http://localhost:8080
That's it!

Then to
Hit the ground running
There are also ready to go Apache Spark examples
showing off the type of functionality one will be using in real life production.

Zeppelin comes with  embedded Apache Spark  and scala as default interpreter 
with 20 + interpreters.
I have gone on to discover there are a number of other advantages for real time 
production
environment with Zeppelin offered up by other Apache Products.

Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org




--


Prof. Luiz Camargo
Educador - Computação



Re: IDE suitable for Spark

2020-04-07 Thread Pat Ferrel
IntelliJ Scala works well when debugging master=local. Has anyone used it for 
remote/cluster debugging? I’ve heard it is possible...


From: Luiz Camargo 
Reply: Luiz Camargo 
Date: April 7, 2020 at 10:26:35 AM
To: Dennis Suhari 
Cc: yeikel valdes , zahidr1...@gmail.com 
, user@spark.apache.org 
Subject:  Re: IDE suitable for Spark  

I have used IntelliJ Spark/Scala with the sbt tool

On Tue, Apr 7, 2020 at 1:18 PM Dennis Suhari  
wrote:
We are using Pycharm resp. R Studio with Spark libraries to submit Spark Jobs. 

Sent from my iPhone

On 07.04.2020 at 18:10, yeikel valdes wrote:



Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it is 
missing a lot of the features that we expect from an IDE.

Thanks for sharing though. 

 On Tue, 07 Apr 2020 04:45:33 -0400 zahidr1...@gmail.com wrote 

When I first logged on I asked if there was a suitable IDE for Spark.
I did get a couple of responses.  
Thanks.  

I did actually find one which is suitable IDE for spark.  
That is  Apache Zeppelin.

One of many reasons it is suitable for Apache Spark is.
The  up and running Stage which involves typing bin/zeppelin-daemon.sh start
Go to browser and type http://localhost:8080  
That's it!

Then to
Hit the ground running   
There are also ready to go Apache Spark examples
showing off the type of functionality one will be using in real life production.

Zeppelin comes with  embedded Apache Spark  and scala as default interpreter 
with 20 + interpreters.
I have gone on to discover there are a number of other advantages for real time 
production
environment with Zeppelin offered up by other Apache Products.

Backbutton.co.uk
¯\_(ツ)_/¯  
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



--  


Prof. Luiz Camargo
Educador - Computação



Re: IDE suitable for Spark

2020-04-07 Thread Luiz Camargo
I have used IntelliJ Spark/Scala with the sbt tool

On Tue, Apr 7, 2020 at 1:18 PM Dennis Suhari 
wrote:

> We are using Pycharm resp. R Studio with Spark libraries to submit Spark
> Jobs.
>
> Sent from my iPhone
>
> On 07.04.2020 at 18:10, yeikel valdes wrote:
>
> 
>
> Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it
> is missing a lot of the features that we expect from an IDE.
>
> Thanks for sharing though.
>
>  On Tue, 07 Apr 2020 04:45:33 -0400 * zahidr1...@gmail.com
>  * wrote 
>
> When I first logged on I asked if there was a suitable IDE for Spark.
> I did get a couple of responses.
> *Thanks.*
>
> I did actually find one which is suitable IDE for spark.
> That is  *Apache Zeppelin.*
>
> One of many reasons it is suitable for Apache Spark is.
> The  *up and running Stage* which involves typing *bin/zeppelin-daemon.sh
> start*
> Go to browser and type *http://localhost:8080 *
> That's it!
>
> Then to
> * Hit the ground running*
> There are also ready to go Apache Spark examples
> showing off the type of functionality one will be using in real life
> production.
>
> Zeppelin comes with  embedded Apache Spark  and scala as default
> interpreter with 20 + interpreters.
> I have gone on to discover there are a number of other advantages for real
> time production
> environment with Zeppelin offered up by other Apache Products.
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>
>
>

-- 


Prof. Luiz Camargo
Educador - Computação


Spark Union Breaks Caching Behaviour

2020-04-07 Thread Yi Huang
Dear Community,

I am a beginner of using Spark. I am confused by the comment of the
following method.

def union(other: Dataset[T]): Dataset[T] = withSetOperator {
  // This breaks caching, but it's usually ok because it addresses a very
  // specific use case: using union to union many files or partitions.
  CombineUnions(Union(logicalPlan, other.logicalPlan)).mapChildren(AnalysisBarrier)
}

and here is the corresponding PR comment
https://github.com/apache/spark/pull/10577#discussion_r48820132


Another option would just be to do this at construction time, that way we
can avoid paying the cost in the analyzer. *This would still limit the
cases we could cache (i.e. we'd miss cached data unioned with other data),
but that doesn't seem like a huge deal.*


Could anyone please kindly explain to me what *This breaks caching* means?
It would be awesome if an example were given.
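
A hedged reading of the comment, sketched below (not authoritative; whether
the cached data is actually reused can vary by Spark version): because
union() flattens nested unions eagerly, a cached intermediate union may no
longer appear as a sub-plan of a later union, so the cache can be missed.

import org.apache.spark.sql.SparkSession

object UnionCachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("union-caching").getOrCreate()

    val a = spark.range(0, 100)
    val b = spark.range(100, 200)
    val c = spark.range(200, 300)

    val ab = a.union(b).cache()
    ab.count() // materialize the cache for Union(a, b)

    // Because CombineUnions runs eagerly in union(), abc's plan is one flat
    // Union(a, b, c) rather than Union(Union(a, b), c), so the cached
    // Union(a, b) sub-plan is no longer present and may not be reused.
    val abc = ab.union(c)
    abc.explain() // check whether an InMemoryRelation appears in the plan
    spark.stop()
  }
}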

Best regards,
Yi Huang


Re:  IDE suitable for Spark

2020-04-07 Thread Dennis Suhari
We are using PyCharm and RStudio, respectively, with Spark libraries to submit Spark jobs.

Sent from my iPhone

> On 07.04.2020 at 18:10, yeikel valdes wrote:
> 
> 
> 
> Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it is 
> missing a lot of the features that we expect from an IDE.
> 
> Thanks for sharing though. 
> 
>  On Tue, 07 Apr 2020 04:45:33 -0400 zahidr1...@gmail.com wrote 
> 
> When I first logged on I asked if there was a suitable IDE for Spark.
> I did get a couple of responses. 
> Thanks. 
> 
> I did actually find one which is suitable IDE for spark. 
> That is  Apache Zeppelin.
> 
> One of many reasons it is suitable for Apache Spark is.
> The  up and running Stage which involves typing bin/zeppelin-daemon.sh start
> Go to browser and type http://localhost:8080 
> That's it!
> 
> Then to
> Hit the ground running  
> There are also ready to go Apache Spark examples
> showing off the type of functionality one will be using in real life 
> production.
> 
> Zeppelin comes with  embedded Apache Spark  and scala as default interpreter 
> with 20 + interpreters.
> I have gone on to discover there are a number of other advantages for real 
> time production
> environment with Zeppelin offered up by other Apache Products.
> 
> Backbutton.co.uk
> ¯\_(ツ)_/¯ 
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 


Re: IDE suitable for Spark

2020-04-07 Thread Stephen Boesch
I have been using IDEA for both Scala/Spark and PySpark projects since
2013. It required a fair amount of fiddling that first year but has been
stable since early 2015. For PySpark-only projects, PyCharm naturally also
works very well.

On Tue, Apr 7, 2020 at 09:10, yeikel valdes wrote:

>
> Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it
> is missing a lot of the features that we expect from an IDE.
>
> Thanks for sharing though.
>
>  On Tue, 07 Apr 2020 04:45:33 -0400 * zahidr1...@gmail.com
>  * wrote 
>
> When I first logged on I asked if there was a suitable IDE for Spark.
> I did get a couple of responses.
> *Thanks.*
>
> I did actually find one which is suitable IDE for spark.
> That is  *Apache Zeppelin.*
>
> One of many reasons it is suitable for Apache Spark is.
> The  *up and running Stage* which involves typing *bin/zeppelin-daemon.sh
> start*
> Go to browser and type *http://localhost:8080 *
> That's it!
>
> Then to
> * Hit the ground running*
> There are also ready to go Apache Spark examples
> showing off the type of functionality one will be using in real life
> production.
>
> Zeppelin comes with  embedded Apache Spark  and scala as default
> interpreter with 20 + interpreters.
> I have gone on to discover there are a number of other advantages for real
> time production
> environment with Zeppelin offered up by other Apache Products.
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>
>
>


Re: Serialization or internal functions?

2020-04-07 Thread yeikel valdes
Thanks for your input, Som, but I am actually looking to understand the
differences, and not only the performance.


 On Sun, 05 Apr 2020 02:21:07 -0400 somplastic...@gmail.com wrote 


If you want to measure optimisation in terms of time taken, then here is an
idea :)




public class MyClass {
    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();

        // replace this with your add-column code
        // (use enough data to measure)
        Thread.sleep(5000);

        long end = System.currentTimeMillis();

        int timeTaken = (int) (end - start);

        System.out.println("Time taken  " + timeTaken);
    }
}
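
As an aside, Spark's own SparkSession.time wraps a block with essentially the
same stopwatch (available since Spark 2.1). A sketch, assuming a session named
spark as in spark-shell:

val result = spark.time {
  // replace with the add-column code you want to measure
  spark.range(0, 1000000).selectExpr("concat(cast(id as string), ' concat')").count()
}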


On Sat, 4 Apr 2020, 19:07 ,  wrote:


Dear Community,

 

Recently, I had to solve the following problem “for every entry of a 
Dataset[String], concat a constant value” , and to solve it, I used built-in 
functions :

 

val data = Seq("A","b","c").toDS

 

scala> data.withColumn("valueconcat",concat(col(data.columns.head),lit(" 
"),lit("concat"))).select("valueconcat").explain()

== Physical Plan ==

LocalTableScan [valueconcat#161]

 

As an alternative , a much simpler version of the program is to use map, but it 
adds a serialization step that does not seem to be present for the version 
above :

 

scala> data.map(e=> s"$e concat").explain

== Physical Plan ==

*(1) SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
java.lang.String, true], true, false) AS value#92]

+- *(1) MapElements , obj#91: java.lang.String

   +- *(1) DeserializeToObject value#12.toString, obj#90: java.lang.String

  +- LocalTableScan [value#12]

 

Is this over-optimization or is this the right way to go?  

 

As a follow up , is there any better API to get the one and only column 
available in a DataSet[String] when using built-in functions? 
“col(data.columns.head)” works but it is not ideal.

 

Thanks!
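
On the follow-up question, a hedged note: a Dataset[String] built with toDS
gets the default column name "value", so (assuming that default) the column
can be referenced directly instead of going through data.columns.head:

import org.apache.spark.sql.functions.{col, concat, lit}

// assumes spark-shell, where spark.implicits._ is already imported
val data = Seq("A", "b", "c").toDS

// The single column of a Dataset[String] is named "value" by default.
data.withColumn("valueconcat", concat(col("value"), lit(" concat")))
  .select("valueconcat")
  .explain()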

Re: IDE suitable for Spark

2020-04-07 Thread yeikel valdes

Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it is 
missing a lot of the features that we expect from an IDE.


Thanks for sharing though. 


 On Tue, 07 Apr 2020 04:45:33 -0400 zahidr1...@gmail.com wrote 


When I first logged on I asked if there was a suitable IDE for Spark.
I did get a couple of responses.

Thanks.



I did actually find one which is suitable IDE for spark.

That is  Apache Zeppelin.


One of many reasons it is suitable for Apache Spark is.
The  up and running Stage which involves typing bin/zeppelin-daemon.sh start
Go to browser and type http://localhost:8080

That's it!


Then to
Hit the ground running 

There are also ready to go Apache Spark examples
showing off the type of functionality one will be using in real life production.



Zeppelin comes with  embedded Apache Spark  and scala as default interpreter 
with 20 + interpreters.
I have gone on to discover there are a number of other advantages for real time 
production
environment with Zeppelin offered up by other Apache Products.



Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org


IDE suitable for Spark

2020-04-07 Thread Zahid Rahman
When I first logged on I asked if there was a suitable IDE for Spark.
I did get a couple of responses.
*Thanks.*

I did actually find one that is a suitable IDE for Spark:
*Apache Zeppelin.*

One of many reasons it is suitable for Apache Spark is the *up and running
stage*, which involves typing *bin/zeppelin-daemon.sh start*, going to a
browser, and opening *http://localhost:8080*.
That's it!

Then, to *hit the ground running*, there are also ready-to-go Apache Spark
examples showing off the type of functionality one will be using in
real-life production.

Zeppelin comes with embedded Apache Spark and Scala as the default
interpreter, plus 20+ other interpreters.
I have gone on to discover a number of other advantages of Zeppelin for a
real-time production environment, offered up by other Apache products.

Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



Lifecycle of a map function

2020-04-07 Thread Vadim Vararu
Hi all,

I'm trying to understand the lifecycle of a map function in a Spark/YARN
context. My understanding is that the function is instantiated on the
master and then passed to each executor (serialized/deserialized).

What I'd like to confirm is that the function is
initialized/loaded/deserialized once per executor (a JVM in YARN) and lives
as long as the executor lives, and not once per task (the logical unit of
work to do).

Could you please explain or, better, give some links to source code or
documentation? I've tried to take a look at Task.scala and ResultTask.scala,
but I'm not familiar with Scala and didn't find where exactly the function
lifecycle is managed.
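
One way to probe this empirically, as a sketch (the println output from
executors goes to their stderr, not the driver console; Scala object
initialization is per JVM, i.e. per executor, while the closure itself is
shipped with each task):

import org.apache.spark.sql.SparkSession

// A JVM-wide singleton: its lazy val is initialized at most once per
// executor JVM, the first time any task on that executor touches it.
object PerExecutorState {
  lazy val id: String = java.util.UUID.randomUUID().toString
}

object LifecycleProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lifecycle-probe").getOrCreate()

    // The map closure is serialized and sent with every task, but the number
    // of distinct ids it observes should track the number of executor JVMs,
    // not the number of tasks (100 partitions here).
    val distinctIds = spark.sparkContext
      .parallelize(1 to 1000, numSlices = 100)
      .map(_ => PerExecutorState.id)
      .distinct()
      .collect()

    println(s"distinct per-executor ids: ${distinctIds.length}")
    spark.stop()
  }
}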


Thanks in advance,
Vadim.