Re: Kotlin Spark API

2020-07-14 Thread Anwar AliKhan
Is kotlin another new language ?

GRADY BOOCH: The United States Department of Defense (DOD) is perhaps the
largest user of computers in the world. By the mid-1970s, software
development for its systems had reached crisis proportions: projects were
often late, over budget, and they often failed to meet their stated
requirements. It was evident that the problems would only worsen as
software development costs continued to rise exponentially. To help
resolve these problems, which were further compounded by the proliferation
of hundreds of different languages, the DOD sponsored the development of a
single, common high-order programming language. The winning design was
originally called the Green language (so called because of its team colour
code during the competition), and was renamed Ada.



On Tue, 14 Jul 2020, 18:42 Maria Khalusova,  wrote:

> Hi folks,
>
> We would love your feedback on the new Kotlin Spark API that we are
> working on: https://github.com/JetBrains/kotlin-spark-api.
>
> Why Kotlin Spark API? Kotlin developers can already use Kotlin with the
> existing Apache Spark Java API; however, they cannot take full advantage of
> Kotlin language features. With the Kotlin Spark API, you can use Kotlin data
> classes and lambda expressions.
>
> The API also adds some helpful extension functions. For example, you can
> use `withCached` to perform arbitrary transformations on a Dataset and not
> worry about the Dataset unpersisting at the end.
>
> If you like Kotlin and would like to try the API, we've prepared a Quick
> Start Guide to help you set up all the needed dependencies in no time using
> either Maven or Gradle:
> https://github.com/JetBrains/kotlin-spark-api/blob/master/docs/quick-start-guide.md
>
> In the repo, you’ll also find a few code examples to get an idea of what
> the API looks like:
> https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples
>
> We’d love to see your feedback in the project’s GitHub issues:
> https://github.com/JetBrains/kotlin-spark-api/issues.
>
>
> Thanks!
>
>
>


Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Anwar AliKhan
Ok, thanks.
You can buy it here

https://www.amazon.com/s?k=hands+on+machine+learning+with+scikit-learn+and+tensorflow+2=2U0P9XVIJ790T=Hands+on+machine+%2Caps%2C246=nb_sb_ss_i_1_17

This book is like an accompaniment to the Andrew Ng course on Coursera.
It uses the exact same mathematical notation, examples, etc., so it is a smooth
transition from that course.




On Tue, 14 Jul 2020, 15:52 Sean Owen,  wrote:

> It is still copyrighted material, no matter its state of editing. Yes,
> you should not be sharing this on the internet.
>
> On Tue, Jul 14, 2020 at 9:46 AM Anwar AliKhan 
> wrote:
> >
> > Please note it is freely available because it is an early, unedited raw
> edition.
> > It is not 100% complete; it is not entirely the same as yours.
> > So it is not piracy.
> > I agree it is a good book.
> >
>


Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Anwar AliKhan
Please note it is freely available because it is an early, unedited raw
edition.
It is not 100% complete; it is not entirely the same as yours.
So it is not piracy.
I agree it is a good book.







On Tue, 14 Jul 2020, 14:30 Patrick McCarthy, 
wrote:

> Please don't advocate for piracy, this book is not freely available.
>
> I own it and it's wonderful, Mr. Géron deserves to benefit from it.
>
> On Mon, Jul 13, 2020 at 9:59 PM Anwar AliKhan 
> wrote:
>
>>  link to a free book  which may be useful.
>>
>> Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow
>> Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien
>> Géron
>>
>> https://bit.ly/2zxueGt
>>
>>
>>
>>
>>
>>  13 Jul 2020, 15:18 Sean Owen,  wrote:
>>
>>> There is a multilayer perceptron implementation in Spark ML, but
>>> that's not what you're looking for.
>>> To parallelize model training developed using standard libraries like
>>> Keras, use Horovod from Uber.
>>> https://horovod.readthedocs.io/en/stable/spark_include.html
>>>
>>> On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan 
>>> wrote:
>>> >
>>> > Dear Spark User
>>> >
>>> > I am trying to parallelize the CNN (convolutional neural network)
>>> model using spark. I have developed the model using python and Keras
>>> library. The model works fine on a single machine but when we try on
>>> multiple machines, the execution time remains the same as sequential.
>>> > Could you please tell me whether there is any built-in library to
>>> parallelize a CNN in the Spark framework. Moreover, MLlib does not have any
>>> support for CNN.
>>> > Best regards
>>> > Mukhtaj
>>> >
>>> >
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Anwar AliKhan
 link to a free book  which may be useful.

Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow
Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien
Géron

https://bit.ly/2zxueGt





 13 Jul 2020, 15:18 Sean Owen,  wrote:

> There is a multilayer perceptron implementation in Spark ML, but
> that's not what you're looking for.
> To parallelize model training developed using standard libraries like
> Keras, use Horovod from Uber.
> https://horovod.readthedocs.io/en/stable/spark_include.html
>
> On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan  wrote:
> >
> > Dear Spark User
> >
> > I am trying to parallelize the CNN (convolutional neural network) model
> using spark. I have developed the model using python and Keras library. The
> model works fine on a single machine but when we try on multiple machines,
> the execution time remains the same as sequential.
> > Could you please tell me whether there is any built-in library to
> parallelize a CNN in the Spark framework. Moreover, MLlib does not have any
> support for CNN.
> > Best regards
> > Mukhtaj
> >
> >
> >
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Anwar AliKhan
This is very useful for me, leading on from week 4 of the Andrew Ng course.


On Mon, 13 Jul 2020, 15:18 Sean Owen,  wrote:

> There is a multilayer perceptron implementation in Spark ML, but
> that's not what you're looking for.
> To parallelize model training developed using standard libraries like
> Keras, use Horovod from Uber.
> https://horovod.readthedocs.io/en/stable/spark_include.html
>
> On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan  wrote:
> >
> > Dear Spark User
> >
> > I am trying to parallelize the CNN (convolutional neural network) model
> using spark. I have developed the model using python and Keras library. The
> model works fine on a single machine but when we try on multiple machines,
> the execution time remains the same as sequential.
> > Could you please tell me whether there is any built-in library to
> parallelize a CNN in the Spark framework. Moreover, MLlib does not have any
> support for CNN.
> > Best regards
> > Mukhtaj
> >
> >
> >
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Blog : Apache Spark Window Functions

2020-07-13 Thread Anwar AliKhan
Further to the feedback you requested,
I forgot to mention another point: with the insight you will gain
after three weeks spent on that course,
you will be on par with the aforementioned minority of engineers who are
helping their companies "make tons of money", a quote from Professor Andrew
Ng.

You will no longer be part of the majority of engineers
who are spending six months on analytical projects when, from day one, YOU
can see it wasn't going to work. Another quote from Professor Andrew Ng.


If you value the idea of joining the minority of engineers "making tons of
money for companies", then the same three weeks spent on that course will yield
greater value than the same time spent on writing Apache Spark examples of the
type you are currently engaged in.

I have gone past week 3, so I have the insight.

It is against my personal values to use a product which is given on a
trial-period basis, so I use the free Octave, a project started 32 years ago.
You can profit from MATLAB's investment:
you can watch MATLAB videos on how to use and apply what you have learnt to
Octave, because the syntax is exactly the same.

Then you can parallelise your Octave app on Apache Spark. You can use
Apache Spark in standalone mode while you prototype, then, with one line of
code, change the parallelism to distributed parallelism across a cluster
(or clusters) of PCs.
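
A minimal Scala sketch of what that "one line of code" change looks like in
practice (the app name and the cluster URL spark://host:7077 are hypothetical
placeholders, not from any real setup):

import org.apache.spark.sql.SparkSession

// Prototype standalone on one machine:
val spark = SparkSession.builder()
  .appName("octave-port")        // hypothetical app name
  .master("local[*]")            // all local cores
  .getOrCreate()

// For distributed parallelism across a cluster, only the master line changes, e.g.
//   .master("spark://host:7077")   // hypothetical standalone cluster URL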



On Fri, 10 Jul 2020, 04:50 Anwar AliKhan,  wrote:

> My opinion would be go here.
>
> https://www.coursera.org/courses?query=machine%20learning%20andrew%20ng
>
> Machine learning by Andrew Ng.
>
> After three weeks you will have more valuable skills than most engineers
> in Silicon Valley in the USA. I am past week 3.
>
> He does go 90 miles per hour.
> I  wish somebody had pointed me there as the starting point.
>
>
>
> On Thu, 25 Jun 2020, 18:58 neeraj bhadani, 
> wrote:
>
>> Hi Team,
>>  I would like to share with the community that my blog on "Apache
>> Spark Window Functions" got published. PFB link if anyone is interested.
>>
>> Link:
>> https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-window-functions-7b4e39ad3c86
>>
>> Please share your thoughts and feedback.
>>
>> Regards,
>> Neeraj
>>
>


Re: Blog : Apache Spark Window Functions

2020-07-09 Thread Anwar AliKhan
My opinion would be go here.

https://www.coursera.org/courses?query=machine%20learning%20andrew%20ng

Machine learning by Andrew Ng.

After three weeks you will have more valuable skills than most engineers in
Silicon Valley in the USA. I am past week 3.

He does go 90 miles per hour.
I  wish somebody had pointed me there as the starting point.



On Thu, 25 Jun 2020, 18:58 neeraj bhadani, 
wrote:

> Hi Team,
>  I would like to share with the community that my blog on "Apache
> Spark Window Functions" got published. PFB link if anyone is interested.
>
> Link:
> https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-window-functions-7b4e39ad3c86
>
> Please share your thoughts and feedback.
>
> Regards,
> Neeraj
>


Re: When is a Bigint a long and when is a long a long

2020-06-28 Thread Anwar AliKhan
I wish to draw your attention to this approach, where the BigInt data type
maps to Long without raising an error.

https://stackoverflow.com/questions/31011797/bug-in-spring-data-jpa-spring-data-returns-listbiginteger-instead-of-listlon

"This is a issue with Spring data JPA. If in DB the datatype is defined as
BigInteger and in JPA query we tries to fetch as Long then it will not give
any error , but it set value as BigInteger in Long datatype."


The use of spark.range(10).map(_.toLong).reduce(_+_)

means extra processing: the map method iterates through each element to
prepare a new dataset for the reduce function. I feel the extra
processing should be avoided.
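
A minimal sketch of the alternative I mean, assuming the usual Spark SQL sum
function is imported: let a distributed aggregation sum the column directly
instead of mapping every element first.

import org.apache.spark.sql.functions.sum

// Aggregate the "id" column of spark.range() without a per-element map pass.
val total = spark.range(1, 11)
  .agg(sum("id"))      // runs as a distributed aggregation
  .first()
  .getLong(0)          // 55 for 1..10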


On Sat, 27 Jun 2020, 17:36 Sean Owen,  wrote:

> It does not return a DataFrame. It returns Dataset[Long].
> You do not need to collect(). See my email.
>
> On Sat, Jun 27, 2020, 11:33 AM Anwar AliKhan 
> wrote:
>
>> So the range function actually returns BigInt (Spark SQL type)
>> and the fact Dataset[Long] and printSchema are displaying (toString())
>> Long instead of BigInt needs looking into.
>>
>> Putting that to one side
>>
>> My issue with using collect() to get around the casting of elements
>> returned
>> by range is,  I read some literature which says the collect() returns all
>> the data to the driver
>> and so can likely cause Out Of memory error.
>>
>> Question:
>> Is it correct that collect() behaves that way and can cause Out of memory
>> error ?
>>
>> Obviously it will be better to use  .map for casting because then the
>> work is being done by workers.
>> spark.range(10).map(_.toLong).reduce(_+_)
>> <http://www.backbutton.co.uk/>
>>
>>
>> On Sat, 27 Jun 2020, 15:42 Sean Owen,  wrote:
>>
>>> There are several confusing things going on here. I think this is part
>>> of the explanation, not 100% sure:
>>>
>>> 'bigint' is the Spark SQL type of an 8-byte long. 'long' is the type
>>> of a JVM primitive. Both are the same, conceptually, but represented
>>> differently internally as they are logically somewhat different ideas.
>>>
>>> The first thing I'm not sure about is why the toString of
>>> Dataset[Long] reports a 'bigint' and printSchema() reports 'long'.
>>> That might be a (cosmetic) bug.
>>>
>>> Second, in Scala 2.12, its SAM support causes calls to reduce() and
>>> other methods, using an Object type, to be ambiguous, because Spark
>>> has long since had Java-friendly overloads that support a SAM
>>> interface for Java callers. Those weren't removed to avoid breakage,
>>> at the cost of having to explicitly tell it what overload you want.
>>> (They are equivalent)
>>>
>>> This is triggered because range() returns java.lang.Longs, not long
>>> primitives (i.e. scala.Long). I assume that is to make it versatile
>>> enough to use in Java too, and because it's hard to write an overload
>>> (would have to rename it)
>>>
>>> But that means you trigger the SAM overload issue.
>>>
>>> Anything you do that makes this a Dataset[scala.Long] resolves it, as
>>> it is no longer ambiguous (Java-friendly Object-friendly overload does
>>> not apply). For example:
>>>
>>> spark.range(10).map(_.toLong).reduce(_+_)
>>>
>>> If you collect(), you still have an Array[java.lang.Long]. But Scala
>>> implicits and conversions make .reduce(_+_) work fine on that; there
>>> is no "Java-friendly" overload in the way.
>>>
>>> Normally all of this just works and you can ignore these differences.
>>> This is a good example of a corner case in which it's inconvenient,
>>> because of the old Java-friendly overloads. This is by design though.
>>>
>>> On Sat, Jun 27, 2020 at 8:29 AM Anwar AliKhan 
>>> wrote:
>>> >
>>> > As you know I have been puzzling over this issue :
>>> > How come spark.range(100).reduce(_+_)
>>> > worked in earlier spark version but not with the most recent versions.
>>> >
>>> > well,
>>> >
>>> > When you first create a dataset, by default the column "id" datatype
>>> is  [BigInt],
>>> > It is a bit like a coin Long on one side and bigint on the other side.
>>> >
>>> > scala> val myrange = spark.range(1,100)
>>> > myrange: org.apache.spark.sql.Dataset[Long] = [id: bigint]
>>> >
>>> > The Spark framework error message after parsing the reduce(_+_) method
>>> conf

Re: When is a Bigint a long and when is a long a long

2020-06-27 Thread Anwar AliKhan
OK Thanks

On Sat, 27 Jun 2020, 17:36 Sean Owen,  wrote:

> It does not return a DataFrame. It returns Dataset[Long].
> You do not need to collect(). See my email.
>
> On Sat, Jun 27, 2020, 11:33 AM Anwar AliKhan 
> wrote:
>
>> So the range function actually returns BigInt (Spark SQL type)
>> and the fact Dataset[Long] and printSchema are displaying (toString())
>> Long instead of BigInt needs looking into.
>>
>> Putting that to one side
>>
>> My issue with using collect() to get around the casting of elements
>> returned
>> by range is,  I read some literature which says the collect() returns all
>> the data to the driver
>> and so can likely cause Out Of memory error.
>>
>> Question:
>> Is it correct that collect() behaves that way and can cause Out of memory
>> error ?
>>
>> Obviously it will be better to use  .map for casting because then the
>> work is being done by workers.
>> spark.range(10).map(_.toLong).reduce(_+_)
>> <http://www.backbutton.co.uk/>
>>
>>
>> On Sat, 27 Jun 2020, 15:42 Sean Owen,  wrote:
>>
>>> There are several confusing things going on here. I think this is part
>>> of the explanation, not 100% sure:
>>>
>>> 'bigint' is the Spark SQL type of an 8-byte long. 'long' is the type
>>> of a JVM primitive. Both are the same, conceptually, but represented
>>> differently internally as they are logically somewhat different ideas.
>>>
>>> The first thing I'm not sure about is why the toString of
>>> Dataset[Long] reports a 'bigint' and printSchema() reports 'long'.
>>> That might be a (cosmetic) bug.
>>>
>>> Second, in Scala 2.12, its SAM support causes calls to reduce() and
>>> other methods, using an Object type, to be ambiguous, because Spark
>>> has long since had Java-friendly overloads that support a SAM
>>> interface for Java callers. Those weren't removed to avoid breakage,
>>> at the cost of having to explicitly tell it what overload you want.
>>> (They are equivalent)
>>>
>>> This is triggered because range() returns java.lang.Longs, not long
>>> primitives (i.e. scala.Long). I assume that is to make it versatile
>>> enough to use in Java too, and because it's hard to write an overload
>>> (would have to rename it)
>>>
>>> But that means you trigger the SAM overload issue.
>>>
>>> Anything you do that makes this a Dataset[scala.Long] resolves it, as
>>> it is no longer ambiguous (Java-friendly Object-friendly overload does
>>> not apply). For example:
>>>
>>> spark.range(10).map(_.toLong).reduce(_+_)
>>>
>>> If you collect(), you still have an Array[java.lang.Long]. But Scala
>>> implicits and conversions make .reduce(_+_) work fine on that; there
>>> is no "Java-friendly" overload in the way.
>>>
>>> Normally all of this just works and you can ignore these differences.
>>> This is a good example of a corner case in which it's inconvenient,
>>> because of the old Java-friendly overloads. This is by design though.
>>>
>>> On Sat, Jun 27, 2020 at 8:29 AM Anwar AliKhan 
>>> wrote:
>>> >
>>> > As you know I have been puzzling over this issue :
>>> > How come spark.range(100).reduce(_+_)
>>> > worked in earlier spark version but not with the most recent versions.
>>> >
>>> > well,
>>> >
>>> > When you first create a dataset, by default the column "id" datatype
>>> is  [BigInt],
>>> > It is a bit like a coin Long on one side and bigint on the other side.
>>> >
>>> > scala> val myrange = spark.range(1,100)
>>> > myrange: org.apache.spark.sql.Dataset[Long] = [id: bigint]
>>> >
>>> > The Spark framework error message after parsing the reduce(_+_) method
>>> confirms this
>>> > and moreover stresses its constraints of expecting data  type long as
>>> parameter argument(s).
>>> >
>>> > scala> myrange.reduce(_+_)
>>> > :26: error: overloaded method value reduce with alternatives:
>>> >   (func:
>>> org.apache.spark.api.java.function.ReduceFunction[java.lang.Long])java.lang.Long
>>> 
>>> >   (func: (java.lang.Long, java.lang.Long) =>
>>> java.lang.Long)java.lang.Long
>>> >  cannot be applied to ((java.lang.Long, java.lang.Long) => scala.Long)
>>> >myrange.reduce(_+_)
>>> >^
>>>

Re: When is a Bigint a long and when is a long a long

2020-06-27 Thread Anwar AliKhan
So the range function actually returns BigInt (a Spark SQL type),
and the fact that Dataset[Long] and printSchema() display (via toString())
Long instead of BigInt needs looking into.

Putting that to one side:

My issue with using collect() to get around the casting of the elements returned
by range is that I have read some literature which says collect() returns all
the data to the driver
and so can likely cause an out-of-memory error.

Question:
Is it correct that collect() behaves that way and can cause an out-of-memory
error?

Obviously it would be better to use .map for the casting, because then the work
is being done by the workers.
spark.range(10).map(_.toLong).reduce(_+_)
<http://www.backbutton.co.uk/>


On Sat, 27 Jun 2020, 15:42 Sean Owen,  wrote:

> There are several confusing things going on here. I think this is part
> of the explanation, not 100% sure:
>
> 'bigint' is the Spark SQL type of an 8-byte long. 'long' is the type
> of a JVM primitive. Both are the same, conceptually, but represented
> differently internally as they are logically somewhat different ideas.
>
> The first thing I'm not sure about is why the toString of
> Dataset[Long] reports a 'bigint' and printSchema() reports 'long'.
> That might be a (cosmetic) bug.
>
> Second, in Scala 2.12, its SAM support causes calls to reduce() and
> other methods, using an Object type, to be ambiguous, because Spark
> has long since had Java-friendly overloads that support a SAM
> interface for Java callers. Those weren't removed to avoid breakage,
> at the cost of having to explicitly tell it what overload you want.
> (They are equivalent)
>
> This is triggered because range() returns java.lang.Longs, not long
> primitives (i.e. scala.Long). I assume that is to make it versatile
> enough to use in Java too, and because it's hard to write an overload
> (would have to rename it)
>
> But that means you trigger the SAM overload issue.
>
> Anything you do that makes this a Dataset[scala.Long] resolves it, as
> it is no longer ambiguous (Java-friendly Object-friendly overload does
> not apply). For example:
>
> spark.range(10).map(_.toLong).reduce(_+_)
>
> If you collect(), you still have an Array[java.lang.Long]. But Scala
> implicits and conversions make .reduce(_+_) work fine on that; there
> is no "Java-friendly" overload in the way.
>
> Normally all of this just works and you can ignore these differences.
> This is a good example of a corner case in which it's inconvenient,
> because of the old Java-friendly overloads. This is by design though.
>
> On Sat, Jun 27, 2020 at 8:29 AM Anwar AliKhan 
> wrote:
> >
> > As you know I have been puzzling over this issue :
> > How come spark.range(100).reduce(_+_)
> > worked in earlier spark version but not with the most recent versions.
> >
> > well,
> >
> > When you first create a dataset, by default the column "id" datatype is
> [BigInt],
> > It is a bit like a coin Long on one side and bigint on the other side.
> >
> > scala> val myrange = spark.range(1,100)
> > myrange: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> >
> > The Spark framework error message after parsing the reduce(_+_) method
> confirms this
> > and moreover stresses its constraints of expecting data  type long as
> parameter argument(s).
> >
> > scala> myrange.reduce(_+_)
> > :26: error: overloaded method value reduce with alternatives:
> >   (func:
> org.apache.spark.api.java.function.ReduceFunction[java.lang.Long])java.lang.Long
> 
> >   (func: (java.lang.Long, java.lang.Long) =>
> java.lang.Long)java.lang.Long
> >  cannot be applied to ((java.lang.Long, java.lang.Long) => scala.Long)
> >myrange.reduce(_+_)
> >^
> >
> > But if you ask the printSchema method it disagrees with both of the
> above and says the column "id" data is Long.
> > scala> range100.printSchema()
> > root
> >  |-- id: long (nullable = false)
> >
> > If I ask the collect() method, the collect() method  agrees with
> printSchema() that the datatype of column "id" is  Long and not BigInt.
> >
> > scala> range100.collect()
> > res10: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
> 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
> 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
> 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
> 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
> 90, 91, 92, 93, 94, 95, 96, 97, 98, 99)
> >
> > To settle the dispute between the methods and get the collect() to "show

When is a Bigint a long and when is a long a long

2020-06-27 Thread Anwar AliKhan
As you know, I have been puzzling over this issue:
how come spark.range(100).reduce(_+_)
worked in earlier Spark versions but not with the most recent versions?

Well,

when you first create a dataset, by default the column "id" datatype is
[BigInt].
It is a bit like a coin: Long on one side and bigint on the other.

scala> val myrange = spark.range(1,100)
myrange: org.apache.spark.sql.Dataset[Long] = [id: bigint]

The Spark framework error message after parsing the reduce(_+_) method
confirms this,
and moreover stresses its constraint of expecting data type Long as the
parameter argument(s).

scala> myrange.reduce(_+_)
:26: error: overloaded method value reduce with alternatives:
  (func:
org.apache.spark.api.java.function.ReduceFunction[java.lang.Long])java.lang.Long

  (func: (java.lang.Long, java.lang.Long) => java.lang.Long)java.lang.Long
 cannot be applied to ((java.lang.Long, java.lang.Long) => scala.Long)
   myrange.reduce(_+_)
   ^


But if you ask the printSchema method, it disagrees with both of the above
and says the column "id" data is long.

scala> range100.printSchema()
root
 |-- id: long (nullable = false)


If I ask the collect() method, it agrees with
printSchema() that the datatype of column "id" is Long and not BigInt.

scala> range100.collect()
res10: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98, 99)

To settle the dispute between the methods and get collect() to "show
me the money", I called collect() and passed its return value to
reduce(_+_).

"Here is the money":
scala> range100.collect().reduce(_+_)
res11: Long = 4950

The collect() and printSchema methods could be implying there is no
difference between a Long and a BigInt.

Question: These return type differences, are they by design or an
oversight/bug?
Question: Why the change from the earlier version to the later version?
Question: Will you be updating the reduce(_+_) method?


When it comes to creating a dataset using toDS there is no dispute:

all the methods agree that it is neither a BigInt nor a Long but an int
(or "integer").

scala> val dataset = Seq(1, 2, 3).toDS()
dataset: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> dataset.collect()
res29: Array[Int] = Array(1, 2, 3)

scala> dataset.printSchema()
root
 |-- value: integer (nullable = false)

scala> dataset.show()
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
+-----+

scala> dataset.reduce(_+_)
res7: Int = 6



Re: Where are all the jars gone ?

2020-06-25 Thread Anwar AliKhan
I know I can arrive at the same result with this code,

  val range100 = spark.range(1,101).agg((sum('id) as "sum")).first.get(0)
  println(f"sum of range100 = $range100")

so I am not stuck;
I was just curious why the code breaks when using the currently linked
libraries.

spark.range(1,101).reduce(_+_)

spark-submit test

/opt/spark/spark-submit

spark.range(1,101).reduce(_+_)
:24: error: overloaded method value reduce with alternatives:
  (func:
org.apache.spark.api.java.function.ReduceFunction[java.lang.Long])java.lang.Long

  (func: (java.lang.Long, java.lang.Long) => java.lang.Long)java.lang.Long
 cannot be applied to ((java.lang.Long, java.lang.Long) => scala.Long)
   spark.range(1,101).reduce(_+_)
<http://www.backbutton.co.uk/>


On Wed, 24 Jun 2020, 19:54 Anwar AliKhan,  wrote:

>
> I am using the method describe on this page for Scala development in
> eclipse.
>
> https://data-flair.training/blogs/create-spark-scala-project/
>
>
> in the middle of the page you will find
>
>
> “You will see lots of errors due to missing libraries.
> viii. Add Spark Libraries”
>
>
> Now that I have my own build I will be pointing to the jars (spark
> libraries)
>
> in directory /opt/spark/assembly/target/scala-2.12/jars
>
>
> This way I know exactly which jar libraries I am using to remove the
> aforementioned errors.
>
>
> At the same time I am trying to setup a template environment as shown here
>
>
> https://medium.com/@faizanahemad/apache-spark-setup-with-gradle-scala-and-intellij-2eeb9f30c02a
>
>
> so that I can have variables sc and spark in the eclipse editor same you
> would have spark, sc variables in the spark-shell.
>
>
> I used the word trying because the following code is broken
>
>
> spark.range(1,101).reduce(_ + _)
>
> with latest spark.
>
>
> If I use the gradle method as described then the code does work because
> it is pulling the libraries from maven repository as stipulated in
> gradle.properties
> <https://github.com/faizanahemad/spark-gradle-template/blob/master/gradle.properties>
> .
>
>
> In my previous post I forgot that with a Maven pom.xml you can actually
> specify the version number of the jar you want to pull from the Maven repository using the mvn
> clean package command.
> clean package *command.
>
>
> So even if I use Maven with Eclipse, any new libraries uploaded to the
> Maven repository by developers will have recent version numbers, so they will
> not affect my project.
>
> Can you please tell me why the code spark.range(1,101).reduce(_ + _) is
> broken with latest spark ?
>
>
> <http://www.backbutton.co.uk/>
>
>
> On Wed, 24 Jun 2020, 17:07 Jeff Evans, 
> wrote:
>
>> If I'm understanding this correctly, you are building Spark from source
>> and using the built artifacts (jars) in some other project.  Correct?  If
>> so, then why are you concerning yourself with the directory structure that
>> Spark, internally, uses when building its artifacts?  It should be a black
>> box to your application, entirely.  You would pick the profiles (ex: Scala
>> version, Hadoop version, etc.) you need, then the install phase of Maven
>> will take care of building the jars and putting them in your local Maven
>> repo.  After that, you can resolve them from your other project seamlessly
>> (simply by declaring the org/artifact/version).
>>
>> Maven artifacts are immutable, at least released versions in Maven
>> central.  If "someone" (unclear who you are talking about) is "swapping
>> out" jars in a Maven repo then they're doing something extremely strange
>> and broken, unless they're simply replacing snapshot versions, which is a 
>> different
>> beast entirely
>> <https://maven.apache.org/guides/getting-started/index.html#What_is_a_SNAPSHOT_version>
>> .
>>
>> On Wed, Jun 24, 2020 at 10:39 AM Anwar AliKhan 
>> wrote:
>>
>>> THANKS
>>>
>>>
>>> It appears the directory containing the jars have been switched from
>>> download version to source version.
>>>
>>> In the download version it is just below parent directory called jars.
>>> level 1.
>>>
>>> In the git source version it is  4 levels down in the directory
>>>  /spark/assembly/target/scala-2.12/jars
>>>
>>> The issue I have with using maven is that the linking libraries can be
>>> changed at maven repository without my knowledge .
>>> So if an application compiled and worked previously could just break.
>>>
>>> It is not like when the developers make a change to the link libraries
>>> they run it by me first ,  

Suggested Amendment to ./dev/make-distribution.sh

2020-06-25 Thread Anwar AliKhan
May I suggest amending your ./dev/make-distribution.sh
to include a check for whether these two previously mentioned packages are
installed and, if not, to install them
as part of the build process. The build process time will increase if the
packages are not installed; a long build process is a normal expectation,
especially for a project that has been going for 10 years.

A message saying that these packages are needed but not installed, and to
please wait while the packages are being installed, would be helpful to the
user experience.






On Wed, 24 Jun 2020, 16:21 Anwar AliKhan,  wrote:

> THANKS !
>
>
> It appears that was the last dependency for the build.
> sudo apt-get install -y r-cran-e1071.
>
> Shout out to  ZOOM
> https://zoomadmin.com/HowToInstall/UbuntuPackage/r-cran-e1071  again
> like they say it was "It’s Super Easy! "
>
> package  knitr was the previous missing dependency which I was able to
> work out from build error message
> sudo apt install knitr
>
> 'e1071' doesn't appear to be a package name or namespace.
> package 'e1071' seems to be a formidable package for machine learning
> algorithms.
>
>
> *** installing help indices
> ** building package indices
> ** installing vignettes
> ** testing if installed package can be loaded from temporary location
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation
> path
> * DONE (SparkR)
> /opt/spark/R
> + popd
> + mkdir /opt/spark/dist/conf
> + cp /opt/spark/conf/fairscheduler.xml.template
> /opt/spark/conf/log4j.properties.template
> /opt/spark/conf/metrics.properties.template /opt/spark/conf/slaves.template
> /opt/spark/conf/spark-defaults.conf.template
> /opt/spark/conf/spark-env.sh.template /opt/spark/dist/conf
> + cp /opt/spark/README.md /opt/spark/dist
> + cp -r /opt/spark/bin /opt/spark/dist
> + cp -r /opt/spark/python /opt/spark/dist
> + '[' true == true ']'
> + rm -f /opt/spark/dist/python/dist/pyspark-3.1.0.dev0.tar.gz
> + cp -r /opt/spark/sbin /opt/spark/dist
> + '[' -d /opt/spark/R/lib/SparkR ']'
> + mkdir -p /opt/spark/dist/R/lib
> + cp -r /opt/spark/R/lib/SparkR /opt/spark/dist/R/lib
> + cp /opt/spark/R/lib/sparkr.zip /opt/spark/dist/R/lib
> + '[' true == true ']'
> + TARDIR_NAME=spark-3.1.0-SNAPSHOT-bin-custom-spark
> + TARDIR=/opt/spark/spark-3.1.0-SNAPSHOT-bin-custom-spark
> + rm -rf /opt/spark/spark-3.1.0-SNAPSHOT-bin-custom-spark
> + cp -r /opt/spark/dist /opt/spark/spark-3.1.0-SNAPSHOT-bin-custom-spark
> + tar czf spark-3.1.0-SNAPSHOT-bin-custom-spark.tgz -C /opt/spark
> spark-3.1.0-SNAPSHOT-bin-custom-spark
> + rm -rf /opt/spark/spark-3.1.0-SNAPSHOT-bin-custom-spark
> <http://www.backbutton.co.uk/>
>
>
> On Wed, 24 Jun 2020, 11:07 Hyukjin Kwon,  wrote:
>
>> Looks like you haven't installed the 'e1071' package.
>>
>> 2020년 6월 24일 (수) 오후 6:49, Anwar AliKhan 님이 작성:
>>
>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr
>>> -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>> <http://www.backbutton.co.uk/>
>>>
>>>
>>> Minor error: the SparkR test failed. I don't use R, so it doesn't affect me.
>>>
>>> ***installing help indices
>>> ** building package indices
>>> ** installing vignettes
>>> ** testing if installed package can be loaded from temporary location
>>> ** testing if installed package can be loaded from final location
>>> ** testing if installed package keeps a record of temporary installation
>>> path
>>> * DONE (SparkR)
>>> ++ cd /opt/spark/R/lib
>>> ++ jar cfM /opt/spark/R/lib/sparkr.zip SparkR
>>> ++ popd
>>> ++ cd /opt/spark/R/..
>>> ++ pwd
>>> + SPARK_HOME=/opt/spark
>>> + . /opt/spark/bin/load-spark-env.sh
>>> ++ '[' -z /opt/spark ']'
>>> ++ SPARK_ENV_SH=spark-env.sh
>>> ++ '[' -z '' ']'
>>> ++ export SPARK_ENV_LOADED=1
>>> ++ SPARK_ENV_LOADED=1
>>> ++ export SPARK_CONF_DIR=/opt/spark/conf
>>> ++ SPARK_CONF_DIR=/opt/spark/conf
>>> ++ SPARK_ENV_SH=/opt/spark/conf/spark-env.sh
>>> ++ [[ -f /opt/spark/conf/spark-env.sh ]]
>>> ++ set -a
>>> ++ . /opt/spark/conf/spark-env.sh
>>> +++ export SPARK_LOCAL_IP=192.168.0.786
>>> +++ SPARK_LOCAL_IP=192.168.0.786
>>> ++ set +a
>>> ++ export SPARK_SCALA_VERSION=2.12
>>> ++ SPARK_SCALA_VERSION=2.12
>>> + '[' -f /opt/spark/RELEASE ']'
>>> + SPARK_JARS_DIR=/opt/spark/assembly/target/scala-2.12/jars
>>> + '[' -d /opt/spark/assembly/target/scala-2.12/jars ']'
>>> + SPARK_HOME=/opt/spark
>&

Re: Where are all the jars gone ?

2020-06-24 Thread Anwar AliKhan
I am using the method described on this page for Scala development in
Eclipse.

https://data-flair.training/blogs/create-spark-scala-project/


in the middle of the page you will find


“You will see lots of errors due to missing libraries.
viii. Add Spark Libraries”


Now that I have my own build I will be pointing to the jars (spark
libraries)

in directory /opt/spark/assembly/target/scala-2.12/jars


This way I know exactly which jar libraries I am using to remove the
aforementioned errors.


At the same time I am trying to setup a template environment as shown here

https://medium.com/@faizanahemad/apache-spark-setup-with-gradle-scala-and-intellij-2eeb9f30c02a


so that I can have the variables sc and spark in the Eclipse editor, the same
as you would have the spark and sc variables in the spark-shell.
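
A minimal sketch of the kind of setup the template aims for, assuming plain
local mode (the app name is a hypothetical placeholder): the spark and sc
handles are created explicitly, mirroring what spark-shell provides
automatically.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ide-sandbox")      // hypothetical name
  .master("local[*]")          // local prototyping inside the IDE
  .getOrCreate()
val sc = spark.sparkContext    // the same sc handle the spark-shell exposes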


I used the word "trying" because the following code is broken


spark.range(1,101).reduce(_ + _)

with the latest Spark.


If I use the gradle method as described then the code does work because it
is pulling the libraries from maven repository as stipulated in
gradle.properties
<https://github.com/faizanahemad/spark-gradle-template/blob/master/gradle.properties>
.


In my previous post I forgot that with a Maven pom.xml you can actually specify
the version number of the jar you want to pull from the Maven repository using
the mvn clean package command.


So even if I use Maven with Eclipse, any new libraries uploaded to the
Maven repository by developers will have recent version numbers, so they will
not affect my project.

Can you please tell me why the code spark.range(1,101).reduce(_ + _) is
broken with the latest Spark?


<http://www.backbutton.co.uk/>


On Wed, 24 Jun 2020, 17:07 Jeff Evans, 
wrote:

> If I'm understanding this correctly, you are building Spark from source
> and using the built artifacts (jars) in some other project.  Correct?  If
> so, then why are you concerning yourself with the directory structure that
> Spark, internally, uses when building its artifacts?  It should be a black
> box to your application, entirely.  You would pick the profiles (ex: Scala
> version, Hadoop version, etc.) you need, then the install phase of Maven
> will take care of building the jars and putting them in your local Maven
> repo.  After that, you can resolve them from your other project seamlessly
> (simply by declaring the org/artifact/version).
>
> Maven artifacts are immutable, at least released versions in Maven
> central.  If "someone" (unclear who you are talking about) is "swapping
> out" jars in a Maven repo then they're doing something extremely strange
> and broken, unless they're simply replacing snapshot versions, which is a 
> different
> beast entirely
> <https://maven.apache.org/guides/getting-started/index.html#What_is_a_SNAPSHOT_version>
> .
>
> On Wed, Jun 24, 2020 at 10:39 AM Anwar AliKhan 
> wrote:
>
>> THANKS
>>
>>
>> It appears the directory containing the jars have been switched from
>> download version to source version.
>>
>> In the download version it is just below parent directory called jars.
>> level 1.
>>
>> In the git source version it is  4 levels down in the directory
>>  /spark/assembly/target/scala-2.12/jars
>>
>> The issue I have with using maven is that the linking libraries can be
>> changed at maven repository without my knowledge .
>> So if an application compiled and worked previously could just break.
>>
>> It is not like when the developers make a change to the link libraries
>> they run it by me first ,  they just upload it to maven repository with
>> out asking me if their change
>> Is going to impact my app.
>>
>>
>>
>>
>>
>>
>> On Wed, 24 Jun 2020, 16:07 ArtemisDev,  wrote:
>>
>>> If you are using Maven to manage your jar dependencies, the jar files
>>> are located in the maven repository on your home directory.  It is usually
>>> in the .m2 directory.
>>>
>>> Hope this helps.
>>>
>>> -ND
>>> On 6/23/20 3:21 PM, Anwar AliKhan wrote:
>>>
>>> Hi,
>>>
>>> I prefer to do most of my projects in Python and for that I use Jupyter.
>>> I have been downloading the compiled version of spark.
>>>
>>> I do not normally like the source code version because the build process
>>> makes me nervous.
>>> You know with lines of stuff   scrolling up the screen.
>>> What am I going to do if a build fails? I am a user!
>>>
>>> I decided to risk it and it was only one  mvn command to build. (45
>>> minutes later)
>>> Everything is great. Success.
>>>
>>> I removed all jvms except jdk8 for compilation.
>>>
>>> I used jdk

Re: Where are all the jars gone ?

2020-06-24 Thread Anwar AliKhan
THANKS


It appears the directory containing the jars has been switched between the
download version and the source version.

In the download version it is just below the parent directory, in a
directory called jars (level 1).

In the git source version it is 4 levels down, in the directory
 /spark/assembly/target/scala-2.12/jars

The issue I have with using Maven is that the linked libraries can be
changed in the Maven repository without my knowledge,
so an application that compiled and worked previously could just break.

It is not as if the developers run a change to the linked libraries by me
first; they just upload it to the Maven repository without asking me whether
their change is going to impact my app.






On Wed, 24 Jun 2020, 16:07 ArtemisDev,  wrote:

> If you are using Maven to manage your jar dependencies, the jar files are
> located in the maven repository on your home directory.  It is usually in
> the .m2 directory.
>
> Hope this helps.
>
> -ND
> On 6/23/20 3:21 PM, Anwar AliKhan wrote:
>
> Hi,
>
> I prefer to do most of my projects in Python and for that I use Jupyter.
> I have been downloading the compiled version of spark.
>
> I do not normally like the source code version because the build process
> makes me nervous.
> You know with lines of stuff   scrolling up the screen.
> What am I going to do if a build fails? I am a user!
>
> I decided to risk it and it was only one  mvn command to build. (45
> minutes later)
> Everything is great. Success.
>
> I removed all jvms except jdk8 for compilation.
>
> I used jdk8 so I know which libraries were linked in the build process.
> I also used my local version of maven. Not the apt install version .
>
> I used jdk8 because if you go this scala site.
>
> http://scala-ide.org/download/sdk.html. they say requirement  jdk8 for IDE
>  even for scala12.
> They don't say JDK 8 or higher ,  just jdk8.
>
> So anyway  once in a while I  do spark projects in scala with eclipse.
>
> For that I don't use maven or anything. I prefer to make use of build path
> And external jars. This way I know exactly which libraries I am linking to.
>
> creating a jar in eclipse is straight forward for spark_submit.
>
>
> Anyway  as you can see (below) I am pointing jupyter to find
> spark.init('opt/spark').
> That's OK everything is fine.
>
> With the compiled version of spark there is a jar directory which I have
> been using in eclipse.
>
>
>
> With my own compiled from source version there is no jar directory.
>
>
> Where are all the jars gone  ?.
>
>
>
> I am not sure how findspark.init('/opt/spark') is locating the libraries
> unless it is finding them from
> Anaconda.
>
>
> import findspark
> findspark.init('/opt/spark')
> from pyspark.sql import SparkSession
> spark = SparkSession \
> .builder \
> .appName('Titanic Data') \
> .getOrCreate()
>
>


Re: Error: Vignette re-building failed. Execution halted

2020-06-24 Thread Anwar AliKhan
THANKS !


It appears that was the last dependency for the build.
sudo apt-get install -y r-cran-e1071.

Shout out to  ZOOM
https://zoomadmin.com/HowToInstall/UbuntuPackage/r-cran-e1071  again
like they say it was "It’s Super Easy! "

The knitr package was the previous missing dependency, which I was able to work
out from the build error message:
sudo apt install knitr

'e1071' doesn't appear to be a package name or namespace.
package 'e1071' seems to be a formidable package for machine learning
algorithms.


*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation
path
* DONE (SparkR)
/opt/spark/R
+ popd
+ mkdir /opt/spark/dist/conf
+ cp /opt/spark/conf/fairscheduler.xml.template
/opt/spark/conf/log4j.properties.template
/opt/spark/conf/metrics.properties.template /opt/spark/conf/slaves.template
/opt/spark/conf/spark-defaults.conf.template
/opt/spark/conf/spark-env.sh.template /opt/spark/dist/conf
+ cp /opt/spark/README.md /opt/spark/dist
+ cp -r /opt/spark/bin /opt/spark/dist
+ cp -r /opt/spark/python /opt/spark/dist
+ '[' true == true ']'
+ rm -f /opt/spark/dist/python/dist/pyspark-3.1.0.dev0.tar.gz
+ cp -r /opt/spark/sbin /opt/spark/dist
+ '[' -d /opt/spark/R/lib/SparkR ']'
+ mkdir -p /opt/spark/dist/R/lib
+ cp -r /opt/spark/R/lib/SparkR /opt/spark/dist/R/lib
+ cp /opt/spark/R/lib/sparkr.zip /opt/spark/dist/R/lib
+ '[' true == true ']'
+ TARDIR_NAME=spark-3.1.0-SNAPSHOT-bin-custom-spark
+ TARDIR=/opt/spark/spark-3.1.0-SNAPSHOT-bin-custom-spark
+ rm -rf /opt/spark/spark-3.1.0-SNAPSHOT-bin-custom-spark
+ cp -r /opt/spark/dist /opt/spark/spark-3.1.0-SNAPSHOT-bin-custom-spark
+ tar czf spark-3.1.0-SNAPSHOT-bin-custom-spark.tgz -C /opt/spark
spark-3.1.0-SNAPSHOT-bin-custom-spark
+ rm -rf /opt/spark/spark-3.1.0-SNAPSHOT-bin-custom-spark
<http://www.backbutton.co.uk/>


On Wed, 24 Jun 2020, 11:07 Hyukjin Kwon,  wrote:

> Looks like you haven't installed the 'e1071' package.
>
> 2020년 6월 24일 (수) 오후 6:49, Anwar AliKhan 님이 작성:
>
>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr
>> -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>> <http://www.backbutton.co.uk/>
>>
>>
>> Minor error: the SparkR test failed. I don't use R, so it doesn't affect me.
>>
>> ***installing help indices
>> ** building package indices
>> ** installing vignettes
>> ** testing if installed package can be loaded from temporary location
>> ** testing if installed package can be loaded from final location
>> ** testing if installed package keeps a record of temporary installation
>> path
>> * DONE (SparkR)
>> ++ cd /opt/spark/R/lib
>> ++ jar cfM /opt/spark/R/lib/sparkr.zip SparkR
>> ++ popd
>> ++ cd /opt/spark/R/..
>> ++ pwd
>> + SPARK_HOME=/opt/spark
>> + . /opt/spark/bin/load-spark-env.sh
>> ++ '[' -z /opt/spark ']'
>> ++ SPARK_ENV_SH=spark-env.sh
>> ++ '[' -z '' ']'
>> ++ export SPARK_ENV_LOADED=1
>> ++ SPARK_ENV_LOADED=1
>> ++ export SPARK_CONF_DIR=/opt/spark/conf
>> ++ SPARK_CONF_DIR=/opt/spark/conf
>> ++ SPARK_ENV_SH=/opt/spark/conf/spark-env.sh
>> ++ [[ -f /opt/spark/conf/spark-env.sh ]]
>> ++ set -a
>> ++ . /opt/spark/conf/spark-env.sh
>> +++ export SPARK_LOCAL_IP=192.168.0.786
>> +++ SPARK_LOCAL_IP=192.168.0.786
>> ++ set +a
>> ++ export SPARK_SCALA_VERSION=2.12
>> ++ SPARK_SCALA_VERSION=2.12
>> + '[' -f /opt/spark/RELEASE ']'
>> + SPARK_JARS_DIR=/opt/spark/assembly/target/scala-2.12/jars
>> + '[' -d /opt/spark/assembly/target/scala-2.12/jars ']'
>> + SPARK_HOME=/opt/spark
>> + /usr/bin/R CMD build /opt/spark/R/pkg
>> * checking for file ‘/opt/spark/R/pkg/DESCRIPTION’ ... OK
>> * preparing ‘SparkR’:
>> * checking DESCRIPTION meta-information ... OK
>> * installing the package to build vignettes
>> * creating vignettes ... ERROR
>> --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown
>>
>> Attaching package: 'SparkR'
>>
>> The following objects are masked from 'package:stats':
>>
>> cov, filter, lag, na.omit, predict, sd, var, window
>>
>> The following objects are masked from 'package:base':
>>
>> as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
>> rank, rbind, sample, startsWith, subset, summary, transform, union
>>
>> Picked up _JAVA_OPTIONS: -XX:-UsePerfData
>> Picked up _JAVA_OPTIONS: -XX:-UsePerfData
>> 20/06/24 10:23:54 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where

Error: Vignette re-building failed. Execution halted

2020-06-24 Thread Anwar AliKhan
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr
-Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes



Minor error: the SparkR test failed. I don't use R, so it doesn't affect me.

***installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation
path
* DONE (SparkR)
++ cd /opt/spark/R/lib
++ jar cfM /opt/spark/R/lib/sparkr.zip SparkR
++ popd
++ cd /opt/spark/R/..
++ pwd
+ SPARK_HOME=/opt/spark
+ . /opt/spark/bin/load-spark-env.sh
++ '[' -z /opt/spark ']'
++ SPARK_ENV_SH=spark-env.sh
++ '[' -z '' ']'
++ export SPARK_ENV_LOADED=1
++ SPARK_ENV_LOADED=1
++ export SPARK_CONF_DIR=/opt/spark/conf
++ SPARK_CONF_DIR=/opt/spark/conf
++ SPARK_ENV_SH=/opt/spark/conf/spark-env.sh
++ [[ -f /opt/spark/conf/spark-env.sh ]]
++ set -a
++ . /opt/spark/conf/spark-env.sh
+++ export SPARK_LOCAL_IP=192.168.0.786
+++ SPARK_LOCAL_IP=192.168.0.786
++ set +a
++ export SPARK_SCALA_VERSION=2.12
++ SPARK_SCALA_VERSION=2.12
+ '[' -f /opt/spark/RELEASE ']'
+ SPARK_JARS_DIR=/opt/spark/assembly/target/scala-2.12/jars
+ '[' -d /opt/spark/assembly/target/scala-2.12/jars ']'
+ SPARK_HOME=/opt/spark
+ /usr/bin/R CMD build /opt/spark/R/pkg
* checking for file ‘/opt/spark/R/pkg/DESCRIPTION’ ... OK
* preparing ‘SparkR’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown

Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

Picked up _JAVA_OPTIONS: -XX:-UsePerfData
Picked up _JAVA_OPTIONS: -XX:-UsePerfData
20/06/24 10:23:54 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).

[Stage 0:>  (0 + 1)
/ 1]




[Stage 9:=>  (88 + 1) /
100]




[Stage 13:===>  (147 + 1) /
200]



20/06/24 10:24:04 WARN Instrumentation: [79237008] regParam is zero, which
might cause numerical instability and overfitting.
20/06/24 10:24:04 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
20/06/24 10:24:04 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS
20/06/24 10:24:04 WARN LAPACK: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemLAPACK
20/06/24 10:24:04 WARN LAPACK: Failed to load implementation from:
com.github.fommil.netlib.NativeRefLAPACK
20/06/24 10:24:09 WARN package: Truncated the string representation of a
plan since it was too large. This behavior can be adjusted by setting
'spark.sql.debug.maxToStringFields'.

[Stage 67:>  (45 + 1) /
200]

[Stage 67:=> (62 + 1) /
200]

[Stage 67:==>(80 + 1) /
200]

[Stage 67:==>(98 + 1) /
200]

[Stage 67:==>   (114 + 1) /
200]

[Stage 67:===>  (132 + 1) /
200]

[Stage 67:===>  (148 + 1) /
200]

[Stage 67:> (166 + 1) /
200]

[Stage 67:=>(184 + 1) /
200]




[Stage 69:>  (44 + 1) /
200]

[Stage 69:>  (61 + 1) /
200]

[Stage 69:=> (79 + 1) /
200]

[Stage 69:==>(97 + 1) /
200]

[Stage 69:===>  (116 + 1) /
200]

[Stage 69:> (134 + 1) /
200]

[Stage 69:=>(152 + 1) /
200]

[Stage 69:=>(169 + 1) /
200]

[Stage 69:==>   (187 + 1) /
200]




[Stage 70:> (0 + 1)
/ 5]
20/06/24 10:24:14 ERROR Executor: Exception in task 0.0 in stage 70.0 (TID
1148)

Found jars in /assembly/target/scala-2.12/jars

2020-06-23 Thread Anwar AliKhan



Where are all the jars gone ?

2020-06-23 Thread Anwar AliKhan
Hi,

I prefer to do most of my projects in Python and for that I use Jupyter.
I have been downloading the compiled version of spark.

I do not normally like the source code version because the build process
makes me nervous.
You know, with lines of stuff scrolling up the screen.
What am I going to do if a build fails? I am a user!

I decided to risk it and it was only one  mvn command to build. (45 minutes
later)
Everything is great. Success.

I removed all JVMs except JDK 8 for compilation.

I used JDK 8 so I know which libraries were linked in the build process.
I also used my local version of Maven, not the apt-installed version.

I used JDK 8 because if you go to this Scala site,

http://scala-ide.org/download/sdk.html, they say the requirement is JDK 8
for the IDE, even for Scala 2.12.
They don't say JDK 8 or higher, just JDK 8.

So anyway, once in a while I do Spark projects in Scala with Eclipse.

For that I don't use Maven or anything. I prefer to make use of the build path
and external jars. This way I know exactly which libraries I am linking to.

Creating a jar in Eclipse is straightforward for spark-submit.
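
A minimal sketch of the kind of self-contained object I package as a jar for
spark-submit (the object name and jar file name are hypothetical, not from an
actual project):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object RangeSum {
  def main(args: Array[String]): Unit = {
    // The master is supplied by spark-submit (e.g. --master local[*]).
    val spark = SparkSession.builder().appName("RangeSum").getOrCreate()
    val total = spark.range(1, 101).agg(sum("id")).first().getLong(0)
    println(s"sum of 1..100 = $total")   // 5050
    spark.stop()
  }
}

Run with something like: spark-submit --class RangeSum --master "local[*]" rangesum.jar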


Anyway, as you can see (below), I am pointing Jupyter to Spark with
findspark.init('/opt/spark').
That's OK, everything is fine.

With the pre-compiled version of Spark there is a jars directory, which I
have been using in Eclipse.

With my own compiled-from-source version there is no jars directory.

Where are all the jars gone?



I am not sure how findspark.init('/opt/spark') is locating the libraries
unless it is finding them from
Anaconda.


import findspark
findspark.init('/opt/spark')
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName('Titanic Data') \
.getOrCreate()



Re: Hey good looking toPandas () error stack

2020-06-21 Thread Anwar AliKhan
The only change I am making is the Spark directory name.
It keeps failing in this same cell: df.toPandas()


findspark.init('/home/spark-2.4.6-bin-hadoop2.7')  FAIL

findspark.init('/home/spark-3.0.0-bin-hadoop2.7')  PASS





On Sun, 21 Jun 2020, 19:51 randy clinton,  wrote:

> You can see from the GitHub history for "toPandas()" that the function has
> been in the code for 5 years.
>
> https://github.com/apache/spark/blame/a075cd5b700f88ef447b559c6411518136558d78/python/pyspark/sql/dataframe.py#L923
>
> When I google IllegalArgumentException: 'Unsupported class file major
> version 55'
>
> I see posts about the Java version being used. Are you sure your configs
> are right?
>
> https://stackoverflow.com/questions/53583199/pyspark-error-unsupported-class-file-major-version
>
> On Sat, Jun 20, 2020 at 6:17 AM Anwar AliKhan 
> wrote:
>
>>
>> Two versions of Spark running against same code
>>
>>
>> https://towardsdatascience.com/your-first-apache-spark-ml-model-d2bb82b599dd
>>
>> version spark-2.4.6-bin-hadoop2.7 is producing error for toPandas(). See
>> error stack below
>>
>> Jupyter Notebook
>>
>> import findspark
>>
>> findspark.init('/home/spark-3.0.0-bin-hadoop2.7')
>>
>> cell "spark"
>>
>> cell output
>>
>> SparkSession - in-memory
>>
>> SparkContext
>>
>> Spark UI
>>
>> Version
>>
>> v3.0.0
>>
>> Master
>>
>> local[*]
>>
>> AppName
>>
>> Titanic Data
>>
>>
>> import findspark
>>
>> findspark.init('/home/spark-2.4.6-bin-hadoop2.7')
>>
>> cell  "spark"
>>
>>
>>
>> cell output
>>
>> SparkSession - in-memory
>>
>> SparkContext
>>
>> Spark UI
>>
>> Version
>>
>> v2.4.6
>>
>> Master
>>
>> local[*]
>>
>> AppName
>>
>> Titanic Data
>>
>> cell "df.show(5)"
>>
>>
>> +---++--++--+---+-+-++---+-++
>>
>> |PassengerId|Survived|Pclass|Name|
>> Sex|Age|SibSp|Parch|  Ticket|   Fare|Cabin|Embarked|
>>
>>
>> +---++--++--+---+-+-++---+-++
>>
>> |  1|   0| 3|Braund, Mr. Owen ...|  male| 22|1|0|
>>   A/5 21171|   7.25| null|   S|
>>
>> |  2|   1| 1|Cumings, Mrs. Joh...|female| 38|1|
>> 0|PC 17599|71.2833|  C85|   C|
>>
>> |  3|   1| 3|Heikkinen, Miss. ...|female| 26|0|
>> 0|STON/O2. 3101282|  7.925| null|   S|
>>
>> |  4|   1| 1|Futrelle, Mrs. Ja...|female| 35|1|
>> 0|  113803|   53.1| C123|   S|
>>
>> |  5|   0| 3|Allen, Mr. Willia...|  male| 35|0|
>> 0|  373450|   8.05| null|   S|
>>
>>
>> +---++--++--+---+-+-++---+-++
>>
>> only showing top 5 rows
>>
>> cell "df.toPandas()"
>>
>> cell output
>>
>>
>> ---
>>
>> Py4JJavaError Traceback (most recent call
>> last)
>>
>> /home/spark-2.4.6-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a,
>> **kw)
>>
>>  62 try:
>>
>> ---> 63 return f(*a, **kw)
>>
>>  64 except py4j.protocol.Py4JJavaError as e:
>>
>> /home/spark-2.4.6-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py
>> in get_return_value(answer, gateway_client, target_id, name)
>>
>> 327 "An error occurred while calling
>> {0}{1}{2}.\n".
>>
>> --> 328 format(target_id, ".", name), value)
>>
>> 329 else:
>>
>> Py4JJavaError: An error occurred while calling o33.collectToPython.
>>
>> : java.lang.IllegalArgumentException: Unsupported class file major
>> version 55
>>
>> at org.apache.xbean.asm6.ClassReader.(ClassReader.java:166)
>>
>> at org.apache.xbean.asm6.ClassReader.(ClassReader.java:148)
>>
>> at org.apache.xbean.asm6.ClassReader.(ClassReader.java:136)
>>
>> at org.apache.xbean.asm6.ClassReader.(ClassReader.java:237)
>>
>>

Re: Hey good looking toPandas () error stack

2020-06-20 Thread Anwar AliKhan
pply(RDD.scala:990)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)

at org.apache.spark.rdd.RDD.collect(RDD.scala:989)

at
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)

at
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3263)

at
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3260)

at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)

at
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)

at
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)

at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)

at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)

at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3260)

at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)

at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.base/java.lang.reflect.Method.invoke(Method.java:566)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

at py4j.Gateway.invoke(Gateway.java:282)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:238)

at java.base/java.lang.Thread.run(Thread.java:834)


During handling of the above exception, another exception occurred:

IllegalArgumentException  Traceback (most recent call last)

 in 

> 1 df.toPandas()

/home/spark-2.4.6-bin-hadoop2.7/python/pyspark/sql/dataframe.py in
toPandas(self)

   2153

   2154 # Below is toPandas without Arrow optimization.

-> 2155 pdf = pd.DataFrame.from_records(self.collect(),
columns=self.columns)

   2156 column_counter = Counter(self.columns)

   2157

/home/spark-2.4.6-bin-hadoop2.7/python/pyspark/sql/dataframe.py in
collect(self)

533 """

534 with SCCallSiteSync(self._sc) as css:

--> 535 sock_info = self._jdf.collectToPython()

536 return list(_load_from_socket(sock_info,
BatchedSerializer(PickleSerializer())))

537

/home/spark-2.4.6-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py
in __call__(self, *args)

   1255 answer = self.gateway_client.send_command(command)

   1256 return_value = get_return_value(

-> 1257 answer, self.gateway_client, self.target_id, self.name)

   1258

   1259 for temp_arg in temp_args:

/home/spark-2.4.6-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a,
**kw)

 77 raise QueryExecutionException(s.split(': ', 1)[1],
stackTrace)

 78 if s.startswith('java.lang.IllegalArgumentException: '):

---> 79 raise IllegalArgumentException(s.split(': ', 1)[1],
stackTrace)

 80 raise

 81 return deco

IllegalArgumentException: 'Unsupported class file major version 55'


On Fri, 19 Jun 2020, 08:06 Stephen Boesch,  wrote:

> afaik it has been there since Spark 2.0 in 2015. Not certain about
> Spark 1.5/1.6
>
> On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan 
> wrote:
>
>> I first ran the command
>> df.show()
>>
>> as a sanity check of my DataFrame.
>>
>> I wasn't impressed with the display.
>>
>> I then ran
>> df.toPandas() in a Jupyter notebook.
>>
>> Now the display is really good looking.
>>
>> Is toPandas() a new function that became available in Spark 3.0?
>>
>>
>>
>>
>>
>>


Re: Hey good looking toPandas ()

2020-06-19 Thread Anwar AliKhan
I got an IllegalArgumentException with 2.4.6.

I then pointed my Jupyter notebook at the 3.0.0 version and it worked as
expected, using the same .ipynb file.

I was following this machine learning example.
“Your First Apache Spark ML Model” by Favio Vázquez
https://towardsdatascience.com/your-first-apache-spark-ml-model-d2bb82b599dd


In the example he is using version 3.0, so I assumed I got the error because
I am using a different version (2.4.6).



On Fri, 19 Jun 2020, 08:06 Stephen Boesch,  wrote:

> afaik it has been there since Spark 2.0 in 2015. Not certain about
> Spark 1.5/1.6
>
> On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan 
> wrote:
>
>> I first ran the command
>> df.show()
>>
>> as a sanity check of my DataFrame.
>>
>> I wasn't impressed with the display.
>>
>> I then ran
>> df.toPandas() in a Jupyter notebook.
>>
>> Now the display is really good looking.
>>
>> Is toPandas() a new function that became available in Spark 3.0?
>>
>>
>>
>>
>>
>>


Hey good looking toPandas ()

2020-06-19 Thread Anwar AliKhan
I first ran the command
df.show()

as a sanity check of my DataFrame.

I wasn't impressed with the display.

I then ran
df.toPandas() in a Jupyter notebook.

Now the display is really good looking.

Is toPandas() a new function that became available in Spark 3.0?
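
For reference, a small sketch of the two display paths, assuming an existing
SparkSession named spark and a DataFrame df. toPandas() long predates Spark
3.0; the nicer output comes from Jupyter rendering the resulting pandas
DataFrame as an HTML table. The Arrow settings are optional and only speed up
the conversion:

# Spark 3.x key; on Spark 2.3/2.4 the key is "spark.sql.execution.arrow.enabled".
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df.show(5)           # fixed-width text table printed by the JVM
pdf = df.toPandas()  # collects the whole DataFrame to the driver as pandas
pdf.head(5)          # rendered by Jupyter as a rich HTML table

Because toPandas() pulls every row onto the driver, it is only suitable for
data that fits in driver memory.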


Add python library

2020-06-06 Thread Anwar AliKhan
 " > Have you looked into this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
 "

This is weird!
I was hanging out at https://machinelearningmastery.com/start-here/ when I
came across this post.

The weird part is that I was just wondering how I could take one of the
projects (Open AI GYM taxi-vt2 in Python), a project I want to develop
further.

I want to run it on Spark, using Spark's parallelism features and GPU
capabilities when I am working with bigger datasets, while installing the
workers (slaves) that do the sliced-dataset computations on the new 8 GB RAM
Raspberry Pi (Linux).

Are there any other documents on the official website, or in any other
location, which show how to do that, preferably with full, self-contained
examples?



On Fri, 5 Jun 2020, 09:02 Dark Crusader, 
wrote:

> Hi Stone,
>
>
> I haven't tried it with .so files however I did use the approach he
> recommends to install my other dependencies.
> I Hope it helps.
>
> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong  wrote:
>
>> Hi,
>>
>> So my pyspark app depends on some python libraries, it is not a problem,
>> I pack all the dependencies into a file libs.zip, and then call
>> *sc.addPyFile("libs.zip")* and it works pretty well for a while.
>>
>> Then I encountered a problem: if any of my libraries has a binary file
>> dependency (like .so files), this approach does not work, mainly because
>> when you set PYTHONPATH to a zip file, Python does not look up a needed
>> binary library (e.g. a .so file) inside the zip file; this is a Python
>> *limitation*. So I got a workaround:
>>
>> 1) Do not call sc.addPyFile, instead extract the libs.zip into current
>> directory
>> 2) When my python code starts, manually call *sys.path.insert(0,
>> f"{os.getcwd()}/libs")* to set PYTHONPATH
>>
>> This workaround works well for me. Then I got another problem: what if my
>> code in an executor needs a Python library that has binary code? Below is an
>> example:
>>
>> def do_something(p):
>>     ...
>>
>> rdd = sc.parallelize([
>>     {"x": 1, "y": 2},
>>     {"x": 2, "y": 3},
>>     {"x": 3, "y": 4},
>> ])
>> a = rdd.map(do_something)
>>
>> What if the function "do_something" needs a Python library that has
>> binary code? My current solution is to extract libs.zip into an NFS share (or
>> an SMB share) and manually do *sys.path.insert(0,
>> f"share_mount_dir/libs")* in my "do_something" function, but adding such
>> code in each function looks ugly; is there any better/elegant solution?
>>
>> Thanks,
>> Stone
>>
>>
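
A rough sketch of the workaround described above, arranged so the path setup
happens once per Python worker instead of being repeated in every function. It
assumes libs.zip has been shipped to each node and unpacked into the
executors' working directories (for example with spark-submit --archives
libs.zip#libs on YARN); the package name some_native_lib is hypothetical, and
rdd is the one from the example above:

import os
import sys

def _ensure_libs():
    # Idempotent: prepend the unpacked libs directory once per Python worker
    # so that packages containing .so files can be imported normally.
    libs_dir = os.path.join(os.getcwd(), "libs")
    if libs_dir not in sys.path:
        sys.path.insert(0, libs_dir)

def do_something(p):
    _ensure_libs()
    import some_native_lib  # hypothetical package with .so dependencies
    return some_native_lib.process(p)

a = rdd.map(do_something)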


Re: Spark dataframe hdfs vs s3

2020-05-30 Thread Anwar AliKhan
Optimisation of Spark applications

Apache Spark is an in-memory
data processing tool widely used in companies to deal with Big Data issues.
Running a Spark application in production requires user-defined resources.
This article presents several Spark concepts to optimize the use of the
engine, both in the writing of the code and in the selection of execution
parameters. These concepts will be illustrated through a use case with a
focus on best practices for allocating resources to a Spark application
in a Hadoop YARN environment.
Spark Cluster: terminologies and modes

Deploying a Spark application in a YARN cluster requires an understanding
of the “master-slave” model as well as the operation of several components:
the Cluster Manager, the Spark Driver, the Spark Executors and the Edge
Node concept.

The “master-slave” model defines two types of entities: the master controls
and centralizes the communications of the slaves. It is a model that is
often applied in the implementation of clusters and/or for parallel
processing. It is also the model used by Spark applications.

The *Cluster Manager* maintains the physical machines on which the Driver
and its Executors are going to run and allocates the requested resources to
the users. Spark supports 4 Cluster Managers: Apache YARN, Mesos,
Standalone and, recently, Kubernetes. We will focus on YARN.

The *Spark Driver* is the entity that manages the execution of the Spark
application (the master), each application is associated with a Driver. Its
role is to interpret the application’s code to transform it into a sequence
of tasks and to maintain all the states and tasks of the Executors.

The *Spark Executors* are the entities responsible for performing the tasks
assigned to them by the Driver (the slaves). They will read these tasks,
execute them and return their states (Success/Fail) and results. The
Executors are linked to only one application at a time.

The *Edge Node* is a physical/virtual machine where users will connect to
instantiate their Spark applications. It serves as an interface between the
cluster and the outside world. It is a comfort zone where components are
pre-installed and most importantly, pre-configured.
Execution modes

There are different ways to deploy a Spark application:

   - The *Cluster* mode: this is the most common; the user sends a JAR file
   or a Python script to the Cluster Manager, which will instantiate a
   Driver and Executors on the different nodes of the cluster. The Cluster Manager is
   responsible for all processes related to the Spark application. We will use
   it to handle our example: it facilitates the allocation of resources and
   releases them as soon as the application is finished.
   - The *Client* mode: Almost identical to *cluster* mode with the
   difference that the driver is instantiated on the machine where the job is
   submitted, i.e. outside the cluster. It is often used for program
   development because the logs are directly displayed in the current
   terminal, and the instance of the driver is linked to the user’s session.
   This mode is not recommended in production because the Edge Node can
   quickly reach saturation in terms of resources and the Edge Node is a SPOF
   (Single Point Of Failure).
   - The *Local* mode: the Driver and Executors run on the machine on which
   the user is logged in. It is only recommended for the purpose of testing an
   application in a local environment or for executing unit tests.

The number of Executors and their respective resources are provided
directly in the spark-submit command, or via the configuration properties
injected at the creation of the SparkSession object. Once the Executors are
created, they will communicate with the Driver, which will distribute the
processing tasks.
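
As an illustrative sketch (the figures are placeholders, not recommendations),
the same resource request can be expressed either as spark-submit flags or as
configuration properties on the SparkSession builder, assuming a YARN cluster:

from pyspark.sql import SparkSession

# spark-submit equivalent:
#   spark-submit --master yarn --deploy-mode cluster \
#     --num-executors 4 --executor-cores 4 \
#     --executor-memory 8g --driver-memory 4g app.py
spark = (
    SparkSession.builder
    .appName("resource-allocation-sketch")
    .master("yarn")
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

Driver memory is left to the spark-submit line on purpose: it has to be fixed
before the driver JVM starts, so setting it from inside the application is
generally too late.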

Resources

A Spark application works as follows: data is stored in memory, and the
CPUs are responsible for performing the tasks of an application. The
application is therefore constrained by the resources used, including
memory and CPUs, which are defined for the Driver and Executors.

Spark applications can generally be divided into two types:

   - *Memory-intensive*: Applications involving massive joins or HashMap
   processing. These operations are expensive in terms of memory.
   - *CPU-intensive*: All applications involving sorting operations or
   searching for particular data. These types of jobs become intensive
   depending on the frequency of these operations.

Some applications are both memory intensive and CPU intensive: some models
of Machine Learning, for example, require both.
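
Since massive joins were called out above as memory-expensive, here is a small
illustrative sketch of one common mitigation: broadcasting the smaller side of
a join so that the large side is not shuffled. The paths, table and column
names are made up for the example:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table (placeholder)
countries = spark.read.parquet("/data/countries")  # small lookup table (placeholder)

# The broadcast hint ships the small table to every executor, avoiding a full
# shuffle of the large one; only appropriate when the small side comfortably
# fits in executor memory.
joined = orders.join(F.broadcast(countries), on="country_code", how="left")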

Re: [pyspark 2.3+] Dedupe records

2020-05-30 Thread Anwar AliKhan
What does it mean that DataFrames are RDDs under the cover?

What is meant by deduplication?


Please send your bio data, history, and past commercial projects.

The Wali Ahad has agreed to release 300 million USD for a new machine
learning research project to centralize government facilities and find
better ways to offer citizen services with artificial intelligence
technologies.

I am to find talented artificial intelligence experts.


Shukran



On Sat, 30 May 2020, 05:26 Sonal Goyal,  wrote:

> Hi Rishi,
>
> 1. Dataframes are RDDs under the cover. If you have unstructured data, or
> if you know something about the data through which you can optimize the
> computation, you can go with RDDs. Otherwise the Dataframes, which are optimized
> by Spark SQL, should be fine.
> 2. For incremental deduplication, I guess you can hash your data based on
> some particular values and then only compare the new records against the
> ones which have the same hash. That should reduce the order of comparisons
> drastically provided you can come up with a good indexing/hashing scheme as
> per your dataset.
>
> Thanks,
> Sonal
> Nube Technologies 
>
> 
>
>
>
>
> On Sat, May 30, 2020 at 8:17 AM Rishi Shah 
> wrote:
>
>> Hi All,
>>
>> I have around 100B records where I get new, update & delete records.
>> Update/delete records are not that frequent. I would like to get some
>> advice on below:
>>
>> 1) should I use rdd + reducibly or DataFrame window operation for data of
>> this size? Which one would outperform the other? Which is more reliable and
>> low maintenance?
>> 2) Also how would you suggest we do incremental deduplication? Currently
>> we do full processing once a week and no dedupe during week days to avoid
>> heavy processing. However I would like to explore incremental dedupe option
>> and weight pros/cons.
>>
>> Any input is highly appreciated!
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
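
To make the second point concrete, a rough sketch of hash-bucketed incremental
deduplication. Everything here is illustrative: the paths and the column names
"id" and "updated_at" are assumptions about the dataset, not part of the
thread:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("incremental-dedupe-sketch").getOrCreate()

# Assumed inputs: the already-deduplicated history and the new incremental batch.
history = spark.read.parquet("/warehouse/records_deduped")
incoming = spark.read.parquet("/landing/records_new")

def with_key_hash(df):
    # Hash the business key so only rows sharing a hash bucket are compared.
    return df.withColumn("key_hash", F.sha2(F.col("id").cast("string"), 256))

history_h = with_key_hash(history)
incoming_h = with_key_hash(incoming)

# Only history rows whose hash appears in the new batch need to be touched.
candidates = history_h.join(
    incoming_h.select("key_hash").distinct(), on="key_hash", how="inner"
)

# Keep the latest version of each key among the affected rows plus the new rows.
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
deduped_delta = (
    candidates.unionByName(incoming_h)
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn", "key_hash")
)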


Re: Spark Security

2020-05-29 Thread Anwar AliKhan
What is the size of your .tsv file, sir?
What is the size of your local hard drive, sir?


Regards


Wali Ahaad


On Fri, 29 May 2020, 16:21 ,  wrote:

> Hello,
>
> I plan to load in a local .tsv file from my hard drive using sparklyr (an
> R package). I have figured out how to do this already on small files.
>
> When I decide to receive my client’s large .tsv file, can I be confident
> that loading in data this way will be secure? I know that this creates a
> Spark connection to help process the data more quickly, but I want to
> verify that the data will be secure after loading it with the Spark
> connection and sparklyr.
>
>
> Thanks,
>
> Wilbert J. Seoane
>
> Sent from iPhone
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
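
For completeness, a PySpark counterpart to the sparklyr workflow described
above, as a rough sketch: in local mode the driver and executor threads all
run in a single JVM on the user's machine, so the .tsv is processed without
leaving the local host (the file path is a placeholder):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")  # everything runs on this machine only
    .appName("local-tsv-read")
    .getOrCreate()
)

df = spark.read.csv(
    "/path/to/client_data.tsv",  # placeholder path
    sep="\t",
    header=True,
    inferSchema=True,
)
df.show(5)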