Re: Incorrect param in Doc ref url:https://spark.apache.org/docs/latest/ml-datasource

2020-02-08 Thread Tanay Banerjee
> Hi Team,
>
> I have found the incorrect code below, which leads to the error "NameError:
> name 'true' is not defined". The parameter must be 'True', not
> 'true'.
>
> All the details are provided below.
>
> URL: https://spark.apache.org/docs/latest/ml-datasource
>
> Issue:
>
> true -> True
> df = spark.read.format("image").option("dropInvalid",
> true).load("data/mllib/images/origin/kittens")
>
> After Correction:
> df = spark.read.format("image").option("dropInvalid",
> True).load("data/mllib/images/origin/kittens")
>
> regards,
> Tanay
>


Re: Incorrect param in Doc ref url:https://spark.apache.org/docs/latest/ml-datasource

2020-02-08 Thread Sean Owen
To be clear, you're referring to the Python version of the example.
Yes it should be True.
Can you open a pull request to fix it? The docs are under docs/ in the
apache/spark GitHub repo. That's how we normally take fixes.
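Since this keeps tripping people up, here is a minimal, Spark-free sketch of the underlying Python rule: the boolean literals are capitalized, so a bare lowercase `true` is simply an undefined name.

```python
# Python's boolean literals are `True`/`False`; lowercase `true` is not
# defined anywhere, so referencing it raises NameError.
try:
    eval("true")
    raised = False
except NameError:
    raised = True

assert raised  # lowercase `true` raises NameError

# The corrected docs line passes the proper literal:
drop_invalid = True
assert drop_invalid is True
```

(Scala and SQL spell the literal `true`, which is presumably how the lowercase form slipped into the Python example.)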

On Sat, Feb 8, 2020 at 9:14 AM Tanay Banerjee  wrote:
>
>
>> Hi Team,
>>
>> I have found the incorrect code below, which leads to the error "NameError: 
>> name 'true' is not defined". The parameter must be 'True', not 'true'.
>>
>> All the details are provided below.
>>
>> URL: https://spark.apache.org/docs/latest/ml-datasource
>>
>> Issue:
>>
>> true -> True
>> df = spark.read.format("image").option("dropInvalid", 
>> true).load("data/mllib/images/origin/kittens")
>>
>> After Correction:
>> df = spark.read.format("image").option("dropInvalid", 
>> True).load("data/mllib/images/origin/kittens")
>>
>> regards,
>> Tanay




Fwd: dataframe null safe joins given a list of columns

2020-02-08 Thread Enrico Minack

Hi Devs,

I am forwarding this from the user mailing list. I agree that the <=> 
version of join(Dataset[_], Seq[String]) would be useful.


Does anyone on the PMC consider this useful enough to be added to the Dataset 
API? I'd be happy to create a PR in that case.


Enrico



 Forwarded Message 
Subject:	dataframe null safe joins given a list of columns
Date:	Thu, 6 Feb 2020 12:45:11 +
From:	Marcelo Valle 
To:	user @spark 



I was surprised I couldn't find a way of solving this in Spark, as it 
must be a very common problem for users, so I decided to ask here.


Consider the code below:

```
val joinColumns = Seq("a", "b")
val df1 = Seq(("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a4", null, 
"c4")).toDF("a", "b", "c")
val df2 = Seq(("a1", "b1", "d1"), ("a3", "b3", "d3"), ("a4", null, 
"d4")).toDF("a", "b", "d")

df1.join(df2, joinColumns).show()
```

The output is:

```
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
| a1| b1| c1| d1|
+---+---+---+---+
```

But I want it to be:

```
+---+-+---+---+
|  a|    b|  c|  d|
+---+-+---+---+
| a1|   b1| c1| d1|
| a4| null| c4| d4|
+---+-+---+---+
```

The `df1.join(df2, joinColumns)` syntax has the advantage that it 
doesn't create duplicate columns by default. However, it joins with the 
`===` operator, not the null-safe `<=>`.


Using the following syntax:

```
df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b")).show()
```

Would produce:

```
+---++---+---++---+
|  a|   b|  c|  a|   b|  d|
+---++---+---++---+
| a1|  b1| c1| a1|  b1| d1|
| a4|null| c4| a4|null| d4|
+---++---+---++---+
```

So to get the result I really want, I must do:

```
df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> 
df2("b")).drop(df2("a")).drop(df2("b")).show()

+---++---+---+
|  a|   b|  c|  d|
+---++---+---+
| a1|  b1| c1| d1|
| a4|null| c4| d4|
+---++---+---+
```

Which works, but is really verbose, especially when you have many join 
columns.


Is there a better way of solving this without needing a utility method? 
This same problem is something I find in every Spark project.
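In the meantime, the semantics being asked for can be sketched in plain Python, with `None` standing in for SQL NULL (no Spark required; `null_safe_inner_join` is a hypothetical helper for illustration, not an API proposal):

```python
# Inner join on `cols` where NULL (None) matches NULL, mimicking what
# df1.join(df2, joinColumns) would do if it used `<=>` instead of `===`.
def null_safe_inner_join(left, right, cols):
    out = []
    for lrow in left:
        for rrow in right:
            # In Python, None == None is True, which is exactly the
            # null-safe (`<=>`) comparison semantics wanted here.
            if all(lrow[c] == rrow[c] for c in cols):
                merged = dict(lrow)
                merged.update({k: v for k, v in rrow.items() if k not in cols})
                out.append(merged)
    return out

df1 = [{"a": "a1", "b": "b1", "c": "c1"},
       {"a": "a2", "b": "b2", "c": "c2"},
       {"a": "a4", "b": None, "c": "c4"}]
df2 = [{"a": "a1", "b": "b1", "d": "d1"},
       {"a": "a3", "b": "b3", "d": "d3"},
       {"a": "a4", "b": None, "d": "d4"}]

rows = null_safe_inner_join(df1, df2, ["a", "b"])
# Keeps both the (a1, b1) row and the (a4, NULL) row, as desired.
```

(In PySpark, `Column.eqNullSafe` exposes the same `<=>` operator, so a helper could build the join condition from a list of column names and then drop the right-hand columns, exactly as the verbose version above does.)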








Re: Incorrect param in Doc ref url:https://spark.apache.org/docs/latest/ml-datasource

2020-02-08 Thread Tanay Banerjee
Thanks Sean.
I have just opened a pull request: Branch 2.4 #27500.



On Sat, 8 Feb 2020, 8:54 pm Sean Owen,  wrote:

> To be clear, you're referring to the Python version of the example.
> Yes it should be True.
> Can you open a pull request to fix it? The docs are under docs/ in the
> apache/spark GitHub repo. That's how we normally take fixes.
>
> On Sat, Feb 8, 2020 at 9:14 AM Tanay Banerjee 
> wrote:
> >
> >
> >> Hi Team,
> >>
> >> I have found the incorrect code below, which leads to the error
> "NameError: name 'true' is not defined". The parameter must be 'True',
> not 'true'.
> >>
> >> All the details are provided below.
> >>
> >> URL: https://spark.apache.org/docs/latest/ml-datasource
> >>
> >> Issue:
> >>
> >> true -> True
> >> df = spark.read.format("image").option("dropInvalid",
> true).load("data/mllib/images/origin/kittens")
> >>
> >> After Correction:
> >> df = spark.read.format("image").option("dropInvalid",
> True).load("data/mllib/images/origin/kittens")
> >>
> >> regards,
> >> Tanay
>


Re: Initial Decom PR for Spark 3?

2020-02-08 Thread Erik Erlandson
I'd be willing to pull this in, unless others have concerns post branch-cut.

On Tue, Feb 4, 2020 at 2:51 PM Holden Karau  wrote:

> Hi Y’all,
>
> I’ve got a K8s graceful decom PR (
> https://github.com/apache/spark/pull/26440
>  ) I’d love to try to get it in for Spark 3, but I don’t want to push on it
> if folks don’t think it’s worth it. I’ve been working on it since 2017, and
> it was really close in November, but then I had the crash and had to step
> back for a while.
>
> Its effectiveness is behind a feature flag, and it’s been outstanding for
> a while, so those points are in its favour. It does, however, change things
> in core, which is not great.
>
> Cheers,
>
> Holden
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


[ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-08 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 2.4.5!

Spark 2.4.5 is a maintenance release containing stability fixes. This
release is based on the branch-2.4 maintenance branch of Spark. We strongly
recommend that all 2.4 users upgrade to this stable release.

To download Spark 2.4.5, head over to the download page:
http://spark.apache.org/downloads.html

Note that you might need to clear your browser cache or
use `Private`/`Incognito` mode, depending on your browser.

To view the release notes:
https://spark.apache.org/releases/spark-release-2.4.5.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-08 Thread Dongjoon Hyun
There was a typo in one URL. The correct release note URL is here.

https://spark.apache.org/releases/spark-release-2-4-5.html



On Sat, Feb 8, 2020 at 5:22 PM Dongjoon Hyun 
wrote:

> We are happy to announce the availability of Spark 2.4.5!
>
> Spark 2.4.5 is a maintenance release containing stability fixes. This
> release is based on the branch-2.4 maintenance branch of Spark. We strongly
> recommend that all 2.4 users upgrade to this stable release.
>
> To download Spark 2.4.5, head over to the download page:
> http://spark.apache.org/downloads.html
>
> Note that you might need to clear your browser cache or
> use `Private`/`Incognito` mode, depending on your browser.
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2.4.5.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Dongjoon Hyun
>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-08 Thread Takeshi Yamamuro
Happy to hear the release news!

Bests,
Takeshi

On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun 
wrote:

> There was a typo in one URL. The correct release note URL is here.
>
> https://spark.apache.org/releases/spark-release-2-4-5.html
>
>
>
> On Sat, Feb 8, 2020 at 5:22 PM Dongjoon Hyun 
> wrote:
>
>> We are happy to announce the availability of Spark 2.4.5!
>>
>> Spark 2.4.5 is a maintenance release containing stability fixes. This
>> release is based on the branch-2.4 maintenance branch of Spark. We strongly
>> recommend that all 2.4 users upgrade to this stable release.
>>
>> To download Spark 2.4.5, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> Note that you might need to clear your browser cache or
>> use `Private`/`Incognito` mode, depending on your browser.
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2.4.5.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Dongjoon Hyun
>>
>

-- 
---
Takeshi Yamamuro