[jira] [Resolved] (SPARK-31326) create Function docs structure for SQL Reference

2020-04-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31326.
--
Fix Version/s: 3.0.0
 Assignee: Huaxin Gao
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/28099]

> create Function docs structure for SQL Reference
> 
>
> Key: SPARK-31326
> URL: https://issues.apache.org/jira/browse/SPARK-31326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> create Function docs structure for SQL Reference






[jira] [Resolved] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time

2020-04-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31328.
-
Resolution: Fixed

Issue resolved by pull request 28101
[https://github.com/apache/spark/pull/28101]

> Incorrect timestamps rebasing on autumn daylight saving time
> 
>
> Key: SPARK-31328
> URL: https://issues.apache.org/jira/browse/SPARK-31328
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Run the following code in the *America/Los_Angeles* time zone:
> {code:scala}
> test("rebasing differences") {
>   withDefaultTimeZone(getZoneId("America/Los_Angeles")) {
> val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
>   .atZone(getZoneId("America/Los_Angeles"))
>   .toInstant)
> val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0)
>   .atZone(getZoneId("America/Los_Angeles"))
>   .toInstant)
> var micros = start
> var diff = Long.MaxValue
> var counter = 0
> while (micros < end) {
>   val rebased = rebaseGregorianToJulianMicros(micros)
>   val curDiff = rebased - micros
>   if (curDiff != diff) {
> counter += 1
> diff = curDiff
> val ldt = 
> microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime
> println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} 
> minutes")
>   }
>   micros += 30 * MICROS_PER_MINUTE
> }
> println(s"counter = $counter")
>   }
> }
> {code}
> The rebased and original micros must be the same after 1883-11-18, because the
> standard zone offset and DST offset are the same in the Proleptic Gregorian
> calendar and in the hybrid (Julian+Gregorian) calendar, but in fact there are
> differences of 60 minutes:
> {code:java}
> local date-time = 0001-01-01T00:00 diff = -2872 minutes
> local date-time = 0100-03-01T00:00 diff = -1432 minutes
> local date-time = 0200-03-01T00:00 diff = 7 minutes
> local date-time = 0300-03-01T00:00 diff = 1447 minutes
> local date-time = 0500-03-01T00:00 diff = 2887 minutes
> local date-time = 0600-03-01T00:00 diff = 4327 minutes
> local date-time = 0700-03-01T00:00 diff = 5767 minutes
> local date-time = 0900-03-01T00:00 diff = 7207 minutes
> local date-time = 1000-03-01T00:00 diff = 8647 minutes
> local date-time = 1100-03-01T00:00 diff = 10087 minutes
> local date-time = 1300-03-01T00:00 diff = 11527 minutes
> local date-time = 1400-03-01T00:00 diff = 12967 minutes
> local date-time = 1500-03-01T00:00 diff = 14407 minutes
> local date-time = 1582-10-15T00:00 diff = 7 minutes
> local date-time = 1883-11-18T12:22:58 diff = 0 minutes
> local date-time = 1918-10-27T01:22:58 diff = 60 minutes
> local date-time = 1918-10-27T01:22:58 diff = 0 minutes
> local date-time = 1919-10-26T01:22:58 diff = 60 minutes
> local date-time = 1919-10-26T01:22:58 diff = 0 minutes
> local date-time = 1945-09-30T01:22:58 diff = 60 minutes
> local date-time = 1945-09-30T01:22:58 diff = 0 minutes
> local date-time = 1949-01-01T01:22:58 diff = 60 minutes
> local date-time = 1949-01-01T01:22:58 diff = 0 minutes
> local date-time = 1950-09-24T01:22:58 diff = 60 minutes
> local date-time = 1950-09-24T01:22:58 diff = 0 minutes
> local date-time = 1951-09-30T01:22:58 diff = 60 minutes
> local date-time = 1951-09-30T01:22:58 diff = 0 minutes
> local date-time = 1952-09-28T01:22:58 diff = 60 minutes
> local date-time = 1952-09-28T01:22:58 diff = 0 minutes
> local date-time = 1953-09-27T01:22:58 diff = 60 minutes
> local date-time = 1953-09-27T01:22:58 diff = 0 minutes
> local date-time = 1954-09-26T01:22:58 diff = 60 minutes
> local date-time = 1954-09-26T01:22:58 diff = 0 minutes
> local date-time = 1955-09-25T01:22:58 diff = 60 minutes
> local date-time = 1955-09-25T01:22:58 diff = 0 minutes
> local date-time = 1956-09-30T01:22:58 diff = 60 minutes
> local date-time = 1956-09-30T01:22:58 diff = 0 minutes
> local date-time = 1957-09-29T01:22:58 diff = 60 minutes
> local date-time = 1957-09-29T01:22:58 diff = 0 minutes
> local date-time = 1958-09-28T01:22:58 diff = 60 minutes
> local date-time = 1958-09-28T01:22:58 diff = 0 minutes
> local date-time = 1959-09-27T01:22:58 diff = 60 minutes
> local date-time = 1959-09-27T01:22:58 diff = 0 minutes
> local date-time = 1960-09-25T01:22:58 diff = 60 minutes
> local date-time = 1960-09-25T01:22:58 diff = 0 minutes
> local date-time = 1961-09-24T01:22:58 diff = 60 minutes
> local date-time = 1961-09-24T01:22:58 diff = 0 minutes
> local date-time = 1962-10-28T01:22:58 diff = 60 minutes
> local date-time = 1962-10-28T01:22:58 diff = 0 minutes
> local date-time = 1963-10-27T01:22:58 diff = 60 minutes
> local 

[jira] [Resolved] (SPARK-31325) Control a plan explain mode in the events of SQL listeners via SQLConf

2020-04-02 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-31325.

  Assignee: Takeshi Yamamuro
Resolution: Fixed

This issue is resolved in https://github.com/apache/spark/pull/28097

> Control a plan explain mode in the events of SQL listeners via SQLConf
> --
>
> Key: SPARK-31325
> URL: https://issues.apache.org/jira/browse/SPARK-31325
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> This proposes to add a new SQL config for controlling the plan explain mode used
> in the events sent to SQL listeners (e.g., `SparkListenerSQLExecutionStart` and
> `SparkListenerSQLAdaptiveExecutionUpdate`).
> In the current master, the output of `QueryExecution.toString` (equivalent to the
> "extended" explain mode) is stored in these events. It would be useful to control
> that content via SQLConf.
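A minimal sketch of how the resulting setting could be used from application code; the config key `spark.sql.ui.explainMode` comes from the pull request linked above and should be treated as an assumption here, not documented API:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("explain-mode-demo")
  .master("local[*]")
  .getOrCreate()

// Assumed config key from the linked PR: the plan text embedded in
// SparkListenerSQLExecutionStart events then uses this explain mode instead
// of the default "extended" QueryExecution.toString output.
spark.conf.set("spark.sql.ui.explainMode", "formatted")

spark.range(10).selectExpr("id * 2 AS doubled").collect()
{code}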






[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch

2020-04-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074217#comment-17074217
 ] 

Nicholas Chammas commented on SPARK-31330:
--

Hmm, I didn't see anything from you on the mailing list. But thanks for these 
references! This is very helpful.

Looks like you had Infra enable autolabeler for the Avro project over in 
INFRA-17367. I will ask Infra to do the same for Spark and cc [~hyukjin.kwon] 
for committer approval (which I guess Infra may ask for).

> Automatically label PRs based on the paths they touch
> -
>
> Key: SPARK-31330
> URL: https://issues.apache.org/jira/browse/SPARK-31330
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We can potentially leverage the added labels to drive testing, review, or 
> other project tooling.






[jira] [Created] (SPARK-31333) Document Join Hints

2020-04-02 Thread Xiao Li (Jira)
Xiao Li created SPARK-31333:
---

 Summary: Document Join Hints
 Key: SPARK-31333
 URL: https://issues.apache.org/jira/browse/SPARK-31333
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Huaxin Gao









[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch

2020-04-02 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074142#comment-17074142
 ] 

Ismaël Mejía commented on SPARK-31330:
--

What about the approach I suggested on the mailing list?
The autolabeler does not have the mentioned limitation, and it has already been
used by various Apache projects:
https://github.com/mithro/autolabeler
https://github.com/apache/avro/blob/master/.github/autolabeler.yml

> Automatically label PRs based on the paths they touch
> -
>
> Key: SPARK-31330
> URL: https://issues.apache.org/jira/browse/SPARK-31330
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We can potentially leverage the added labels to drive testing, review, or 
> other project tooling.






[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch

2020-04-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074124#comment-17074124
 ] 

Nicholas Chammas commented on SPARK-31330:
--

Unfortunately, it seems I jumped the gun on sending that dev email about the 
GitHub PR labeler action.

It has a fundamental limitation that currently makes it [useless for 
us|https://github.com/actions/labeler/tree/d2c408e7ed8498dfdf675c5f8d133ab37b6f8520#pull-request-labeler]:
{quote}Note that only pull requests being opened from the same repository can 
be labeled. This action will not currently work for pull requests from forks – 
like is common in open source projects – because the token for forked pull 
request workflows does not have write permissions.
{quote}
Additional detail: 
[https://github.com/actions/labeler/issues/12#issuecomment-525762657]

I'll keep my eye on that Action in case they somehow lift or work around the 
limitation on forked repositories.

Of course, we can always implement this functionality ourselves, but the 
attraction of the GitHub Action was that we could reuse an existing, tested, 
and widely adopted implementation.

> Automatically label PRs based on the paths they touch
> -
>
> Key: SPARK-31330
> URL: https://issues.apache.org/jira/browse/SPARK-31330
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We can potentially leverage the added labels to drive testing, review, or 
> other project tooling.






[jira] [Updated] (SPARK-31332) Proposal to add Proximity Measure in Random Forest

2020-04-02 Thread Stanley Poon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanley Poon updated SPARK-31332:
-
Description: 
h3. Background

The RandomForest model does not provide a proximity measure as described by
[Breiman|https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm].
There are many important use cases of proximity:
 - more accurate replacement for missing data

 - identify outliers

 - clustering or multi-dimensional scaling

 - compute the proximities of test set in the training set

 - unsupervised learning

Performance and storage concerns are among the reasons that proximities are not
computed and kept during prediction, as mentioned in
[https://dzone.com/articles/classification-using-random-forest-with-spark-20].
h3. Proposal

RF in Spark is optimized for massive scalability on large-scale datasets where
the number of data points, features, and trees can be very big. Even with
optimized storage, proximity requires O(NxT) memory and may still not fit in
memory, where N is the number of data points and T is the number of trees in the
forest.

We propose to add a column in the prediction output that returns the node-id (or
a hash of it) of the terminal node for each sample data point.

The required changes to the current RF implementation will not increase
computation and storage by significant amounts, and they leave the possibility
open for computing some form of proximity after prediction. It is up to the users
how to use the extra column of node-ids. Without this, there is currently no
workaround for computing a proximity measure.
h4. Experiment on Spark 2.3.1 and 2.4.5

In one prototype, we output the terminal node id for each prediction from 
RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster 
prediction results by terminal node ids. The performance of the whole pipeline 
was reasonable for the size of our dataset.
h3. References
 * L. Breiman. Manual on setting up, using, and understanding random forests 
v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]
 * [https://dzone.com/articles/classification-using-random-forest-with-spark-20]

  was:
h3. Background

The RandomForest model does not provide a proximity measure as described by
[Breiman|https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm].
There are many important use cases of proximity:
 - more accurate replacement for missing data

 - identify outliers

 - clustering or multi-dimensional scaling

 - compute the proximities of test set in the training set

 - unsupervised learning

Performance and storage concerns are among the reasons that proximities are not
computed and kept during prediction, as mentioned in
[https://dzone.com/articles/classification-using-random-forest-with-spark-20].
h3. Proposal

RF in Spark is optimized for massive scalability on large-scale datasets where
the number of data points, features, and trees can be very big. Even with
optimized storage of NxT, it may still not fit in memory, where N is the number
of data points and T is the number of trees in the forest.

We propose to add a column in the prediction output that returns the node-id (or
a hash of it) of the terminal node for each sample data point.

The required changes to the current RF implementation will not increase
computation and storage by significant amounts, and they leave the possibility
open for computing some form of proximity after prediction. It is up to the users
how to use the extra column of node-ids. Without this, there is currently no
workaround for computing a proximity measure.
h4. Experiment on Spark 2.3.1 and 2.4.5

In one prototype, we output the terminal node id for each prediction from 
RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster 
prediction results by terminal node ids. The performance of the whole pipeline 
was reasonable for the size of our dataset.
h3. References
 * L. Breiman. Manual on setting up, using, and understanding random forests 
v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]
 * [https://dzone.com/articles/classification-using-random-forest-with-spark-20]


> Proposal to add Proximity Measure in Random Forest
> --
>
> Key: SPARK-31332
> URL: https://issues.apache.org/jira/browse/SPARK-31332
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.5
> Environment: The proposal should apply to any Spark version and OS
> supported by Spark.
> Specifically, the observations reported were based on:
>  * Spark 2.3.1 and 2.4.5
>  * Ubuntu 16.04.6 LTS
>  * Mac OS 10.13.6
>  
>

[jira] [Updated] (SPARK-31332) Proposal to add Proximity Measure in Random Forest

2020-04-02 Thread Stanley Poon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanley Poon updated SPARK-31332:
-
Description: 
h3. Background

The RandomForest model does not provide a proximity measure as described by
[Breiman|https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm].
There are many important use cases of proximity:
 - more accurate replacement for missing data

 - identify outliers

 - clustering or multi-dimensional scaling

 - compute the proximities of test set in the training set

 - unsupervised learning

Performance and storage concerns are among the reasons that proximities are not
computed and kept during prediction, as mentioned
[here|https://dzone.com/articles/classification-using-random-forest-with-spark-20].
h3. Proposal

RF in Spark is optimized for massive scalability on large-scale datasets where
the number of data points, features, and trees can be very big. Even with
optimized storage of NxT, it may still not fit in memory, where N is the number
of data points and T is the number of trees in the forest.

We propose to add a column in the prediction output that returns the node-id (or
a hash of it) of the terminal node for each sample data point.

The required changes to the current RF implementation will not increase
computation and storage by significant amounts, and they leave the possibility
open for computing some form of proximity after prediction. It is up to the users
how to use the extra column of node-ids. Without this, there is currently no
workaround for computing a proximity measure.
h4. Experiment on Spark 2.3.1 and 2.4.5

In one prototype, we output the terminal node id for each prediction from 
RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster 
prediction results by terminal node ids. The performance of the whole pipeline 
was reasonable for the size of our dataset.
h3. References
 * L. Breiman. Manual on setting up, using, and understanding random forests 
v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]
 * [https://dzone.com/articles/classification-using-random-forest-with-spark-20]

  was:
h3. Background

The RandomForest model does not provide a proximity measure as described by
[Breiman|https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm].
There are many important use cases of proximity:
 - more accurate replacement for missing data

 - identify outliers

 - clustering or multi-dimensional scaling

 - compute the proximities of test set in the training set

 - unsupervised learning

Performance and storage concerns are among the reasons that proximities are not
computed and kept during prediction, as mentioned
[here|https://dzone.com/articles/classification-using-random-forest-with-spark-20].
h3. Proposal

RF in Spark is optimized for massive scalability on large-scale datasets where
the number of data points, features, and trees can be very big. Even with
optimized storage of NxT, it may still not fit in memory, where N is the number
of data points and T is the number of trees in the forest.

We propose to add a column in the prediction output that returns the node-id (or
a hash of it) of the terminal node for each sample data point.

The required changes to the current RF implementation will not increase
computation and storage by significant amounts, and they leave the possibility
open for computing some form of proximity after prediction. It is up to the users
how to use the extra column of node-ids. Without this, there is currently no
workaround for computing a proximity measure.
h4. Experiment Based on Spark 2.3.1 and 2.4.5

In one prototype, we output the terminal node id for each prediction from 
RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster 
prediction results by terminal node ids. The performance of the whole pipeline 
was reasonable for the size of our dataset.
h3. References
 * L. Breiman. Manual on setting up, using, and understanding random forests 
v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]
 * [https://dzone.com/articles/classification-using-random-forest-with-spark-20]


> Proposal to add Proximity Measure in Random Forest
> --
>
> Key: SPARK-31332
> URL: https://issues.apache.org/jira/browse/SPARK-31332
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.5
> Environment: The proposal should apply to any Spark version and OS
> supported by Spark.
> Specifically, the observations reported were based on:
>  * Spark 2.3.1 and 2.4.5
>  * Ubuntu 16.04.6 LTS
>  * Mac OS 10.13.6
>  
>Reporter: Stanley Poon
>Priority: Major
>  Labels: Proximity, RandomForest, ml
>
> h3. Background
> The RandomForest model does not provide proximity 

[jira] [Updated] (SPARK-31332) Proposal to add Proximity Measure in Random Forest

2020-04-02 Thread Stanley Poon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanley Poon updated SPARK-31332:
-
Description: 
h3. Background

The RandomForest model does not provide a proximity measure as described by
[Breiman|https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm].
There are many important use cases of proximity:
 - more accurate replacement for missing data

 - identify outliers

 - clustering or multi-dimensional scaling

 - compute the proximities of test set in the training set

 - unsupervised learning

Performance and storage concerns are among the reasons that proximities are not
computed and kept during prediction, as mentioned in
[https://dzone.com/articles/classification-using-random-forest-with-spark-20].
h3. Proposal

RF in Spark is optimized for massive scalability on large-scale datasets where
the number of data points, features, and trees can be very big. Even with
optimized storage of NxT, it may still not fit in memory, where N is the number
of data points and T is the number of trees in the forest.

We propose to add a column in the prediction output that returns the node-id (or
a hash of it) of the terminal node for each sample data point.

The required changes to the current RF implementation will not increase
computation and storage by significant amounts, and they leave the possibility
open for computing some form of proximity after prediction. It is up to the users
how to use the extra column of node-ids. Without this, there is currently no
workaround for computing a proximity measure.
h4. Experiment on Spark 2.3.1 and 2.4.5

In one prototype, we output the terminal node id for each prediction from 
RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster 
prediction results by terminal node ids. The performance of the whole pipeline 
was reasonable for the size of our dataset.
h3. References
 * L. Breiman. Manual on setting up, using, and understanding random forests 
v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]
 * [https://dzone.com/articles/classification-using-random-forest-with-spark-20]

  was:
h3. Background

The RandomForest model does not provide a proximity measure as described by
[Breiman|https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm].
There are many important use cases of proximity:
 - more accurate replacement for missing data

 - identify outliers

 - clustering or multi-dimensional scaling

 - compute the proximities of test set in the training set

 - unsupervised learning

Performance and storage concerns are among the reasons that proximities are not
computed and kept during prediction, as mentioned
[here|https://dzone.com/articles/classification-using-random-forest-with-spark-20].
h3. Proposal

RF in Spark is optimized for massive scalability on large-scale datasets where
the number of data points, features, and trees can be very big. Even with
optimized storage of NxT, it may still not fit in memory, where N is the number
of data points and T is the number of trees in the forest.

We propose to add a column in the prediction output that returns the node-id (or
a hash of it) of the terminal node for each sample data point.

The required changes to the current RF implementation will not increase
computation and storage by significant amounts, and they leave the possibility
open for computing some form of proximity after prediction. It is up to the users
how to use the extra column of node-ids. Without this, there is currently no
workaround for computing a proximity measure.
h4. Experiment on Spark 2.3.1 and 2.4.5

In one prototype, we output the terminal node id for each prediction from 
RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster 
prediction results by terminal node ids. The performance of the whole pipeline 
was reasonable for the size of our dataset.
h3. References
 * L. Breiman. Manual on setting up, using, and understanding random forests 
v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]
 * [https://dzone.com/articles/classification-using-random-forest-with-spark-20]


> Proposal to add Proximity Measure in Random Forest
> --
>
> Key: SPARK-31332
> URL: https://issues.apache.org/jira/browse/SPARK-31332
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.5
> Environment: The proposal should apply to any Spark version and OS
> supported by Spark.
> Specifically, the observations reported were based on:
>  * Spark 2.3.1 and 2.4.5
>  * Ubuntu 16.04.6 LTS
>  * Mac OS 10.13.6
>  
>Reporter: Stanley Poon
>Priority: Major
>  Labels: Proximity, RandomForest, ml
>
> 

[jira] [Created] (SPARK-31332) Proposal to add Proximity Measure in Random Forest

2020-04-02 Thread Stanley Poon (Jira)
Stanley Poon created SPARK-31332:


 Summary: Proposal to add Proximity Measure in Random Forest
 Key: SPARK-31332
 URL: https://issues.apache.org/jira/browse/SPARK-31332
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.5
 Environment: The proposal should apply to any Spark version and OS
supported by Spark.

Specifically, the observations reported were based on:
 * Spark 2.3.1 and 2.4.5
 * Ubuntu 16.04.6 LTS
 * Mac OS 10.13.6

 
Reporter: Stanley Poon


h3. Background

The RandomForest model does not provide a proximity measure as described by
[Breiman|https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm].
There are many important use cases of proximity:
 - more accurate replacement for missing data

 - identify outliers

 - clustering or multi-dimensional scaling

 - compute the proximities of test set in the training set

 - unsupervised learning

Performance and storage concerns are among the reasons that proximities are not
computed and kept during prediction, as mentioned
[here|https://dzone.com/articles/classification-using-random-forest-with-spark-20].
h3. Proposal

RF in Spark is optimized for massive scalability on large-scale datasets where
the number of data points, features, and trees can be very big. Even with
optimized storage of NxT, it may still not fit in memory, where N is the number
of data points and T is the number of trees in the forest.

We propose to add a column in the prediction output that returns the node-id (or
a hash of it) of the terminal node for each sample data point.

The required changes to the current RF implementation will not increase
computation and storage by significant amounts, and they leave the possibility
open for computing some form of proximity after prediction. It is up to the users
how to use the extra column of node-ids. Without this, there is currently no
workaround for computing a proximity measure.
h4. Experiment Based on Spark 2.3.1 and 2.4.5

In one prototype, we output the terminal node id for each prediction from 
RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster 
prediction results by terminal node ids. The performance of the whole pipeline 
was reasonable for the size of our dataset.
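For illustration, a sketch of how the proposed leaf-id output could be turned into a proximity measure downstream. The `leafId`, `treeId`, and `sampleId` columns and their layout are assumptions for this example, not an existing RandomForest API:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("proximity-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical prediction output: one row per (sample, tree) carrying the id
// of the terminal (leaf) node that the sample fell into.
val predictions = Seq(
  ("s1", 0, 7L), ("s1", 1, 3L),
  ("s2", 0, 7L), ("s2", 1, 3L),
  ("s3", 0, 2L), ("s3", 1, 3L)
).toDF("sampleId", "treeId", "leafId")

// Proximity(a, b) = fraction of trees in which a and b land in the same leaf.
// Note: materializing all pairs is O(N^2); in practice one would restrict to
// candidate pairs, e.g. via LSH as in the prototype described above.
val numTrees = predictions.select("treeId").distinct().count()
val other = predictions.toDF("sampleId2", "treeId2", "leafId2")
val proximities = predictions
  .join(other,
    predictions("treeId") === other("treeId2") &&
      predictions("leafId") === other("leafId2") &&
      predictions("sampleId") < other("sampleId2"))
  .groupBy("sampleId", "sampleId2")
  .agg((count(lit(1)) / numTrees).as("proximity"))

proximities.show()
{code}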
h3. References
 * L. Breiman. Manual on setting up, using, and understanding random forests 
v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]
 * [https://dzone.com/articles/classification-using-random-forest-with-spark-20]






[jira] [Created] (SPARK-31331) Document Spark integration with Hive UDFs/UDAFs/UDTFs

2020-04-02 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31331:
--

 Summary: Document Spark integration with Hive UDFs/UDAFs/UDTFs
 Key: SPARK-31331
 URL: https://issues.apache.org/jira/browse/SPARK-31331
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Document Spark integration with Hive UDFs/UDAFs/UDTFs






[jira] [Created] (SPARK-31330) Automatically label PRs based on the paths they touch

2020-04-02 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31330:


 Summary: Automatically label PRs based on the paths they touch
 Key: SPARK-31330
 URL: https://issues.apache.org/jira/browse/SPARK-31330
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


We can potentially leverage the added labels to drive testing, review, or other 
project tooling.






[jira] [Created] (SPARK-31329) Modify executor monitor to allow for moving shuffle blocks

2020-04-02 Thread Holden Karau (Jira)
Holden Karau created SPARK-31329:


 Summary: Modify executor monitor to allow for moving shuffle blocks
 Key: SPARK-31329
 URL: https://issues.apache.org/jira/browse/SPARK-31329
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Spark Core
Affects Versions: 3.1.0
Reporter: Holden Karau
Assignee: Holden Karau


To enable SPARK-20629 we need to revisit code that assumes shuffle blocks don't
move. Currently, the executor monitor assumes that shuffle blocks are immovable;
we should modify this code.






[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-04-02 Thread Michael Armbrust (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073951#comment-17073951
 ] 

Michael Armbrust commented on SPARK-29358:
--

Sure, but it is very easy to make this not a behavior change.  Add an optional 
boolean parameter, {{allowMissingColumns}} (or something) that defaults to 
{{false}}.
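For concreteness, a sketch of how the suggested parameter would read at the call site; at the time of this thread `allowMissingColumns` is only a proposal, not an existing Dataset API:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unionByName-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq(1, 2, 3).toDF("x")
val df2 = Seq("a", "b", "c").toDF("y")

// Proposed behavior: columns missing on either side are filled with nulls.
// With the default allowMissingColumns = false the existing behavior
// (AnalysisException) is kept, so this is not a behavior change.
val unioned = df1.unionByName(df2, allowMissingColumns = true)
unioned.show()
{code}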

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently the workaround is to add the missing columns as null literals before
> calling unionByName, but this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-04-02 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073928#comment-17073928
 ] 

L. C. Hsieh commented on SPARK-27913:
-

Now that we support schema merging in ORC via SPARK-11412, is this still an issue?

> Spark SQL's native ORC reader implements its own schema evolution
> -
>
> Key: SPARK-27913
> URL: https://issues.apache.org/jira/browse/SPARK-27913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Owen O'Malley
>Priority: Major
>
> ORC's reader handles a wide range of schema evolution, but the Spark SQL 
> native ORC bindings do not provide the desired schema to the ORC reader. This 
> causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.






[jira] [Updated] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time

2020-04-02 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-31328:
---
Description: 
Run the following code in the *America/Los_Angeles* time zone:
{code:scala}
test("rebasing differences") {
  withDefaultTimeZone(getZoneId("America/Los_Angeles")) {
val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
  .atZone(getZoneId("America/Los_Angeles"))
  .toInstant)
val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0)
  .atZone(getZoneId("America/Los_Angeles"))
  .toInstant)

var micros = start
var diff = Long.MaxValue
var counter = 0
while (micros < end) {
  val rebased = rebaseGregorianToJulianMicros(micros)
  val curDiff = rebased - micros
  if (curDiff != diff) {
counter += 1
diff = curDiff
val ldt = 
microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime
println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} 
minutes")
  }
  micros += 30 * MICROS_PER_MINUTE
}
println(s"counter = $counter")
  }
}
{code}
The rebased and original micros must be the same after 1883-11-18, because the
standard zone offset and DST offset are the same in the Proleptic Gregorian
calendar and in the hybrid (Julian+Gregorian) calendar, but in fact there are
differences of 60 minutes:
{code:java}
local date-time = 0001-01-01T00:00 diff = -2872 minutes
local date-time = 0100-03-01T00:00 diff = -1432 minutes
local date-time = 0200-03-01T00:00 diff = 7 minutes
local date-time = 0300-03-01T00:00 diff = 1447 minutes
local date-time = 0500-03-01T00:00 diff = 2887 minutes
local date-time = 0600-03-01T00:00 diff = 4327 minutes
local date-time = 0700-03-01T00:00 diff = 5767 minutes
local date-time = 0900-03-01T00:00 diff = 7207 minutes
local date-time = 1000-03-01T00:00 diff = 8647 minutes
local date-time = 1100-03-01T00:00 diff = 10087 minutes
local date-time = 1300-03-01T00:00 diff = 11527 minutes
local date-time = 1400-03-01T00:00 diff = 12967 minutes
local date-time = 1500-03-01T00:00 diff = 14407 minutes
local date-time = 1582-10-15T00:00 diff = 7 minutes
local date-time = 1883-11-18T12:22:58 diff = 0 minutes
local date-time = 1918-10-27T01:22:58 diff = 60 minutes
local date-time = 1918-10-27T01:22:58 diff = 0 minutes
local date-time = 1919-10-26T01:22:58 diff = 60 minutes
local date-time = 1919-10-26T01:22:58 diff = 0 minutes
local date-time = 1945-09-30T01:22:58 diff = 60 minutes
local date-time = 1945-09-30T01:22:58 diff = 0 minutes
local date-time = 1949-01-01T01:22:58 diff = 60 minutes
local date-time = 1949-01-01T01:22:58 diff = 0 minutes
local date-time = 1950-09-24T01:22:58 diff = 60 minutes
local date-time = 1950-09-24T01:22:58 diff = 0 minutes
local date-time = 1951-09-30T01:22:58 diff = 60 minutes
local date-time = 1951-09-30T01:22:58 diff = 0 minutes
local date-time = 1952-09-28T01:22:58 diff = 60 minutes
local date-time = 1952-09-28T01:22:58 diff = 0 minutes
local date-time = 1953-09-27T01:22:58 diff = 60 minutes
local date-time = 1953-09-27T01:22:58 diff = 0 minutes
local date-time = 1954-09-26T01:22:58 diff = 60 minutes
local date-time = 1954-09-26T01:22:58 diff = 0 minutes
local date-time = 1955-09-25T01:22:58 diff = 60 minutes
local date-time = 1955-09-25T01:22:58 diff = 0 minutes
local date-time = 1956-09-30T01:22:58 diff = 60 minutes
local date-time = 1956-09-30T01:22:58 diff = 0 minutes
local date-time = 1957-09-29T01:22:58 diff = 60 minutes
local date-time = 1957-09-29T01:22:58 diff = 0 minutes
local date-time = 1958-09-28T01:22:58 diff = 60 minutes
local date-time = 1958-09-28T01:22:58 diff = 0 minutes
local date-time = 1959-09-27T01:22:58 diff = 60 minutes
local date-time = 1959-09-27T01:22:58 diff = 0 minutes
local date-time = 1960-09-25T01:22:58 diff = 60 minutes
local date-time = 1960-09-25T01:22:58 diff = 0 minutes
local date-time = 1961-09-24T01:22:58 diff = 60 minutes
local date-time = 1961-09-24T01:22:58 diff = 0 minutes
local date-time = 1962-10-28T01:22:58 diff = 60 minutes
local date-time = 1962-10-28T01:22:58 diff = 0 minutes
local date-time = 1963-10-27T01:22:58 diff = 60 minutes
local date-time = 1963-10-27T01:22:58 diff = 0 minutes
local date-time = 1964-10-25T01:22:58 diff = 60 minutes
local date-time = 1964-10-25T01:22:58 diff = 0 minutes
local date-time = 1965-10-31T01:22:58 diff = 60 minutes
local date-time = 1965-10-31T01:22:58 diff = 0 minutes
local date-time = 1966-10-30T01:22:58 diff = 60 minutes
local date-time = 1966-10-30T01:22:58 diff = 0 minutes
local date-time = 1967-10-29T01:22:58 diff = 60 minutes
local date-time = 1967-10-29T01:22:58 diff = 0 minutes
local date-time = 1968-10-27T01:22:58 diff = 60 minutes
local date-time = 1968-10-27T01:22:58 diff = 0 minutes
local date-time = 1969-10-26T01:22:58 diff = 60 minutes
local date-time = 1969-10-26T01:22:58 diff = 0 minutes
local date-time = 1970-10-25T01:22:58 

[jira] [Updated] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time

2020-04-02 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-31328:
---
Description: 
Run the following code in the *America/Los_Angeles* time zone:
{code:scala}
test("rebasing differences") {
  withDefaultTimeZone(getZoneId("America/Los_Angeles")) {
val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
  .atZone(getZoneId("America/Los_Angeles"))
  .toInstant)
val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0)
  .atZone(getZoneId("America/Los_Angeles"))
  .toInstant)

var micros = start
var diff = Long.MaxValue
var counter = 0
while (micros < end) {
  val rebased = rebaseGregorianToJulianMicros(micros)
  val curDiff = rebased - micros
  if (curDiff != diff) {
counter += 1
diff = curDiff
val ldt = 
microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime
println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} 
minutes")
  }
  micros += 30 * MICROS_PER_MINUTE
}
println(s"counter = $counter")
  }
}
{code}
{code:java}
local date-time = 0001-01-01T00:00 diff = -2909 minutes
local date-time = 0100-02-28T14:00 diff = -1469 minutes
local date-time = 0200-02-28T14:00 diff = -29 minutes
local date-time = 0300-02-28T14:00 diff = 1410 minutes
local date-time = 0500-02-28T14:00 diff = 2850 minutes
local date-time = 0600-02-28T14:00 diff = 4290 minutes
local date-time = 0700-02-28T14:00 diff = 5730 minutes
local date-time = 0900-02-28T14:00 diff = 7170 minutes
local date-time = 1000-02-28T14:00 diff = 8610 minutes
local date-time = 1100-02-28T14:00 diff = 10050 minutes
local date-time = 1300-02-28T14:00 diff = 11490 minutes
local date-time = 1400-02-28T14:00 diff = 12930 minutes
local date-time = 1500-02-28T14:00 diff = 14370 minutes
local date-time = 1582-10-14T14:00 diff = -29 minutes
local date-time = 1899-12-31T16:52:58 diff = 0 minutes
local date-time = 1917-12-27T11:52:58 diff = 60 minutes
local date-time = 1917-12-27T12:52:58 diff = 0 minutes
local date-time = 1918-09-15T12:52:58 diff = 60 minutes
local date-time = 1918-09-15T13:52:58 diff = 0 minutes
local date-time = 1919-06-30T16:52:58 diff = 31 minutes
local date-time = 1919-06-30T17:52:58 diff = 0 minutes
local date-time = 1919-08-15T12:52:58 diff = 60 minutes
local date-time = 1919-08-15T13:52:58 diff = 0 minutes
local date-time = 1921-08-31T10:52:58 diff = 60 minutes
local date-time = 1921-08-31T11:52:58 diff = 0 minutes
local date-time = 1921-09-30T11:52:58 diff = 60 minutes
local date-time = 1921-09-30T12:52:58 diff = 0 minutes
local date-time = 1922-09-30T12:52:58 diff = 60 minutes
local date-time = 1922-09-30T13:52:58 diff = 0 minutes
local date-time = 1981-09-30T12:52:58 diff = 60 minutes
local date-time = 1981-09-30T13:52:58 diff = 0 minutes
local date-time = 1982-09-30T12:52:58 diff = 60 minutes
local date-time = 1982-09-30T13:52:58 diff = 0 minutes
local date-time = 1983-09-30T12:52:58 diff = 60 minutes
local date-time = 1983-09-30T13:52:58 diff = 0 minutes
local date-time = 1984-09-29T15:52:58 diff = 60 minutes
local date-time = 1984-09-29T16:52:58 diff = 0 minutes
local date-time = 1985-09-28T15:52:58 diff = 60 minutes
local date-time = 1985-09-28T16:52:58 diff = 0 minutes
local date-time = 1986-09-27T15:52:58 diff = 60 minutes
local date-time = 1986-09-27T16:52:58 diff = 0 minutes
local date-time = 1987-09-26T15:52:58 diff = 60 minutes
local date-time = 1987-09-26T16:52:58 diff = 0 minutes
local date-time = 1988-09-24T15:52:58 diff = 60 minutes
local date-time = 1988-09-24T16:52:58 diff = 0 minutes
local date-time = 1989-09-23T15:52:58 diff = 60 minutes
local date-time = 1989-09-23T16:52:58 diff = 0 minutes
local date-time = 1990-09-29T15:52:58 diff = 60 minutes
local date-time = 1990-09-29T16:52:58 diff = 0 minutes
local date-time = 1991-09-28T16:52:58 diff = 60 minutes
local date-time = 1991-09-28T17:52:58 diff = 0 minutes
local date-time = 1992-09-26T15:52:58 diff = 60 minutes
local date-time = 1992-09-26T16:52:58 diff = 0 minutes
local date-time = 1993-09-25T15:52:58 diff = 60 minutes
local date-time = 1993-09-25T16:52:58 diff = 0 minutes
local date-time = 1994-09-24T15:52:58 diff = 60 minutes
local date-time = 1994-09-24T16:52:58 diff = 0 minutes
local date-time = 1995-09-23T15:52:58 diff = 60 minutes
local date-time = 1995-09-23T16:52:58 diff = 0 minutes
local date-time = 1996-10-26T15:52:58 diff = 60 minutes
local date-time = 1996-10-26T16:52:58 diff = 0 minutes
local date-time = 1997-10-25T15:52:58 diff = 60 minutes
local date-time = 1997-10-25T16:52:58 diff = 0 minutes
local date-time = 1998-10-24T15:52:58 diff = 60 minutes
local date-time = 1998-10-24T16:52:58 diff = 0 minutes
local date-time = 1999-10-30T15:52:58 diff = 60 minutes
local date-time = 1999-10-30T16:52:58 diff = 0 minutes
local date-time = 2000-10-28T15:52:58 diff = 60 minutes
local date-time 

[jira] [Created] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time

2020-04-02 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31328:
--

 Summary: Incorrect timestamps rebasing on autumn daylight saving 
time
 Key: SPARK-31328
 URL: https://issues.apache.org/jira/browse/SPARK-31328
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk
Assignee: Maxim Gekk
 Fix For: 3.0.0


I believe it is possible to speed up date-time rebasing by building a map from
micros to the diffs between original and rebased micros, and looking the diff up
via binary search.
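A minimal sketch of that lookup, assuming the switch points have already been collected into sorted parallel arrays; the array values below are placeholders, not a real rebasing table:
{code:scala}
import java.util.Arrays

// Sorted micros at which the Gregorian->Julian difference changes, and the
// diff (in micros) that applies from each switch point onwards. A real table
// would be derived per time zone, as in the test below.
val switchMicros: Array[Long] = Array(Long.MinValue, -3600000000L, 0L)
val diffMicros: Array[Long]   = Array(-174360000000L, -1740000000L, 0L)

def rebaseViaLookup(micros: Long): Long = {
  // binarySearch returns the index if found, otherwise -(insertionPoint) - 1;
  // either way we want the last switch point that is <= micros.
  val i = Arrays.binarySearch(switchMicros, micros)
  val idx = if (i >= 0) i else -i - 2
  micros + diffMicros(idx)
}
{code}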

For example, the *America/Los_Angeles* time zone has fewer than 100 points at which
the diff changes:
{code:scala}
  test("optimize rebasing") {
val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
  .atZone(getZoneId("America/Los_Angeles"))
  .toInstant)
val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0)
  .atZone(getZoneId("America/Los_Angeles"))
  .toInstant)

var micros = start
var diff = Long.MaxValue
var counter = 0
while (micros < end) {
  val rebased = rebaseGregorianToJulianMicros(micros)
  val curDiff = rebased - micros
  if (curDiff != diff) {
counter += 1
diff = curDiff
val ldt = 
microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime
println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} 
minutes")
  }
  micros += MICROS_PER_HOUR
}
println(s"counter = $counter")
  }
{code}
{code:java}
local date-time = 0001-01-01T00:00 diff = -2909 minutes
local date-time = 0100-02-28T14:00 diff = -1469 minutes
local date-time = 0200-02-28T14:00 diff = -29 minutes
local date-time = 0300-02-28T14:00 diff = 1410 minutes
local date-time = 0500-02-28T14:00 diff = 2850 minutes
local date-time = 0600-02-28T14:00 diff = 4290 minutes
local date-time = 0700-02-28T14:00 diff = 5730 minutes
local date-time = 0900-02-28T14:00 diff = 7170 minutes
local date-time = 1000-02-28T14:00 diff = 8610 minutes
local date-time = 1100-02-28T14:00 diff = 10050 minutes
local date-time = 1300-02-28T14:00 diff = 11490 minutes
local date-time = 1400-02-28T14:00 diff = 12930 minutes
local date-time = 1500-02-28T14:00 diff = 14370 minutes
local date-time = 1582-10-14T14:00 diff = -29 minutes
local date-time = 1899-12-31T16:52:58 diff = 0 minutes
local date-time = 1917-12-27T11:52:58 diff = 60 minutes
local date-time = 1917-12-27T12:52:58 diff = 0 minutes
local date-time = 1918-09-15T12:52:58 diff = 60 minutes
local date-time = 1918-09-15T13:52:58 diff = 0 minutes
local date-time = 1919-06-30T16:52:58 diff = 31 minutes
local date-time = 1919-06-30T17:52:58 diff = 0 minutes
local date-time = 1919-08-15T12:52:58 diff = 60 minutes
local date-time = 1919-08-15T13:52:58 diff = 0 minutes
local date-time = 1921-08-31T10:52:58 diff = 60 minutes
local date-time = 1921-08-31T11:52:58 diff = 0 minutes
local date-time = 1921-09-30T11:52:58 diff = 60 minutes
local date-time = 1921-09-30T12:52:58 diff = 0 minutes
local date-time = 1922-09-30T12:52:58 diff = 60 minutes
local date-time = 1922-09-30T13:52:58 diff = 0 minutes
local date-time = 1981-09-30T12:52:58 diff = 60 minutes
local date-time = 1981-09-30T13:52:58 diff = 0 minutes
local date-time = 1982-09-30T12:52:58 diff = 60 minutes
local date-time = 1982-09-30T13:52:58 diff = 0 minutes
local date-time = 1983-09-30T12:52:58 diff = 60 minutes
local date-time = 1983-09-30T13:52:58 diff = 0 minutes
local date-time = 1984-09-29T15:52:58 diff = 60 minutes
local date-time = 1984-09-29T16:52:58 diff = 0 minutes
local date-time = 1985-09-28T15:52:58 diff = 60 minutes
local date-time = 1985-09-28T16:52:58 diff = 0 minutes
local date-time = 1986-09-27T15:52:58 diff = 60 minutes
local date-time = 1986-09-27T16:52:58 diff = 0 minutes
local date-time = 1987-09-26T15:52:58 diff = 60 minutes
local date-time = 1987-09-26T16:52:58 diff = 0 minutes
local date-time = 1988-09-24T15:52:58 diff = 60 minutes
local date-time = 1988-09-24T16:52:58 diff = 0 minutes
local date-time = 1989-09-23T15:52:58 diff = 60 minutes
local date-time = 1989-09-23T16:52:58 diff = 0 minutes
local date-time = 1990-09-29T15:52:58 diff = 60 minutes
local date-time = 1990-09-29T16:52:58 diff = 0 minutes
local date-time = 1991-09-28T16:52:58 diff = 60 minutes
local date-time = 1991-09-28T17:52:58 diff = 0 minutes
local date-time = 1992-09-26T15:52:58 diff = 60 minutes
local date-time = 1992-09-26T16:52:58 diff = 0 minutes
local date-time = 1993-09-25T15:52:58 diff = 60 minutes
local date-time = 1993-09-25T16:52:58 diff = 0 minutes
local date-time = 1994-09-24T15:52:58 diff = 60 minutes
local date-time = 1994-09-24T16:52:58 diff = 0 minutes
local date-time = 1995-09-23T15:52:58 diff = 60 minutes
local date-time = 1995-09-23T16:52:58 diff = 0 minutes
local date-time = 1996-10-26T15:52:58 diff = 60 minutes
local date-time = 1996-10-26T16:52:58 diff = 0 minutes
local 

[jira] [Created] (SPARK-31327) write spark version to avro file metadata

2020-04-02 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31327:
---

 Summary: write spark version to avro file metadata
 Key: SPARK-31327
 URL: https://issues.apache.org/jira/browse/SPARK-31327
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Resolved] (SPARK-29153) ResourceProfile conflict resolution stage level scheduling

2020-04-02 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-29153.
---
Fix Version/s: 3.1.0
 Assignee: Thomas Graves
   Resolution: Fixed

> ResourceProfile conflict resolution stage level scheduling
> --
>
> Key: SPARK-29153
> URL: https://issues.apache.org/jira/browse/SPARK-29153
> Project: Spark
>  Issue Type: Story
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> For stage-level scheduling, if a stage has conflicting ResourceProfiles from
> multiple RDDs, we have to resolve that conflict.
> We have two possible approaches:
>  # Default to erroring out on a conflict, so the user realizes what is going
> on, with a config to turn this behavior on and off.
>  # If the error-out config is off, resolve the conflict. See below, from the
> design doc of the SPIP.
> For the merge strategy we can take the max of the ResourceProfiles so that the
> largest required container is used. This will work in general, but there are a
> few cases where people may have intended the values to be summed. For instance,
> say one RDD needs X memory and another RDD needs Y memory; when those get
> combined into a stage you may really need X+Y memory rather than max(X, Y).
> Another example is union, where you would want to sum the resources of each
> RDD. I think we can document what we choose for now and later add the ability
> to use alternatives other than max, or perhaps change the behavior per
> operation or per resource type.
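A sketch of the max-merge strategy described above, using a plain map of resource name to amount rather than Spark's actual ResourceProfile classes:
{code:scala}
// Illustrative sketch only: a plain map of resource name -> amount stands in
// for Spark's actual ResourceProfile classes.
def mergeByMax(profiles: Seq[Map[String, Long]]): Map[String, Long] =
  profiles.flatten
    .groupBy(_._1)
    .map { case (resource, amounts) => resource -> amounts.map(_._2).max }

val rdd1Needs = Map("memory" -> 4096L, "cores" -> 2L)
val rdd2Needs = Map("memory" -> 8192L, "gpus" -> 1L)

// Result: memory -> 8192, cores -> 2, gpus -> 1. Note the caveat above: a user
// may sometimes have intended memory to be summed (4096 + 8192) instead.
val merged = mergeByMax(Seq(rdd1Needs, rdd2Needs))
{code}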






[jira] [Resolved] (SPARK-31179) Fast fail the connection while last shuffle connection failed in the last retry IO wait

2020-04-02 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-31179.
---
Fix Version/s: 3.1.0
 Assignee: feiwang
   Resolution: Fixed

> Fast fail the connection while last shuffle connection failed in the last 
> retry IO wait 
> 
>
> Key: SPARK-31179
> URL: https://issues.apache.org/jira/browse/SPARK-31179
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: feiwang
>Assignee: feiwang
>Priority: Major
> Fix For: 3.1.0
>
>
> When reading shuffle data, several fetch requests may be sent to the same
> shuffle server.
> There is a client pool, and these requests may share the same client.
> When the shuffle server is busy, the request connections may time out.
> For example, suppose there are two request connections, rc1 and rc2,
> io.numConnectionsPerPeer is 1, and the connection timeout is 2 minutes.
> 1: rc1 holds the client lock and times out after 2 minutes.
> 2: rc2 holds the client lock and times out after 2 minutes.
> 3: rc1 starts its second retry, holds the lock, and times out after 2 minutes.
> 4: rc2 starts its second retry, holds the lock, and times out after 2 minutes.
> 5: rc1 starts its third retry, holds the lock, and times out after 2 minutes.
> 6: rc2 starts its third retry, holds the lock, and times out after 2 minutes.
> This wastes a lot of time.
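A rough sketch of the fast-fail idea, with illustrative names rather than Spark's actual shuffle client classes: remember the last connect failure per server address and fail immediately when a failure was observed within the timeout window.
{code:scala}
import java.io.IOException
import java.util.concurrent.ConcurrentHashMap

object FastFailSketch {
  private val connectionTimeoutMs = 2 * 60 * 1000L
  // Last observed connect failure time (epoch millis) per server address.
  private val lastFailure = new ConcurrentHashMap[String, Long]()

  def connect(address: String)(doConnect: () => Unit): Unit = {
    val failedAt = lastFailure.getOrDefault(address, 0L)
    if (System.currentTimeMillis() - failedAt < connectionTimeoutMs) {
      // Another request to this server failed very recently; fail immediately
      // instead of blocking on the shared client for another full timeout.
      throw new IOException(s"Fast failing connection to $address")
    }
    try {
      doConnect()
    } catch {
      case e: Throwable =>
        lastFailure.put(address, System.currentTimeMillis())
        throw e
    }
  }
}
{code}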






[jira] [Resolved] (SPARK-31315) SQLQueryTestSuite: Display the total compile time for generated java code.

2020-04-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31315.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28081
[https://github.com/apache/spark/pull/28081]

> SQLQueryTestSuite: Display the total compile time for generated java code.
> --
>
> Key: SPARK-31315
> URL: https://issues.apache.org/jira/browse/SPARK-31315
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> SQLQueryTestSuite spends a lot of time compiling the generated Java code.
> We should display the total compile time for the generated Java code.






[jira] [Assigned] (SPARK-31315) SQLQueryTestSuite: Display the total compile time for generated java code.

2020-04-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31315:
---

Assignee: jiaan.geng

> SQLQueryTestSuite: Display the total compile time for generated java code.
> --
>
> Key: SPARK-31315
> URL: https://issues.apache.org/jira/browse/SPARK-31315
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> SQLQueryTestSuite spends a lot of time compiling the generated Java code.
> We should display the total compile time for the generated Java code.






[jira] [Resolved] (SPARK-30839) Add version information for Spark configuration

2020-04-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30839.
--
Resolution: Done

Thanks, [~beliefer] for working on this.

> Add version information for Spark configuration
> ---
>
> Key: SPARK-30839
> URL: https://issues.apache.org/jira/browse/SPARK-30839
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, DStreams, Kubernetes, Mesos, Spark Core, 
> SQL, Structured Streaming, YARN
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark's ConfigEntry and ConfigBuilder are missing the Spark version in which each
> configuration was released. This is inconvenient for Spark users when they visit
> the Spark configuration page:
> http://spark.apache.org/docs/latest/configuration.html
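A hedged sketch of what a versioned configuration declaration could look like inside Spark's own tree. ConfigBuilder is an internal (private[spark]) DSL, the entry name here is hypothetical, and the exact .version(...) signature is assumed for illustration, so this snippet only compiles within Spark's codebase.

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder

object ExampleConfig {
  // Hypothetical entry; .version(...) records the release that introduced it,
  // so the documentation can render a "Since Version" column per configuration.
  val EXAMPLE_FEATURE_ENABLED = ConfigBuilder("spark.example.featureEnabled")
    .doc("Whether the hypothetical example feature is enabled.")
    .version("3.1.0")
    .booleanConf
    .createWithDefault(false)
}
{code}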



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31321) Remove SaveMode check in v2 FileWriteBuilder

2020-04-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31321:
---

Assignee: Kent Yao

> Remove SaveMode check in v2 FileWriteBuilder
> 
>
> Key: SPARK-31321
> URL: https://issues.apache.org/jira/browse/SPARK-31321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> SaveMode is never assigned, so the builder fails when `validateInputs` is called.
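A simplified illustration of that failure mode (not the actual FileWriteBuilder code): a SaveMode field that is declared but never assigned stays null, so a validation step that requires it always throws.

{code:scala}
import org.apache.spark.sql.SaveMode

class ExampleWriteBuilder {
  private var mode: SaveMode = _  // declared but never assigned, so it stays null

  private def validateInputs(): Unit = {
    // Always fails because `mode` is still null when validation runs.
    require(mode != null, "SaveMode must be set before building the write")
  }

  def buildForBatch(): Unit = validateInputs()
}
{code}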



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31321) Remove SaveMode check in v2 FileWriteBuilder

2020-04-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31321.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28090
[https://github.com/apache/spark/pull/28090]

> Remove SaveMode check in v2 FileWriteBuilder
> 
>
> Key: SPARK-31321
> URL: https://issues.apache.org/jira/browse/SPARK-31321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> SaveMode is never assigned, so the builder fails when `validateInputs` is called.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

2020-04-02 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073463#comment-17073463
 ] 

Wenchen Fan commented on SPARK-30951:
-

Theoretically, the Parquet spec implicitly requires the proleptic Gregorian calendar by referring 
to the Java 8 time API: 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp

That said, Spark 2.x writes "wrong" datetime values to Parquet, and I don't 
think we should keep this "wrong" behavior by default in 3.0. Besides, you will 
hit mixed-calendar Parquet files anyway if the data is written by multiple 
systems (e.g. Spark and Hive).

I'd suggest that users turn on the legacy config only if they have legacy datetime 
values before 1582 in Parquet. To make it easier for users to realize that such 
legacy data exists, we can fail by default when reading datetime values before 
1582 from Parquet files.
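As a hedged illustration of that suggestion, in spark-shell style with a SparkSession `spark` in scope: the configuration key, values, and path below are assumptions based on the behavior described here, not something confirmed in this thread.

{code:scala}
// Hypothetical key/values, for illustration only: opt into legacy rebasing on read.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
val legacyDf = spark.read.parquet("/path/to/legacy-data")  // hypothetical path

// With a fail-by-default mode (e.g. "EXCEPTION"), the same read would raise an
// error on pre-1582 values, prompting the user to choose LEGACY or CORRECTED.
{code}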

> Potential data loss for legacy applications after switch to proleptic 
> Gregorian calendar
> 
>
> Key: SPARK-30951
> URL: https://issues.apache.org/jira/browse/SPARK-30951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Assignee: Maxim Gekk
>Priority: Blocker
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data 
> containing dates before October 15, 1582. This could be an issue when such 
> sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}"The changes might impact on the results for dates and timestamps 
> before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large scale Spark 2.x applications rely on 
> dates before October 15, 1582.
> Two cases came up recently:
>  * An application that uses a commercial third-party library to encode 
> sensitive dates. On insert, the library encodes the actual date as some other 
> date. On select, the library decodes the date back to the original date. The 
> encoded value could be any date, including one before October 15, 1582 (e.g., 
> "0602-04-04").
>  * An application that uses a specific unlikely date (e.g., "1200-01-01") as 
> a marker to indicate "unknown date" (in lieu of null)
> Both sites ran into problems after another component in their system was 
> upgraded to use the proleptic Gregorian calendar. Spark applications that 
> read files created by the upgraded component were interpreting encoded or 
> marker dates incorrectly, and vice versa. Also, their data now had a mix of 
> calendars (hybrid and proleptic Gregorian) with no metadata to indicate which 
> file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates 
> using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into 
> trouble when run on Spark 3. The application may not properly interpret the 
> dates previously written by Spark 2. Also, once the Spark 3 version of the 
> application writes data, the tables will have a mix of calendars (hybrid and 
> proleptic gregorian) with no metadata to indicate which file uses which 
> calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data 
> written by one version that cannot be interpreted by the other. And as above, 
> the tables will now have a mix of calendars with no way to detect which file 
> uses which calendar.
> As with the two real-life example cases, these applications may have enormous 
> amounts of legacy data, so re-encoding the dates using some other scheme may 
> not be feasible.
> We might want to consider a configuration setting to allow the user to 
> specify the calendar for storing and retrieving date and timestamp values 
> (not sure how such a flag would affect other date and timestamp-related 
> functions). I realize the change is far bigger than just adding a 
> configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case 
> of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 0
> scala> 
> {noformat}
> By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
> proleptic Gregorian calendar between 2.x and 3.x. After some upgrade 
> headaches related to dates before 1582, the Hive community made the following 
> changes:
>  * When writing date or timestamp data to ORC, Parquet, and Avro 

[jira] [Created] (SPARK-31326) create Function docs structure for SQL Reference

2020-04-02 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31326:
--

 Summary: create Function docs structure for SQL Reference
 Key: SPARK-31326
 URL: https://issues.apache.org/jira/browse/SPARK-31326
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


create Function docs structure for SQL Reference



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31299) Pyspark.ml.clustering illegalArgumentException with dataframe created from rows

2020-04-02 Thread Lukas Thaler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073422#comment-17073422
 ] 

Lukas Thaler commented on SPARK-31299:
--

Oh dear. Now, that's embarrassing. Thank you for pointing this out

> Pyspark.ml.clustering illegalArgumentException with dataframe created from 
> rows
> ---
>
> Key: SPARK-31299
> URL: https://issues.apache.org/jira/browse/SPARK-31299
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Lukas Thaler
>Priority: Major
>
> I hope this is the right place and way to report a bug in (at least) the 
> PySpark API:
> BisectingKMeans is used here only as an example; the error occurs 
> with all clustering algorithms:
> {code:python}
> from pyspark.sql import Row
> from pyspark.mllib.linalg import DenseVector
> from pyspark.ml.clustering import BisectingKMeans
> data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 
> 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
>  Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>  Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>  Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
>  Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
> kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
> model = kmeans.fit(data)
> {code}
> The .fit-call in the last line will fail with the following error:
> {code:java}
> Py4JJavaError: An error occurred while calling o51.fit.
> : java.lang.IllegalArgumentException: requirement failed: Column 
> test_features must be of type equal to one of the following types: 
> [struct&lt;type:tinyint,size:int,indices:array&lt;int&gt;,values:array&lt;double&gt;&gt;, 
> array&lt;double&gt;, array&lt;float&gt;] but was actually of type 
> struct&lt;type:tinyint,size:int,indices:array&lt;int&gt;,values:array&lt;double&gt;&gt;.
> {code}
> As can be seen, the data type reported as actually passed to the function is the 
> same as the first data type in the list of allowed data types, yet the call still 
> ends in an error because of it.
> See my [StackOverflow 
> issue|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml]
>  for more context



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31312) Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation

2020-04-02 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073409#comment-17073409
 ] 

Dongjoon Hyun commented on SPARK-31312:
---

Since I've seen your opinion, I won't ping you about it again.

> Transforming Hive simple UDF (using JAR) expression may incur CNFE in later 
> evaluation
> --
>
> Key: SPARK-31312
> URL: https://issues.apache.org/jira/browse/SPARK-31312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>
> In SPARK-26560, we ensured that Hive UDF using JAR is executed regardless of 
> current thread context classloader.
> [~cloud_fan] pointed out another potential issue in post-review of 
> SPARK-26560 - quoting the comment:
> {quote}
> Found a potential problem: here we call HiveSimpleUDF.dataType (which is a 
> lazy val) to force loading the class with the corrected class loader.
> However, if the expression gets transformed later, which copies 
> HiveSimpleUDF, then calling HiveSimpleUDF.dataType will re-trigger the class 
> loading, and at that time there is no guarantee that the corrected 
> classloader is used.
> I think we should materialize the loaded class in HiveSimpleUDF.
> {quote}
> This JIRA issue is to track the effort of verifying the potential issue and 
> fixing the issue.
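A minimal standalone sketch (not Spark code) of why a lazy val alone is not enough here: copying a case class produces a new instance, and the lazy val in the copy is evaluated again, in whatever context (for example, thread context classloader) happens to be current at that later point.

{code:scala}
case class FakeUdf(name: String) {
  // Re-initialized per instance: the cached value does not survive copy().
  lazy val resolved: String = {
    println(s"resolving '$name' on thread ${Thread.currentThread().getName}")
    name.toUpperCase
  }
}

object LazyValCopyDemo extends App {
  val original = FakeUdf("upper")
  original.resolved            // first resolution, in a controlled context
  val copied = original.copy() // mirrors an expression transformation copying the UDF
  copied.resolved              // resolves again: prints a second time
}
{code}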



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31312) Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation

2020-04-02 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073407#comment-17073407
 ] 

Dongjoon Hyun commented on SPARK-31312:
---

It's for informing the users (and the downstream distributors) of the risk and 
recommending that they upgrade their versions. If we set 2.4.5 only, it could also be 
read as a bug that occurred only in 2.4.5.

If we set 2.3.x at least, all 2.4.0 ~ 2.4.4 users will also understand the risk.

> Transforming Hive simple UDF (using JAR) expression may incur CNFE in later 
> evaluation
> --
>
> Key: SPARK-31312
> URL: https://issues.apache.org/jira/browse/SPARK-31312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>
> In SPARK-26560, we ensured that Hive UDF using JAR is executed regardless of 
> current thread context classloader.
> [~cloud_fan] pointed out another potential issue in post-review of 
> SPARK-26560 - quoting the comment:
> {quote}
> Found a potential problem: here we call HiveSimpleUDF.dataType (which is a 
> lazy val) to force loading the class with the corrected class loader.
> However, if the expression gets transformed later, which copies 
> HiveSimpleUDF, then calling HiveSimpleUDF.dataType will re-trigger the class 
> loading, and at that time there is no guarantee that the corrected 
> classloader is used.
> I think we should materialize the loaded class in HiveSimpleUDF.
> {quote}
> This JIRA issue is to track the effort of verifying the potential issue and 
> fixing the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org