[jira] [Created] (SPARK-32068) Spark 3 UI task launch time shown in wrong time zone
Smith Cruise created SPARK-32068: Summary: Spark 3 UI task launch time shown in wrong time zone Key: SPARK-32068 URL: https://issues.apache.org/jira/browse/SPARK-32068 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.0 Reporter: Smith Cruise !image-2020-06-23-13-53-24-417.png|width=2965,height=603! Here the time is shown correctly (in UTC), but if I enter the stage to see the task list: !image-2020-06-23-13-55-29-991.png! the task launch time is different from before (no longer in UTC). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
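The reported mismatch is consistent with one UI page formatting the task launch timestamp in UTC while another formats the same epoch value in a different zone. A minimal Java sketch of that failure mode (the epoch value, zone names, and class are illustrative, not taken from Spark's UI code):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class LaunchTimeZoneDemo {
    // Render the same epoch-millis launch time in a given zone, the way a
    // UI page might before printing it in a table cell.
    static String render(long epochMillis, ZoneId zone) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .format(Instant.ofEpochMilli(epochMillis).atZone(zone));
    }

    public static void main(String[] args) {
        long launchTime = 1592884800000L; // 2020-06-23 04:00:00 UTC
        // The same instant renders differently depending on the zone used:
        System.out.println(render(launchTime, ZoneId.of("UTC")));           // 2020-06-23 04:00:00
        System.out.println(render(launchTime, ZoneId.of("Asia/Shanghai"))); // 2020-06-23 12:00:00
    }
}
```

If two pages pick different zones (say, one uses UTC explicitly and the other falls back to the JVM default), the displayed launch times diverge exactly as the two screenshots show.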
[jira] [Updated] (SPARK-26905) Revisit reserved/non-reserved keywords based on the ANSI SQL standard
[ https://issues.apache.org/jira/browse/SPARK-26905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26905: Fix Version/s: 3.1.0 > Revisit reserved/non-reserved keywords based on the ANSI SQL standard > - > > Key: SPARK-26905 > URL: https://issues.apache.org/jira/browse/SPARK-26905 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.1, 3.1.0 > > Attachments: spark-ansiNonReserved.txt, spark-keywords-list.txt, > spark-nonReserved.txt, spark-strictNonReserved.txt, > sql2016-02-nonreserved.txt, sql2016-02-reserved.txt, > sql2016-09-nonreserved.txt, sql2016-09-reserved.txt, > sql2016-14-nonreserved.txt, sql2016-14-reserved.txt
[jira] [Updated] (SPARK-31950) Extract SQL keywords from the generated parser class in TableIdentifierParserSuite
[ https://issues.apache.org/jira/browse/SPARK-31950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31950: Fix Version/s: 3.1.0 > Extract SQL keywords from the generated parser class in > TableIdentifierParserSuite > -- > > Key: SPARK-31950 > URL: https://issues.apache.org/jira/browse/SPARK-31950 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.0.1, 3.1.0
[jira] [Updated] (SPARK-31230) use statement plans in DataFrameWriter(V2)
[ https://issues.apache.org/jira/browse/SPARK-31230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31230: Fix Version/s: 3.1.0 > use statement plans in DataFrameWriter(V2) > -- > > Key: SPARK-31230 > URL: https://issues.apache.org/jira/browse/SPARK-31230 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.1, 3.1.0
[jira] [Updated] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-31584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31584: Fix Version/s: 3.1.0 > NullPointerException when parsing event log with InMemoryStore > -- > > Key: SPARK-31584 > URL: https://issues.apache.org/jira/browse/SPARK-31584 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > Attachments: errorstack.txt > > > I compiled the current branch-3.0 source and tested it on macOS. A > java.lang.NullPointerException is thrown when the conditions below are met: > # InMemoryStore is used as the kvstore when parsing the event log file (e.g., > when spark.history.store.path is unset). > # At least one stage in the event log has more tasks than > spark.ui.retainedTasks (100000 by default), so the kvstore needs to delete > extra task records. > # The job has more than one stage, so parentToChildrenMap in > InMemoryStore.java has more than one key. > The java.lang.NullPointerException is thrown at InMemoryStore.java:296, in > the method deleteParentIndex(): > {code:java} > private void deleteParentIndex(Object key) { > if (hasNaturalParentIndex) { > for (NaturalKeys v : parentToChildrenMap.values()) { > if (v.remove(asKey(key))) { > // `v` can be empty after removing the natural key and we can > remove it from > // `parentToChildrenMap`. However, `parentToChildrenMap` is a > ConcurrentMap and such > // checking and deleting can be slow. > // This method is to delete one object with certain key, let's > make it simple here. > break; > } > } > } > }{code} > In "if (v.remove(asKey(key)))", if the key is not contained in v, > "v.remove(asKey(key))" returns null, and auto-unboxing that null Boolean in > the if condition throws a NullPointerException. > An exception stack trace is attached. 
> This issue can be fixed by updating the if statement to > {code:java} > if (v.remove(asKey(key)) != null){code}
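As I understand InMemoryStore, NaturalKeys maps keys to Boolean placeholder values, so the `if` condition auto-unboxes the result of `remove()`; for an absent key, `remove()` returns null and the unboxing throws. A minimal standalone sketch of the failure mode and the fix (the map below is a stand-in, not Spark's actual class):

```java
import java.util.concurrent.ConcurrentHashMap;

public class RemoveUnboxDemo {
    public static void main(String[] args) {
        // Stand-in for InMemoryStore's NaturalKeys: a map with Boolean values.
        ConcurrentHashMap<String, Boolean> v = new ConcurrentHashMap<>();

        // Buggy pattern: "if (v.remove(key))" auto-unboxes the returned
        // Boolean; remove() returns null for an absent key, so this throws.
        try {
            if (v.remove("absent-key")) {
                System.out.println("removed");
            }
        } catch (NullPointerException e) {
            System.out.println("NPE from unboxing null");
        }

        // Fixed pattern: compare the result against null instead of unboxing.
        if (v.remove("absent-key") != null) {
            System.out.println("removed");
        } else {
            System.out.println("key was absent; no exception");
        }
    }
}
```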
[jira] [Updated] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31918: Target Version/s: 3.0.1 > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung]
[jira] [Commented] (SPARK-29226) Upgrade jackson-databind to 2.9.10 and fix vulnerabilities.
[ https://issues.apache.org/jira/browse/SPARK-29226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142631#comment-17142631 ] ThimmeGowda commented on SPARK-29226: - Hi, I am using Spark 2.4.5; can I upgrade jackson-databind from 2.6.7.3 to 2.9.10? I tried changing it as above in all the files mentioned in the PR, but got a compilation error for spark-core: maven-dependency-plugin:3.0.2:build-classpath (default-cli) @ spark-core_2.11 The dependency classpath does not have scala-reflect-2.11.12.jar Thanks > Upgrade jackson-databind to 2.9.10 and fix vulnerabilities. > --- > > Key: SPARK-29226 > URL: https://issues.apache.org/jira/browse/SPARK-29226 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.0.0 > > > The current code uses com.fasterxml.jackson.core:jackson-databind:jar:2.9.9.3, > which has known security vulnerabilities. See > https://www.tenable.com/cve/CVE-2019-16335 > This reference recommends upgrading `jackson-databind` to 2.9.10 or later.
[jira] [Comment Edited] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142620#comment-17142620 ] Hyukjin Kwon edited comment on SPARK-31918 at 6/23/20, 4:35 AM: I tested it manually with the fix I mentioned [here|https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142127&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17142127] .. let me test that case too. BTW, I just roughly tested instead of running the full tests. Some corner cases might not work when running SparkR built by R 4.0.1 on R 3.6.3. Let me test a bit more closely and share the results later. was (Author: hyukjin.kwon): I tested it manually with the fix I mentioned [here|https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142127&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17142127] .. let me test that case too. BTW, I just roughly tested instead of running the full tests. Some corner cases might not work when running SparkR built by R 4.0.1 on R 3.6.3.
[jira] [Comment Edited] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142620#comment-17142620 ] Hyukjin Kwon edited comment on SPARK-31918 at 6/23/20, 4:35 AM: I tested it manually with the fix I mentioned [here|https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142127&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17142127] .. let me test that case too. BTW, I just roughly tested instead of running the full tests. Some corner cases might not work when running SparkR built by R 4.0.1 on R 3.6.3. was (Author: hyukjin.kwon): I tested it manually with the fix I mentioned [here|https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142127&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17142127] .. let me test that case too.
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142620#comment-17142620 ] Hyukjin Kwon commented on SPARK-31918: -- I tested it manually with the fix I mentioned [here|https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142127&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17142127] .. let me test that case too.
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142618#comment-17142618 ] Shivaram Venkataraman commented on SPARK-31918: --- That's great! [~hyukjin.kwon] -- so we can get around the installation issue if we can build on R 4.0.0. However I guess we will still have the serialization issue. BTW does the serialization issue go away if we build in R 4.0.0 and run with R 3.6.3?
[jira] [Commented] (SPARK-31801) Register shuffle map output metadata with a shuffle output tracker
[ https://issues.apache.org/jira/browse/SPARK-31801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142616#comment-17142616 ] Apache Spark commented on SPARK-31801: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/28902 > Register shuffle map output metadata with a shuffle output tracker > -- > > Key: SPARK-31801 > URL: https://issues.apache.org/jira/browse/SPARK-31801 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > > Part of the design as discussed in [this > document|https://docs.google.com/document/d/1Aj6IyMsbS2sdIfHxLvIbHUNjHIWHTabfknIPoxOrTjk/edit#]. > Establish a {{ShuffleOutputTracker}} API that resides on the driver, and > handle accepting map output metadata returned by the map output writers and > send them to the output tracker module accordingly. > Requires https://issues.apache.org/jira/browse/SPARK-31798.
[jira] [Comment Edited] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142605#comment-17142605 ] Hyukjin Kwon edited comment on SPARK-31918 at 6/23/20, 4:08 AM: Okay, [~shivaram], the first option seems to be working, although it shows a warning such as the one below. I built Spark 3.0.0 with R 4.0.1, and manually downgraded to R 3.6.3. {code:java} During startup - Warning message: package ‘SparkR’ was built under R version 4.0.1 {code} I removed unrelated comments I left above. was (Author: hyukjin.kwon): Okay, [~shivaram], the first option seems working although it shows a warning such as below. I build Spark 3.0.0 with 4.0.1, and manually downgraded to R 3.6.3. {code:java} During startup - Warning message: package ‘SparkR’ was built under R version 4.0.1 {code} I removed unrelated comments I left above.
[jira] [Issue Comment Deleted] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31918: - Comment: was deleted (was: Nice, [~shivaram]. I just quickly tested, and the first option is not working. 1. Build Spark 3.0.0 in R 4.0.1 and install it from source with R 3.4.0 on another machine: {code} install.packages("SparkR_3.0.0.tar.gz", repos = NULL, type = "source") {code} {code} df <- createDataFrame(lapply(seq(100), function (e) list(value=e))) count(dapply(df, function(x) as.data.frame(x[x$value < 50,]), schema(df))) {code} It shows the same error as shown in https://cran.r-project.org/web/checks/check_results_SparkR.html 2. Build Spark 3.0.0 in R 4.0.1, load the library directly with R 3.4.0 on another machine: {code} library(SparkR, lib.loc = c(file.path("~/spark-3.0.0-bin-hadoop2.7", "R", "lib"))) {code} {code} # this error message is translated from another language. My R on Mac is in Korean Error listing packages, Error in readRDS(pfile): cannot read workspace version 3 written by R 4.0.1. R version should be 3.5+ {code} 3. Download the Spark 3.0.0 release, load the library directly with R 3.4.0 on another machine: {code} library(SparkR, lib.loc = c(file.path("~/spark-3.0.0-bin-hadoop2.7", "R", "lib"))) {code} {code} # this error message is translated from another language. My R on Mac is in Korean Error listing packages, Error in readRDS(pfile): cannot read workspace version 3 written by R 3.6.3. R version should be 3.5+ {code} )
[jira] [Issue Comment Deleted] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31918: - Comment: was deleted (was: Oh, wait, the worker should test SparkR built with R 4.0.1. In the first case, I guess R worker loaded the one from 3.0.0 download (which is R 3.6.3). Let me test it via overwriting it.)
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142605#comment-17142605 ] Hyukjin Kwon commented on SPARK-31918: -- Okay, [~shivaram], the first option seems working although it shows a warning such as below. I build Spark 3.0.0 with 4.0.1, and manually downgraded to R 3.6.3. {code:java} During startup - Warning message: package ‘SparkR’ was built under R version 4.0.1 {code} I removed unrelated comments I left above.
[jira] [Assigned] (SPARK-32064) Supporting create temporary table
[ https://issues.apache.org/jira/browse/SPARK-32064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32064: Assignee: Apache Spark > Supporting create temporary table > - > > Key: SPARK-32064 > URL: https://issues.apache.org/jira/browse/SPARK-32064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Major > > The basic code to implement the Spark native temporary table. See SPARK-32063
[jira] [Commented] (SPARK-32064) Supporting create temporary table
[ https://issues.apache.org/jira/browse/SPARK-32064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142603#comment-17142603 ] Apache Spark commented on SPARK-32064: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/28901
[jira] [Assigned] (SPARK-32064) Supporting create temporary table
[ https://issues.apache.org/jira/browse/SPARK-32064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32064: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142591#comment-17142591 ] Hyukjin Kwon commented on SPARK-31918: -- Oh, wait, the worker should test SparkR built with R 4.0.1. In the first case, I guess R worker loaded the one from 3.0.0 download (which is R 3.6.3). Let me test it via overwriting it.
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Summary: [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission (was: [K8s] Pod template from subsequent submission inadvertently applies to the ongoing submission) > [K8s] Pod template from subsequent submission inadvertently applies to > ongoing submission > --- > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submitting two different apps (app1 and app2) > with different pod templates to K8s sequentially, where app2 launches while > app1 is still ramping up all its executor pods. The unwanted result is that > some launched executor pods of app1 appear to have app2's pod template > applied. > The root cause is that app1's podspec-configmap gets overwritten by app2 > during the launch period because the configmap names of the two apps are > the same. This causes some of app1's executor pods that ramp up after app2 > launches to be inadvertently created with app2's pod template. > First, submit app1 > {code:java} > NAMESPACE NAME DATA AGE > default app1--driver-conf-map 1 9m46s > default podspec-configmap 1 12m{code} > Then submit app2 while app1 is still ramping up its executors > {code:java} > NAMESPACE NAME DATA AGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default podspec-configmap 1 13m57s{code} > > PROPOSED SOLUTION: > Properly prefix the podspec-configmap for each submitted app. 
> {code:java} > NAMESPACE NAME DATA AGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default app1--podspec-configmap 1 13m57s > default app2--podspec-configmap 1 13m57s{code}
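The proposed fix amounts to deriving the configmap name from the app's resource prefix instead of using one shared constant. A sketch of the idea (the class, helper name, and prefix scheme are illustrative, not Spark's actual code):

```java
public class PodSpecConfigMapNames {
    // Shared-name scheme (the buggy behavior): every submission writes to
    // the same configmap, so a later app overwrites an earlier one's.
    static final String SHARED_NAME = "podspec-configmap";

    // Per-app scheme (the proposed fix): prefix with the app's resource
    // prefix so concurrent submissions get distinct configmaps.
    static String perAppName(String appResourcePrefix) {
        return appResourcePrefix + "--podspec-configmap";
    }

    public static void main(String[] args) {
        System.out.println(perAppName("app1")); // app1--podspec-configmap
        System.out.println(perAppName("app2")); // app2--podspec-configmap
        // With distinct names, app2's submission can no longer clobber the
        // pod template that app1's still-launching executors read.
    }
}
```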
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to the ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Description: THE BUG: The bug is reproducible by spark-submitting two different apps (app1 and app2) with different pod templates to K8s sequentially, where app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap gets overwritten by app2 during the launch period because the configmap names of the two apps are the same. This causes some of app1's executor pods that ramp up after app2 launches to be inadvertently created with app2's pod template. First, submit app1 {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap 1 13m57s default app2--podspec-configmap 1 13m57s{code} was: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different pod templates to K8s sequentially, and app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the launching period because the configmap names of the two apps are the same. 
This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. # Launch app1 {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} # Then launch app2 while app1 is still ramping up its executors {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap 1 13m57s default app2--podspec-configmap 1 13m57s{code}
[jira] [Comment Edited] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142417#comment-17142417 ] YoungGyu Chun edited comment on SPARK-31998 at 6/23/20, 3:14 AM: - [~fan_li_ya] [~kou] I assume that this change will be applied to v1.0. Let us know when v1.0 will be released. was (Author: younggyuchun): [~fan_li_ya] [~kou] let us know when v1.0 will be released. > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we moved the class ArrowBuf from the package io.netty.buffer to > org.apache.arrow.memory. So after upgrading the Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32067) [K8s] Pod template from a subsequent submission inadvertently applies to the ongoing submission
James Yu created SPARK-32067: Summary: [K8s] Pod template from a subsequent submission inadvertently applies to the ongoing submission Key: SPARK-32067 URL: https://issues.apache.org/jira/browse/SPARK-32067 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0, 2.4.6 Reporter: James Yu THE BUG: The bug is reproducible by spark-submitting two different apps (app1 and app2) with different pod templates to K8s sequentially, such that app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 end up with app2's pod template applied. The root cause is that app1's podspec-configmap gets overwritten by app2 during the launch period, because the configmap names of the two apps are identical. As a result, any app1 executor pods ramped up after app2 is launched are inadvertently created with app2's pod template. # Launch app1 {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} # Then launch app2 while app1 is still ramping up its executors {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap 1 13m57s default app2--podspec-configmap 1 13m57s{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
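The proposed fix above amounts to deriving the podspec configmap name from a per-application resource prefix, the way the driver conf map already is. A minimal sketch in Python (the helper name and prefix scheme are illustrative assumptions, not Spark's actual Kubernetes submission code):

```python
# Hypothetical helper: derive per-app Kubernetes configmap names so that two
# concurrent spark-submits never share a podspec configmap.

def configmap_names(app_resource_prefix):
    """Return the driver conf map and podspec configmap names for one app."""
    return {
        "driver_conf_map": f"{app_resource_prefix}-driver-conf-map",
        # Before the fix, the podspec configmap was the constant name
        # "podspec-configmap", shared by every submission in the namespace.
        "podspec_configmap": f"{app_resource_prefix}-podspec-configmap",
    }

app1 = configmap_names("app1-")
app2 = configmap_names("app2-")
# With per-app prefixes, app2 can no longer overwrite app1's podspec configmap.
assert app1["podspec_configmap"] != app2["podspec_configmap"]
```

Since ConfigMap names must be unique only within a namespace, any collision-free per-submission prefix (such as the one already used for the driver conf map) would suffice.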
[jira] [Comment Edited] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142417#comment-17142417 ] YoungGyu Chun edited comment on SPARK-31998 at 6/23/20, 3:12 AM: - [~fan_li_ya] [~kou] let us know when v1.0 will be released. was (Author: younggyuchun): I will be working on this when v1.0 is out > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32059: Issue Type: Improvement (was: Bug) > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Yin >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet 
[id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
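The two plans above differ only in the FileScan's ReadSchema: the desired plan pushes the `a.name.first` access through the Window operator so that only that nested leaf is read from Parquet. As a rough illustration of what nested schema pruning computes (a hypothetical sketch, not Spark's actual SchemaPruning rule):

```python
# Hypothetical illustration of nested schema pruning: given a schema and the
# dotted field paths a query actually touches, keep only those leaves.

def prune(schema, paths):
    """schema: dict mapping field name -> nested dict (struct) or type string.
    paths: iterable of dotted paths, e.g. {"name.first", "address"}."""
    pruned = {}
    for path in paths:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue
        if rest:
            # Recurse into the struct and merge the surviving sub-fields.
            pruned.setdefault(head, {}).update(prune(schema[head], {rest}))
        else:
            pruned[head] = schema[head]
    return pruned

contact = {"id": "int",
           "name": {"first": "string", "middle": "string", "last": "string"},
           "address": "string"}
# The window query needs only name.first plus the partition/order keys.
needed = {"name.first", "address", "id"}
assert prune(contact, needed) == \
    {"name": {"first": "string"}, "address": "string", "id": "int"}
```

In the current plan the scan reads the whole `name` struct; in the desired plan the struct is pruned to `first` before the scan, which is exactly the reduction the ticket asks for.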
[jira] [Commented] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142571#comment-17142571 ] Apache Spark commented on SPARK-32056: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/28900 > Repartition by key should support partition coalesce for AQE > > > Key: SPARK-32056 > URL: https://issues.apache.org/jira/browse/SPARK-32056 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark release 3.0.0 >Reporter: koert kuipers >Priority: Minor > > when adaptive query execution is enabled the following expression should > support coalescing of partitions: > {code:java} > dataframe.repartition(col("somecolumn")) {code} > currently it does not because it simply calls the repartition implementation > where the number of partitions is specified: > {code:java} > def repartition(partitionExprs: Column*): Dataset[T] = { > repartition(sparkSession.sessionState.conf.numShufflePartitions, > partitionExprs: _*) > }{code} > and repartition with the number of partitions specified does not allow for > coalescing of partitions (since this would break the user's expectation that it > will have the number of partitions specified). 
> for more context see the discussion here: > [https://github.com/apache/spark/pull/27986] > a simple test to confirm that repartition by key does not support coalescing > of partitions can be added in AdaptiveQueryExecSuite like this (it currently > fails): > {code:java} > test("SPARK-32056 repartition has less partitions for small data when > adaptiveExecutionEnabled") { > Seq(true, false).foreach { enableAQE => > withSQLConf( > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> enableAQE.toString, > SQLConf.SHUFFLE_PARTITIONS.key -> "50", > SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "50", > SQLConf.SHUFFLE_PARTITIONS.key -> "50") { > val partitionsNum = (1 to 10).toDF.repartition($"value") > .rdd.collectPartitions().length > if (enableAQE) { > assert(partitionsNum < 50) > } else { > assert(partitionsNum === 50) > } > } > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
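For intuition, AQE's partition coalescing merges adjacent small shuffle partitions until a size target is met, which is the behaviour a plain `repartition(col(...))` cannot benefit from today. A toy model of that behaviour (assumed greedy merging for illustration; Spark's actual algorithm lives in its coalesce-partitions rule and may differ in detail):

```python
# Toy model of AQE-style partition coalescing: greedily merge adjacent
# shuffle partitions until each merged partition reaches a target size.

def coalesce_partitions(sizes, target):
    merged, current = [], 0
    for size in sizes:
        current += size
        if current >= target:
            merged.append(current)
            current = 0
    if current:
        merged.append(current)  # trailing partial partition
    return merged

# 50 tiny shuffle partitions (the static spark.sql.shuffle.partitions value),
# each holding ~2 bytes, coalesced toward a 16-byte target:
sizes = [2] * 50
out = coalesce_partitions(sizes, 16)
assert len(out) < 50           # far fewer partitions than the static 50
assert sum(out) == sum(sizes)  # no data lost by merging
```

This is why the test in the description expects `partitionsNum < 50` when AQE is enabled: for small data, most of the 50 shuffle partitions are nearly empty and can be merged.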
[jira] [Assigned] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32056: Assignee: Apache Spark > Repartition by key should support partition coalesce for AQE > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32056: Assignee: (was: Apache Spark) > Repartition by key should support partition coalesce for AQE > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142570#comment-17142570 ] Apache Spark commented on SPARK-32056: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/28900 > Repartition by key should support partition coalesce for AQE > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32065) Supporting analyze temporary table
Lantao Jin created SPARK-32065: -- Summary: Supporting analyze temporary table Key: SPARK-32065 URL: https://issues.apache.org/jira/browse/SPARK-32065 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Lantao Jin Supporting analyze temporary table -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142566#comment-17142566 ] Hyukjin Kwon commented on SPARK-31918: -- Nice, [~shivaram]. I just quickly tested, and the first option is not working. 1. Build Spark 3.0.0 in R 4.0.1 and install it from source with R 3.4.0 in another machine: {code} install.packages("SparkR_3.0.0.tar.gz", repos = NULL, type = "source") {code} {code} df <- createDataFrame(lapply(seq(100), function (e) list(value=e))) count(dapply(df, function(x) as.data.frame(x[x$value < 50,]), schema(df))) {code} It shows the same error as shown in https://cran.r-project.org/web/checks/check_results_SparkR.html 2. Build Spark 3.0.0 in R 4.0.1, loads the library directly with R 3.4.0 in another machine: {code} library(SparkR, lib.loc = c(file.path("~/spark-3.0.0-bin-hadoop2.7", "R", "lib"))) {code} {code} # this error message is translated from another language. My R in Mac is in Korean Error listing packages, Error in readRDS(pfile): cannot read workspace version 3 written by R 4.0.1. R version should be 3.5+ {code} 3. Download Spark 3.0.0 release, loads the library directly with R 3.4.0 in another machine: {code} library(SparkR, lib.loc = c(file.path("~/spark-3.0.0-bin-hadoop2.7", "R", "lib"))) {code} {code} # this error message is translated from another language. My R in Mac is in Korean Error listing packages, Error in readRDS(pfile): cannot read workspace version 3 written by R 3.6.3. R version should be 3.5+ {code} > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. 
with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32066) Supporting create temporary table LIKE
Lantao Jin created SPARK-32066: -- Summary: Supporting create temporary table LIKE Key: SPARK-32066 URL: https://issues.apache.org/jira/browse/SPARK-32066 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Lantao Jin Supporting create temporary table LIKE -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32064) Supporting create temporary table
Lantao Jin created SPARK-32064: -- Summary: Supporting create temporary table Key: SPARK-32064 URL: https://issues.apache.org/jira/browse/SPARK-32064 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Lantao Jin The basic code to implement the Spark native temporary table. See SPARK-32063 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142558#comment-17142558 ] Shivaram Venkataraman commented on SPARK-31918: --- I can confirm that with build from source of Spark 3.0.0 and R 4.0.2, I see the following error while building vignettes. {{R worker produced errors: Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue}} > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32062) Reset listenerRegistered in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32062: Assignee: Apache Spark > Reset listenerRegistered in SparkSession > > > Key: SPARK-32062 > URL: https://issues.apache.org/jira/browse/SPARK-32062 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32062) Reset listenerRegistered in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32062: Assignee: (was: Apache Spark) > Reset listenerRegistered in SparkSession > > > Key: SPARK-32062 > URL: https://issues.apache.org/jira/browse/SPARK-32062 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32062) Reset listenerRegistered in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142552#comment-17142552 ] Apache Spark commented on SPARK-32062: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/28899 > Reset listenerRegistered in SparkSession > > > Key: SPARK-32062 > URL: https://issues.apache.org/jira/browse/SPARK-32062 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32062) Reset listenerRegistered in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142553#comment-17142553 ] Apache Spark commented on SPARK-32062: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/28899 > Reset listenerRegistered in SparkSession > > > Key: SPARK-32062 > URL: https://issues.apache.org/jira/browse/SPARK-32062 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32063) Spark native temporary table
Lantao Jin created SPARK-32063: -- Summary: Spark native temporary table Key: SPARK-32063 URL: https://issues.apache.org/jira/browse/SPARK-32063 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Lantao Jin Many databases and data warehouse SQL engines support temporary tables. A temporary table, as its name implies, is a short-lived table whose lifetime is limited to the current session. In Spark, there is no temporary table: the DDL “CREATE TEMPORARY TABLE AS SELECT” creates a temporary view instead. A temporary view is totally different from a temporary table. A temporary view is just a VIEW; it doesn’t materialize data in storage. This has the following shortcomings: # A view does not improve performance. Materializing intermediate data in temporary tables accelerates complex queries, especially in an ETL pipeline. # A view that calls other views can cause severe performance issues. Executing a very complex view may even fail in Spark. # A temporary view has no database namespace. In some complex ETL pipelines or data warehouse applications, working without a database prefix is inconvenient, and some tables are only needed in the current session. More details are described in the [Design Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
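The view-versus-table distinction argued above boils down to whether the query result is materialized once or recomputed on every read. A toy contrast (conceptual sketch only; these classes are not Spark APIs):

```python
# Conceptual contrast between a temporary view and a temporary table.
# These classes are illustrative only, not part of any Spark API.

class TempView:
    """Stores only the query; every read re-runs it."""
    def __init__(self, query):
        self.query = query
        self.runs = 0
    def read(self):
        self.runs += 1
        return self.query()

class TempTable:
    """Materializes the query result once; later reads hit storage."""
    def __init__(self, query):
        self.data = query()  # computed once, at CREATE TEMPORARY TABLE time
    def read(self):
        return self.data

expensive_calls = []
def expensive_query():
    expensive_calls.append(1)  # stand-in for a costly ETL stage
    return [1, 2, 3]

view = TempView(expensive_query)
view.read(); view.read()            # recomputes twice
table = TempTable(expensive_query)  # computes once
table.read(); table.read()          # reads are free
assert len(expensive_calls) == 3
```

This is the first shortcoming in the list above: a pipeline that reads an intermediate result N times pays for the query N times through a view, but once through a materialized temporary table.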
[jira] [Created] (SPARK-32062) Reset listenerRegistered in SparkSession
ulysses you created SPARK-32062: --- Summary: Reset listenerRegistered in SparkSession Key: SPARK-32062 URL: https://issues.apache.org/jira/browse/SPARK-32062 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: ulysses you -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142533#comment-17142533 ] Toby Harradine edited comment on SPARK-25244 at 6/23/20, 2:09 AM: -- Hi, I've just come across this issue in PySpark 2.4.6 (Spark 2.4.4), quite a difficult bug to work around when trying to validate datetimes in unit tests, which run on different machines with different timezones (and I'd prefer not to require use of Pandas to run unit tests). Was this issue closed without resolution? _Edit: Just tested on PySpark 3.0.0 with same outcome_. Regards, Toby was (Author: toby.harradine): Hi, I've just come across this issue in PySpark 2.4.6 (Spark 2.4.4), quite a difficult bug to work around when trying to validate datetimes in unit tests, which run on different machines with different timezones (and I'd prefer not to require use of Pandas to run unit tests). Was this issue closed without resolution? Regards, Toby > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-25244 > URL: https://issues.apache.org/jira/browse/SPARK-25244 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Anton Daitche >Priority: Major > Labels: bulk-closed > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. 
> This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. > The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySparks `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would try to come up > with a patch. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142533#comment-17142533 ] Toby Harradine commented on SPARK-25244: Hi, I've just come across this issue in PySpark 2.4.6 (Spark 2.4.4), quite a difficult bug to work around when trying to validate datetimes in unit tests, which run on different machines with different timezones (and I'd prefer not to require use of Pandas to run unit tests). Was this issue closed without resolution? Regards, Toby > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-25244 > URL: https://issues.apache.org/jira/browse/SPARK-25244 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Anton Daitche >Priority: Major > Labels: bulk-closed > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. > This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. 
> The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySpark's `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would try to come up > with a patch. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
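As a stopgap for unit tests like the ones the commenter describes (this is not Spark's fix; `to_utc_naive` is a hypothetical helper), one can pin the process timezone and undo the system-zone rendering by round-tripping through the epoch. A minimal sketch, assuming a POSIX system where `time.tzset()` is available:

```python
import os
import time
from datetime import datetime

# Reproduce the reporter's environment (Europe/Berlin); tzset() is POSIX-only.
os.environ["TZ"] = "Europe/Berlin"
time.tzset()

def to_utc_naive(local_naive: datetime) -> datetime:
    """Reinterpret a naive datetime rendered in the system zone as UTC."""
    # mktime() interprets the naive struct_time in the system zone
    # (resolving DST), so the epoch round-trip removes the local offset.
    epoch = time.mktime(local_naive.timetuple())
    return datetime.utcfromtimestamp(epoch)

# The value collect() returned in the report (03:00 CEST) maps back to 01:00 UTC:
print(to_utc_naive(datetime(2018, 6, 1, 3, 0, 0)))  # 2018-06-01 01:00:00
```

This only normalizes whole seconds (`timetuple()` drops microseconds), which is usually enough for test fixtures.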
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142532#comment-17142532 ] Shivaram Venkataraman commented on SPARK-31918: --- [~hyukjin.kwon] I have R 4.0.2 and will try to do a fresh build from source of Spark 3.0.0 > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32061) potential regression if use memoryUsage instead of numRows
zhengruifeng created SPARK-32061: Summary: potential regression if use memoryUsage instead of numRows Key: SPARK-32061 URL: https://issues.apache.org/jira/browse/SPARK-32061 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.1.0 Reporter: zhengruifeng 1, if the `memoryUsage` is improperly set, for example, too small to store an instance; 2, the blockify+GMM reuses two matrices whose shape is related to the current blockSize: {code:java} @transient private lazy val auxiliaryProbMat = DenseMatrix.zeros(blockSize, k) @transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, numFeatures) {code} When implementing blockify+GMM, I found that if I do not pre-allocate those matrices, there is a serious regression (maybe 3~4x slower; I forgot the exact numbers); 3, in MLP, three pre-allocated objects are also related to numRows: {code:java} if (ones == null || ones.length != delta.cols) ones = BDV.ones[Double](delta.cols) // TODO: allocate outputs as one big array and then create BDMs from it if (outputs == null || outputs(0).cols != currentBatchSize) { ... // TODO: allocate deltas as one big array and then create BDMs from it if (deltas == null || deltas(0).cols != currentBatchSize) { deltas = new Array[BDM[Double]](layerModels.length) ... {code} I am not very familiar with the implementation of MLP and could not find any documentation about this pre-allocation. But I guess there may be a regression if we disable this pre-allocation, since those objects look relatively big. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
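The pre-allocation pattern the ticket describes (allocate scratch matrices once, sized from the block size, and overwrite them per block) can be sketched in pure Python. All names here are illustrative analogues, not Spark's actual code:

```python
# Hypothetical analogue of auxiliaryProbMat: a scratch buffer sized once
# from the block size and overwritten in place for every block, instead of
# being reallocated on each call.
block_size, k, num_features = 4, 3, 5
aux_prob = [[0.0] * k for _ in range(block_size)]  # ~ auxiliaryProbMat

def process_block(block, weights):
    """Multiply block (block_size x num_features) by weights
    (num_features x k), writing the result into the pre-allocated buffer."""
    for i in range(block_size):
        for j in range(k):
            aux_prob[i][j] = sum(block[i][f] * weights[f][j]
                                 for f in range(num_features))
    return aux_prob  # the same object on every call: no per-block allocation

block = [[1.0] * num_features for _ in range(block_size)]
weights = [[1.0] * k for _ in range(num_features)]
out = process_block(block, weights)  # every entry is 5.0
```

The buffer shape depends on `block_size`, which is why switching the blocking criterion from `numRows` to `memoryUsage` (a variable row count per block) would defeat this reuse.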
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142520#comment-17142520 ] Dongjoon Hyun commented on SPARK-31918: --- Unfortunately, no~ I downgraded to R 3.5.2 on both my MacPro and MacBook. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Description: |performace test in https://issues.apache.org/jira/browse/SPARK-31783, Huber loss seems start to diverge since 50 iters. {code:java} for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { Thread.sleep(1) val hlir = new LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) val start = System.currentTimeMillis val model = hlir.setBlockSize(size).fit(df) val end = System.currentTimeMillis println((model.uid, size, iter, end - start, model.summary.objectiveHistory.last, model.summary.totalIterations, model.coefficients.toString.take(100))) }{code}| | | | | | | | | | | | | | | | | |result:| |blockSize=1| |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)| |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)| |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)| blockSize=4| |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)| |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)| |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)| blockSize=16| |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)| 
|(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)| |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)| blockSize=64| |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)| |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)| |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)| was: |Huber loss seems start to diverge since 50 iters. {code:java} for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { Thread.sleep(1) val hlir = new LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) val start = System.currentTimeMillis val model = hlir.setBlockSize(size).fit(df) val end = System.currentTimeMillis println((model.uid, size, iter, end - start, model.summary.objectiveHistory.last, model.summary.totalIterations, model.coefficients.toString.take(100))) }{code}| | | | | | | | | | | | | | | | | |result:| |blockSize=1| |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)| |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)| |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)| | blockSize=4| 
|(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)| |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)| |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)| | blockSize=16| |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)| |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)| |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)| | blockSize=64|
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Parent: SPARK-30641 Issue Type: Sub-task (was: Bug) > Huber loss Convergence > -- > > Key: SPARK-32060 > URL: https://issues.apache.org/jira/browse/SPARK-32060 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > |performace test in https://issues.apache.org/jira/browse/SPARK-31783, > Huber loss seems start to diverge since 50 iters. > {code:java} > for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { > Thread.sleep(1) > val hlir = new > LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) > val start = System.currentTimeMillis > val model = hlir.setBlockSize(size).fit(df) > val end = System.currentTimeMillis > println((model.uid, size, iter, end - start, > model.summary.objectiveHistory.last, model.summary.totalIterations, > model.coefficients.toString.take(100))) > }{code}| > | | > | | > | | > | | > | | > | | > | | > | | > |result:| > |blockSize=1| > |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)| > |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)| > |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)| > blockSize=4| > |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)| > |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)| > 
|(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)| > blockSize=16| > |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)| > |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)| > |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)| > blockSize=64| > |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)| > |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)| > |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32060) Huber loss Convergence
zhengruifeng created SPARK-32060: Summary: Huber loss Convergence Key: SPARK-32060 URL: https://issues.apache.org/jira/browse/SPARK-32060 Project: Spark Issue Type: Bug Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng |Huber loss seems start to diverge since 50 iters. {code:java} for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { Thread.sleep(1) val hlir = new LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) val start = System.currentTimeMillis val model = hlir.setBlockSize(size).fit(df) val end = System.currentTimeMillis println((model.uid, size, iter, end - start, model.summary.objectiveHistory.last, model.summary.totalIterations, model.coefficients.toString.take(100))) }{code}| | | | | | | | | | | | | | | | | |result:| |blockSize=1| |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)| |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)| |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)| | blockSize=4| |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)| |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)| |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)| | blockSize=16| |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)| 
|(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)| |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)| | blockSize=64| |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)| |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)| |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
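For reference on the objective whose history is diverging above, the per-residual Huber loss is quadratic near zero and linear in the tails. A minimal sketch; the 1.35 default for `epsilon` is an assumption matching Spark's documented `LinearRegression` default, not taken from this ticket:

```python
def huber_loss(residual: float, epsilon: float = 1.35) -> float:
    """Huber loss: quadratic for |r| <= epsilon, linear beyond it."""
    a = abs(residual)
    if a <= epsilon:
        return 0.5 * a * a
    # Linear tail, shifted so the two pieces join continuously at epsilon.
    return epsilon * (a - 0.5 * epsilon)
```

The linear tail makes large residuals contribute bounded gradients, which is why small numerical differences between block sizes can steer L-BFGS onto different paths once residuals cross the `epsilon` boundary.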
[jira] [Assigned] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32059: Assignee: (was: Apache Spark) > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Yin >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet 
[id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142515#comment-17142515 ] Apache Spark commented on SPARK-32059: -- User 'frankyin-factual' has created a pull request for this issue: https://github.com/apache/spark/pull/28898 > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Yin >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange 
hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32059: Assignee: Apache Spark > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Yin >Assignee: Apache Spark >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet 
[id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142514#comment-17142514 ] Apache Spark commented on SPARK-32059: -- User 'frankyin-factual' has created a pull request for this issue: https://github.com/apache/spark/pull/28898 > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Yin >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange 
hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27148) Support CURRENT_TIME and LOCALTIME when ANSI mode enabled
[ https://issues.apache.org/jira/browse/SPARK-27148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YoungGyu Chun resolved SPARK-27148. --- Resolution: Later > Support CURRENT_TIME and LOCALTIME when ANSI mode enabled > - > > Key: SPARK-27148 > URL: https://issues.apache.org/jira/browse/SPARK-27148 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Major > > CURRENT_TIME and LOCALTIME should be supported in the ANSI standard; > {code:java} > postgres=# select CURRENT_TIME; > timetz > > 16:45:43.398109+09 > (1 row) > postgres=# select LOCALTIME; > time > > 16:45:48.60969 > (1 row){code} > Before this, we need to support TIME types (java.sql.Time). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
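As background on the two semantics the description contrasts (a hedged Python sketch, not Spark's API): CURRENT_TIME is a time-of-day that keeps its zone offset (Postgres's `timetz` above), while LOCALTIME is a zone-less local time-of-day:

```python
from datetime import datetime, timezone

# Illustrative only: model CURRENT_TIME as an offset-aware time-of-day and
# LOCALTIME as a naive (zone-less) time-of-day.
current_time = datetime.now(timezone.utc).timetz()  # like Postgres timetz
local_time = datetime.now().time()                  # like Postgres time

assert current_time.tzinfo is not None  # CURRENT_TIME carries an offset
assert local_time.tzinfo is None        # LOCALTIME does not
```

Supporting either in Spark SQL would first require a TIME data type, which is the dependency the description notes.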
[jira] [Commented] (SPARK-27148) Support CURRENT_TIME and LOCALTIME when ANSI mode enabled
[ https://issues.apache.org/jira/browse/SPARK-27148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142511#comment-17142511 ] YoungGyu Chun commented on SPARK-27148: --- A similar ticket was concluded by the following comment by [~rxin]. Let's close this for now [#25678 (comment)|https://github.com/apache/spark/pull/25678#issuecomment-531585556] > Support CURRENT_TIME and LOCALTIME when ANSI mode enabled > - > > Key: SPARK-27148 > URL: https://issues.apache.org/jira/browse/SPARK-27148 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Major > > CURRENT_TIME and LOCALTIME should be supported in the ANSI standard; > {code:java} > postgres=# select CURRENT_TIME; > timetz > > 16:45:43.398109+09 > (1 row) > postgres=# select LOCALTIME; > time > > 16:45:48.60969 > (1 row){code} > Before this, we need to support TIME types (java.sql.Time). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7101) Spark SQL should support java.sql.Time
[ https://issues.apache.org/jira/browse/SPARK-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YoungGyu Chun resolved SPARK-7101. -- Resolution: Later > Spark SQL should support java.sql.Time > -- > > Key: SPARK-7101 > URL: https://issues.apache.org/jira/browse/SPARK-7101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 > Environment: All >Reporter: Peter Hagelund >Priority: Major > > Several RDBMSes support the TIME data type; for more exact mapping between > those and Spark SQL, support for java.sql.Time with an associated > DataType.TimeType would be helpful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7101) Spark SQL should support java.sql.Time
[ https://issues.apache.org/jira/browse/SPARK-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142510#comment-17142510 ] YoungGyu Chun commented on SPARK-7101: -- A similar ticket was concluded by the following comment by [~rxin]. Let's close this for now [#25678 (comment)|https://github.com/apache/spark/pull/25678#issuecomment-531585556] > Spark SQL should support java.sql.Time > -- > > Key: SPARK-7101 > URL: https://issues.apache.org/jira/browse/SPARK-7101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 > Environment: All >Reporter: Peter Hagelund >Priority: Major > > Several RDBMSes support the TIME data type; for more exact mapping between > those and Spark SQL, support for java.sql.Time with an associated > DataType.TimeType would be helpful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
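If Spark did add a TIME type backed by java.sql.Time as requested above, the JDBC-side mapping is already straightforward in the JDK: java.sql.Time round-trips through java.time.LocalTime. A sketch of that correspondence (illustrative helper names, not existing Spark code):

```java
import java.sql.Time;
import java.time.LocalTime;

public class TimeTypeSketch {
    // A TIME column value as a JDBC driver would surface it.
    // Note: Time.valueOf drops sub-second precision.
    public static Time toSqlTime(LocalTime t) {
        return Time.valueOf(t);
    }

    public static LocalTime fromSqlTime(Time t) {
        return t.toLocalTime();
    }

    public static void main(String[] args) {
        LocalTime t = LocalTime.of(16, 45, 48);
        System.out.println(toSqlTime(t));               // 16:45:48
        System.out.println(fromSqlTime(toSqlTime(t)));  // 16:45:48
    }
}
```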
[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC
[ https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32013:
--------------------------------------
    Description:
For ETL workloads, there is a common requirement to run SQL statements before/after reading/writing over JDBC. Here are examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently, the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. However, this query is only for reading data, and it does not support the common examples listed above.

On the other hand, the `sessionInitStatement` option is available before writing data from a DataFrame. This option runs custom SQL to implement session-initialization code. Since it runs per session, it cannot be used for non-idempotent operations.

If Spark could execute SQL statements against JDBC data sources before/after reading/writing over JDBC, it would cover a lot of common use cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]

  was:
For ETL workloads, there is a common requirement to run SQL statements before/after reading/writing over JDBC. Here are examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently, the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. However, this query is only for reading data, and it does not support the common examples listed above.

On the other hand, the `sessionInitStatement` option is available before writing data from a DataFrame. This option runs custom SQL to implement session-initialization code. Since it runs per session, it cannot be used for write operations.

If Spark could execute SQL statements against JDBC data sources before/after reading/writing over JDBC, it would cover a lot of common use cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]


> Support query execution before/after reading/writing over JDBC
> ---------------------------------------------------------------
>
>                 Key: SPARK-32013
>                 URL: https://issues.apache.org/jira/browse/SPARK-32013
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Noritaka Sekiyama
>            Priority: Major
>
> For ETL workloads, there is a common requirement to run SQL statements
> before/after reading/writing over JDBC. Here are examples:
> - Create a view with specific conditions
> - Delete/update some records
> - Truncate a table (already possible via the `truncate` option)
> - Execute a stored procedure (also requested in SPARK-32014)
>
> Currently, the `query` option is available to specify a SQL statement against
> a JDBC data source when loading data as a DataFrame. However, this query is
> only for reading data, and it does not support the common examples listed
> above.
>
> On the other hand, the `sessionInitStatement` option is available before
> writing data from a DataFrame. This option runs custom SQL to implement
> session-initialization code. Since it runs per session, it cannot be used for
> non-idempotent operations.
>
> If Spark could execute SQL statements against JDBC data sources before/after
> reading/writing over JDBC, it would cover a lot of common use cases.
>
> Note: Databricks' old Redshift connector has similar options, `preactions`
> and `postactions`. [https://github.com/databricks/spark-redshift]


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC
[ https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32013:
--------------------------------------
    Description:
For ETL workloads, there is a common requirement to run SQL statements before/after reading/writing over JDBC. Here are examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently, the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. However, this query is only for reading data, and it does not support the common examples listed above.

On the other hand, the `sessionInitStatement` option is available before writing data from a DataFrame. This option runs custom SQL to implement session-initialization code. Since it runs per session, it cannot be used for write operations.

If Spark could execute SQL statements against JDBC data sources before/after reading/writing over JDBC, it would cover a lot of common use cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]

  was:
For ETL workloads, there is a common requirement to run SQL statements before/after reading/writing over JDBC. Here are examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently, the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] However, this query is only for reading data, and it does not support the common examples listed above.

If Spark could execute SQL statements against JDBC data sources before/after reading/writing over JDBC, it would cover a lot of common use cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]


> Support query execution before/after reading/writing over JDBC
> ---------------------------------------------------------------
>
>                 Key: SPARK-32013
>                 URL: https://issues.apache.org/jira/browse/SPARK-32013
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Noritaka Sekiyama
>            Priority: Major
>
> For ETL workloads, there is a common requirement to run SQL statements
> before/after reading/writing over JDBC. Here are examples:
> - Create a view with specific conditions
> - Delete/update some records
> - Truncate a table (already possible via the `truncate` option)
> - Execute a stored procedure (also requested in SPARK-32014)
>
> Currently, the `query` option is available to specify a SQL statement against
> a JDBC data source when loading data as a DataFrame. However, this query is
> only for reading data, and it does not support the common examples listed
> above.
>
> On the other hand, the `sessionInitStatement` option is available before
> writing data from a DataFrame. This option runs custom SQL to implement
> session-initialization code. Since it runs per session, it cannot be used for
> write operations.
>
> If Spark could execute SQL statements against JDBC data sources before/after
> reading/writing over JDBC, it would cover a lot of common use cases.
>
> Note: Databricks' old Redshift connector has similar options, `preactions`
> and `postactions`. [https://github.com/databricks/spark-redshift]


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
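The requested behaviour can be sketched generically: run one list of SQL statements, then the actual read or write, then another list. The `preactions`/`postactions` names below mirror the spark-redshift connector and are hypothetical, not an existing Spark option; `runSql` stands in for `Statement.execute` on a live JDBC connection:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class JdbcActionsSketch {
    // Runs preactions, then the actual JDBC read/write, then postactions.
    public static <T> T withActions(List<String> preactions,
                                    List<String> postactions,
                                    Consumer<String> runSql,
                                    Supplier<T> readOrWrite) {
        for (String sql : preactions) {
            runSql.accept(sql);           // e.g. CREATE VIEW, DELETE ...
        }
        T result = readOrWrite.get();     // the DataFrame load/save itself
        for (String sql : postactions) {
            runSql.accept(sql);           // e.g. DROP VIEW, CALL proc()
        }
        return result;
    }

    public static void main(String[] args) {
        // Record the order of operations instead of talking to a real database.
        List<String> executed = new ArrayList<>();
        String rows = withActions(
            List.of("CREATE VIEW v AS SELECT * FROM t WHERE ok"),
            List.of("DROP VIEW v"),
            executed::add,
            () -> { executed.add("<read v>"); return "42 rows"; });
        System.out.println(rows);
        System.out.println(executed);
    }
}
```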
[jira] [Comment Edited] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142495#comment-17142495 ] Hyukjin Kwon edited comment on SPARK-31918 at 6/23/20, 12:06 AM: - Ah, yeah. That one I read [it in the release notes|https://cran.r-project.org/doc/manuals/r-devel/NEWS.html] I was freshly building and testing the package with R 4.0.1 so that was why the error messages were different ... {quote} > Packages need to be (re-)installed under this version (4.0.0) of *R*. {quote} I have two environments in my local. One is R 4.0.1, the other one is R 3.4.0. Although it officially says R 3.1+, we deprecated R < 3.4 at SPARK-26014. I will test the first option out, and come back. BTW, would you be able to test it out with a fresh build with R 4.0.0? If the issue I faced isn't my env issue, it looks tricky to handle ... [~dongjoon] do you have an existing SparkR dev env to test with R 4.0? was (Author: hyukjin.kwon): Ah, yeah. That one I read [it in the release notes|[https://cran.r-project.org/doc/manuals/r-devel/NEWS.html]] I was freshly building and testing the package with R 4.0.1 so that was why the error messages were different ... {quote} > Packages need to be (re-)installed under this version (4.0.0) of *R*. {quote} I have two environments in my local. One is R 4.0.1, the other one is R 3.4.0. Although it officially says R 3.1+, we deprecated R < 3.4 at SPARK-26014. I will test the first option out, and come back. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. 
> However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142495#comment-17142495 ] Hyukjin Kwon commented on SPARK-31918: -- Ah, yeah. That one I read [it in the release notes|[https://cran.r-project.org/doc/manuals/r-devel/NEWS.html]] I was freshly building and testing the package with R 4.0.1 so that was why the error messages were different ... {quote} > Packages need to be (re-)installed under this version (4.0.0) of *R*. {quote} I have two environments in my local. One is R 4.0.1, the other one is R 3.4.0. Although it officially says R 3.1+, we deprecated R < 3.4 at SPARK-26014. I will test the first option out, and come back. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Yin updated SPARK-32059: -- Description: Using tables and data structures in `SchemaPruningSuite.scala` {code:java} // code placeholder case class FullName(first: String, middle: String, last: String) case class Company(name: String, address: String) case class Employer(id: Int, company: Company) case class Contact( id: Int, name: FullName, address: String, pets: Int, friends: Array[FullName] = Array.empty, relatives: Map[String, FullName] = Map.empty, employer: Employer = null, relations: Map[FullName, String] = Map.empty) case class Department( depId: Int, depName: String, contactId: Int, employer: Employer) {code} The query to run: {code:java} // code placeholder select a.name.first from (select row_number() over (partition by address order by id desc) as __rank, contacts.* from contacts) a where a.name.first = 'A' AND a.__rank = 1 {code} The current physical plan: {code:java} // code placeholder == Physical Plan == *(3) Project [name#46.first AS first#74] +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND (name#46.first = A)) AND (__rank#71 = 1)) +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank#71], [address#47], [id#45 DESC NULLS LAST] +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], false, 0 +- Exchange hashpartitioning(address#47, 5), true, [id=#52] +- *(1) Project [id#45, name#46, address#47] +- FileScan parquet [id#45,name#46,address#47,p#53] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string> {code} The desired physical plan: {code:java} // code placeholder == Physical Plan == *(3) Project [_gen_alias_77#77 AS 
first#74] +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank#71], [address#47], [id#45 DESC NULLS LAST] +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], false, 0 +- Exchange hashpartitioning(address#47, 5), true, [id=#52] +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, address#47] +- FileScan parquet [id#45,name#46,address#47,p#53] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string> {code} was: Using tables and data structures in `SchemaPruningSuite.scala` ``` case class FullName(first: String, middle: String, last: String) case class Company(name: String, address: String) case class Employer(id: Int, company: Company) case class Contact( id: Int, name: FullName, address: String, pets: Int, friends: Array[FullName] = Array.empty, relatives: Map[String, FullName] = Map.empty, employer: Employer = null, relations: Map[FullName, String] = Map.empty) case class Department( depId: Int, depName: String, contactId: Int, employer: Employer) ``` The query to run: ` select a.name.first from (select row_number() over (partition by address order by id desc) as __rank, contacts.* from contacts) a where a.name.first = 'A' AND a.__rank = 1` The current physical plan: ``` == Physical Plan == *(3) Project [name#46.first AS first#74] +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND (name#46.first = A)) AND (__rank#71 = 1)) +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank#71], [address#47], [id#45 DESC NULLS LAST] +- *(2) 
Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], false, 0 +- Exchange hashpartitioning(address#47, 5), true, [id=#52] +- *(1) Project [id#45, name#46, address#47] +- FileScan parquet [id#45,name#46,address#47,p#53] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string> ``` The desired physical plan: ``` == Physical Plan == *(3) Project [_gen_alias_77#77 AS first#74] +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS
[jira] [Created] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
Frank Yin created SPARK-32059:
---------------------------------

             Summary: Nested Schema Pruning not Working in Window Functions
                 Key: SPARK-32059
                 URL: https://issues.apache.org/jira/browse/SPARK-32059
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Frank Yin

Using tables and data structures in `SchemaPruningSuite.scala`:

```
case class FullName(first: String, middle: String, last: String)
case class Company(name: String, address: String)
case class Employer(id: Int, company: Company)
case class Contact(
  id: Int,
  name: FullName,
  address: String,
  pets: Int,
  friends: Array[FullName] = Array.empty,
  relatives: Map[String, FullName] = Map.empty,
  employer: Employer = null,
  relations: Map[FullName, String] = Map.empty)
case class Department(
  depId: Int,
  depName: String,
  contactId: Int,
  employer: Employer)
```

The query to run:

```
select a.name.first
from (select row_number() over (partition by address order by id desc) as __rank,
             contacts.*
      from contacts) a
where a.name.first = 'A' AND a.__rank = 1
```

The current physical plan:

```
== Physical Plan ==
*(3) Project [name#46.first AS first#74]
+- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND (name#46.first = A)) AND (__rank#71 = 1))
   +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank#71], [address#47], [id#45 DESC NULLS LAST]
      +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], false, 0
         +- Exchange hashpartitioning(address#47, 5), true, [id=#52]
            +- *(1) Project [id#45, name#46, address#47]
               +- FileScan parquet [id#45,name#46,address#47,p#53] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string>
```

The desired physical plan:

```
== Physical Plan ==
*(3) Project [_gen_alias_77#77 AS first#74]
+- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND (_gen_alias_77#77 = A)) AND (__rank#71 = 1))
   +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank#71], [address#47], [id#45 DESC NULLS LAST]
      +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], false, 0
         +- Exchange hashpartitioning(address#47, 5), true, [id=#52]
            +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, address#47]
               +- FileScan parquet [id#45,name#46,address#47,p#53] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string>
```


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
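The rewrite the reporter wants, projecting `name#46.first` below the Window instead of carrying the whole `name` struct through the shuffle, is the usual nested-schema-pruning idea: extract the one needed field before the wide operation. A toy, non-Spark illustration of the same principle, with hypothetical class names:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class PruningSketch {
    public static final class FullName {
        public final String first, middle, last;
        public FullName(String first, String middle, String last) {
            this.first = first; this.middle = middle; this.last = last;
        }
    }
    public static final class Contact {
        public final int id; public final FullName name; public final String address;
        public Contact(int id, FullName name, String address) {
            this.id = id; this.name = name; this.address = address;
        }
    }

    // Project the single nested field before the expensive sort, mirroring how
    // the desired plan computes name#46.first below the Window/Exchange rather
    // than shuffling the whole struct.
    public static List<String> firstNamesByIdDesc(List<Contact> contacts) {
        return contacts.stream()
            .map(c -> new SimpleEntry<>(c.id, c.name.first))   // prune early
            .sorted(Comparator.comparing(
                (SimpleEntry<Integer, String> e) -> e.getKey()).reversed())
            .map(SimpleEntry::getValue)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Contact> contacts = List.of(
            new Contact(1, new FullName("A", null, "Z"), "addr1"),
            new Contact(2, new FullName("B", null, "Y"), "addr1"));
        System.out.println(firstNamesByIdDesc(contacts)); // [B, A]
    }
}
```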
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142466#comment-17142466 ] Shivaram Venkataraman commented on SPARK-31918: --- Thanks [~hyukjin.kwon]. It looks like there is another problem. From what I saw today, R 4.0.0 cannot load packages that were built with R 3.6.0. Thus when SparkR workers try to start up with the pre-built SparkR package we see a failure. I'm not really sure what is a good way to handle this. Options include - Building the SparkR package using 4.0.0 (need to check if that works with R 3.6) - Copy the package from the driver (where it is usually built) and make the SparkR workers use the package installed on the driver Any other ideas? > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32044) [SS] 2.4 Kafka continuous processing print mislead initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-32044.
-----------------------------------
    Fix Version/s: 2.4.7
       Resolution: Fixed

Issue resolved by pull request 28887
[https://github.com/apache/spark/pull/28887]

> [SS] 2.4 Kafka continuous processing print mislead initial offsets log
> ----------------------------------------------------------------------
>
>                 Key: SPARK-32044
>                 URL: https://issues.apache.org/jira/browse/SPARK-32044
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.6
>            Reporter: Zhongwei Zhu
>            Assignee: Zhongwei Zhu
>            Priority: Trivial
>             Fix For: 2.4.7
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using Structured Streaming in continuous processing mode, a restarted Spark job correctly picks up the offsets from the last epoch in the checkpoint location, but it always prints the log line below:
>
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 34e5b909-f9fe-422a-89c0-081251a68693, runId = 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial offsets: \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
>
> This log is misleading, as Spark did not use these as the initial offsets. It also results in an unnecessary Kafka offset fetch. This is caused by the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
>     case EarliestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
>     case LatestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
>     case SpecificOffsetRangeLimit(p) =>
>       offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
> The code inside the orElse block is always executed, even when start has a value.
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32044) [SS] 2.4 Kafka continuous processing print mislead initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-32044:
-------------------------------------
    Assignee: Zhongwei Zhu

> [SS] 2.4 Kafka continuous processing print mislead initial offsets log
> ----------------------------------------------------------------------
>
>                 Key: SPARK-32044
>                 URL: https://issues.apache.org/jira/browse/SPARK-32044
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.6
>            Reporter: Zhongwei Zhu
>            Assignee: Zhongwei Zhu
>            Priority: Trivial
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using Structured Streaming in continuous processing mode, a restarted Spark job correctly picks up the offsets from the last epoch in the checkpoint location, but it always prints the log line below:
>
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 34e5b909-f9fe-422a-89c0-081251a68693, runId = 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial offsets: \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
>
> This log is misleading, as Spark did not use these as the initial offsets. It also results in an unnecessary Kafka offset fetch. This is caused by the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
>     case EarliestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
>     case LatestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
>     case SpecificOffsetRangeLimit(p) =>
>       offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
> The code inside the orElse block is always executed, even when start has a value.
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
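The root cause the reporter points at is that `java.util.Optional.orElse` evaluates its argument eagerly even when the Optional is present: the `start.orElse { ... }` block above is an ordinary expression, computed before the call. A minimal demonstration; `orElseGet` is the lazy variant one would reach for, though the actual change in PR 28887 may differ:

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class OrElseDemo {
    static final AtomicInteger fetches = new AtomicInteger();

    // Stand-in for the offset fetch (and the "Initial offsets" log) above.
    static String fetchInitialOffsets() {
        fetches.incrementAndGet();
        return "earliest";
    }

    public static void main(String[] args) {
        Optional<String> start = Optional.of("checkpointed");

        // orElse: the argument is evaluated before the call, so the fetch
        // (and the misleading log line) happens even though start is present.
        String eager = start.orElse(fetchInitialOffsets());

        // orElseGet: the supplier runs only when the Optional is empty.
        String lazy = start.orElseGet(OrElseDemo::fetchInitialOffsets);

        System.out.println(eager + ", " + lazy + ", fetches=" + fetches.get());
        // checkpointed, checkpointed, fetches=1
    }
}
```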
[jira] [Issue Comment Deleted] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-32058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32058: -- Comment: was deleted (was: User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28897) > Use Apache Hadoop 3.2.0 dependency by default > - > > Key: SPARK-32058 > URL: https://issues.apache.org/jira/browse/SPARK-32058 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-32058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32058: Assignee: Apache Spark > Use Apache Hadoop 3.2.0 dependency by default > - > > Key: SPARK-32058 > URL: https://issues.apache.org/jira/browse/SPARK-32058 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-32058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142449#comment-17142449 ] Apache Spark commented on SPARK-32058: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28897 > Use Apache Hadoop 3.2.0 dependency by default > - > > Key: SPARK-32058 > URL: https://issues.apache.org/jira/browse/SPARK-32058 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-32058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32058: Assignee: (was: Apache Spark) > Use Apache Hadoop 3.2.0 dependency by default > - > > Key: SPARK-32058 > URL: https://issues.apache.org/jira/browse/SPARK-32058 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-32058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142447#comment-17142447 ] Apache Spark commented on SPARK-32058: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28897 > Use Apache Hadoop 3.2.0 dependency by default > - > > Key: SPARK-32058 > URL: https://issues.apache.org/jira/browse/SPARK-32058 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
Dongjoon Hyun created SPARK-32058: - Summary: Use Apache Hadoop 3.2.0 dependency by default Key: SPARK-32058 URL: https://issues.apache.org/jira/browse/SPARK-32058 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.1.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation
[ https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142426#comment-17142426 ] Erik Krogen commented on SPARK-32037: - Thanks for the suggestions [~H4ml3t]! * *quarantined* to me indicates that we want to avoid something bad about this spreading to other places (e.g. quarantining some corrupt data to protect other places from consuming it and spreading the corruption), which isn't the case here. * *benched* is fun, but I think not very intuitive unless you're primed with the analogy. I also am a little concerned that it will make people think of benchmarks. "Benched? Did this node fail a benchmark?" * *exiled* is interesting, but I think unhealthy still does a better job of conveying that the node/executor/etc. is doing something wrong I'm not sure about other resource managers, but at least YARN also uses the concept of unhealthy vs. healthy to refer to nodes that are not performing well. One other thing that came to mind for me was "misbehaving", which I think is really what we are describing by "unhealthy", but I think it sounds a little less smooth. > Rename blacklisting feature to avoid language with racist connotation > - > > Key: SPARK-32037 > URL: https://issues.apache.org/jira/browse/SPARK-32037 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Minor > > As per [discussion on the Spark dev > list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], > it will be beneficial to remove references to problematic language that can > alienate potential community members. One such reference is "blacklist". > While it seems to me that there is some valid debate as to whether this term > has racist origins, the cultural connotations are inescapable in today's > world. 
> I've created a separate task, SPARK-32036, to remove references outside of > this feature. Given the large surface area of this feature and the > public-facing UI / configs / etc., more care will need to be taken here. > I'd like to start by opening up debate on what the best replacement name > would be. Reject-/deny-/ignore-/block-list are common replacements for > "blacklist", but I'm not sure that any of them work well for this situation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142417#comment-17142417 ] YoungGyu Chun edited comment on SPARK-31998 at 6/22/20, 9:16 PM: - I will be working on this when v1.0 is out was (Author: younggyuchun): I will be working on this > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142417#comment-17142417 ] YoungGyu Chun commented on SPARK-31998: --- I will be working on this > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED/CLOSED state correctly
Ali Smesseim created SPARK-32057: Summary: SparkExecuteStatementOperation does not set CANCELED/CLOSED state correctly Key: SPARK-32057 URL: https://issues.apache.org/jira/browse/SPARK-32057 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ali Smesseim https://github.com/apache/spark/pull/28671 changed the way cleanup is done in SparkExecuteStatementOperation. In cancel(), cleanup (killing jobs) used to be done after setting the state to CANCELED. Now the order is reversed: jobs are killed first, causing an exception to be thrown inside execute(), so the status of the operation becomes ERROR before being set to CANCELED. cc [~juliuszsompolski] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
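The ordering problem described above can be modeled as a toy state machine. Names and transitions below are illustrative only, not the actual SparkExecuteStatementOperation code: if jobs are killed before the state is set to CANCELED, the execute() thread observes the failure first and records ERROR.

```python
# Toy state machine for the cancel() ordering described in SPARK-32057.
# Illustrative names, not Spark's actual code.
class Operation:
    def __init__(self):
        self.history = ["RUNNING"]

    def _set_state(self, state):
        self.history.append(state)

    def _jobs_killed(self):
        # execute() turns an unexpected failure into ERROR, but only
        # while the operation still looks RUNNING.
        if self.history[-1] == "RUNNING":
            self._set_state("ERROR")

    def cancel(self, kill_jobs_first):
        if kill_jobs_first:   # the order reported as broken
            self._jobs_killed()
            self._set_state("CANCELED")
        else:                 # the original order: mark CANCELED, then clean up
            self._set_state("CANCELED")
            self._jobs_killed()
        return self.history

# The broken order passes through a spurious ERROR state first.
assert Operation().cancel(kill_jobs_first=True) == ["RUNNING", "ERROR", "CANCELED"]
assert Operation().cancel(kill_jobs_first=False) == ["RUNNING", "CANCELED"]
```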
[jira] [Commented] (SPARK-32025) CSV schema inference with boolean & integer
[ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142403#comment-17142403 ] Apache Spark commented on SPARK-32025: -- User 'planga82' has created a pull request for this issue: https://github.com/apache/spark/pull/28896 > CSV schema inference with boolean & integer > > > Key: SPARK-32025 > URL: https://issues.apache.org/jira/browse/SPARK-32025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Brian Wallace >Priority: Major > > I have a dataset consisting of two small files in CSV format. > {code:bash} > $ cat /example/f0.csv > col1 > 8589934592 > $ cat /example/f1.csv > col1 > 4320 > true > {code} > > When I try and load this in (py)spark and infer schema, my expectation is > that the column is inferred to be a string. However, it is inferred as a > boolean: > {code:python} > spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, > multiLine=True).show() > ++ > |col1| > ++ > |null| > |true| > |null| > ++ > {code} > Note that this seems to work correctly if multiLine is set to False (although > we need to set it to True as this column may indeed span multiple lines in > general). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32025) CSV schema inference with boolean & integer
[ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32025: Assignee: Apache Spark > CSV schema inference with boolean & integer > > > Key: SPARK-32025 > URL: https://issues.apache.org/jira/browse/SPARK-32025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Brian Wallace >Assignee: Apache Spark >Priority: Major > > I have a dataset consisting of two small files in CSV format. > {code:bash} > $ cat /example/f0.csv > col1 > 8589934592 > $ cat /example/f1.csv > col1 > 4320 > true > {code} > > When I try and load this in (py)spark and infer schema, my expectation is > that the column is inferred to be a string. However, it is inferred as a > boolean: > {code:python} > spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, > multiLine=True).show() > ++ > |col1| > ++ > |null| > |true| > |null| > ++ > {code} > Note that this seems to work correctly if multiLine is set to False (although > we need to set it to True as this column may indeed span multiple lines in > general). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32025) CSV schema inference with boolean & integer
[ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32025: Assignee: (was: Apache Spark) > CSV schema inference with boolean & integer > > > Key: SPARK-32025 > URL: https://issues.apache.org/jira/browse/SPARK-32025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Brian Wallace >Priority: Major > > I have a dataset consisting of two small files in CSV format. > {code:bash} > $ cat /example/f0.csv > col1 > 8589934592 > $ cat /example/f1.csv > col1 > 4320 > true > {code} > > When I try and load this in (py)spark and infer schema, my expectation is > that the column is inferred to be a string. However, it is inferred as a > boolean: > {code:python} > spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, > multiLine=True).show() > ++ > |col1| > ++ > |null| > |true| > |null| > ++ > {code} > Note that this seems to work correctly if multiLine is set to False (although > we need to set it to True as this column may indeed span multiple lines in > general). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32025) CSV schema inference with boolean & integer
[ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142402#comment-17142402 ] Apache Spark commented on SPARK-32025: -- User 'planga82' has created a pull request for this issue: https://github.com/apache/spark/pull/28896 > CSV schema inference with boolean & integer > > > Key: SPARK-32025 > URL: https://issues.apache.org/jira/browse/SPARK-32025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Brian Wallace >Priority: Major > > I have a dataset consisting of two small files in CSV format. > {code:bash} > $ cat /example/f0.csv > col1 > 8589934592 > $ cat /example/f1.csv > col1 > 4320 > true > {code} > > When I try and load this in (py)spark and infer schema, my expectation is > that the column is inferred to be a string. However, it is inferred as a > boolean: > {code:python} > spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, > multiLine=True).show() > ++ > |col1| > ++ > |null| > |true| > |null| > ++ > {code} > Note that this seems to work correctly if multiLine is set to False (although > we need to set it to True as this column may indeed span multiple lines in > general). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
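Spark's actual CSV inference is more involved (and the reported bug appears specific to multiLine parsing), but the widening rule the reporter expects can be sketched in plain Python with assumed helper names: per-value type inference followed by a reduce that falls back to string when types disagree.

```python
from functools import reduce

# Toy per-value inference, not Spark's code: incompatible inferred types
# should widen the column to string rather than stay boolean.
def infer_value_type(v: str) -> str:
    if v.lower() in ("true", "false"):
        return "boolean"
    try:
        int(v)
        return "integer"
    except ValueError:
        return "string"

def widen(t1: str, t2: str) -> str:
    # Identical types are kept; anything incompatible falls back to string.
    return t1 if t1 == t2 else "string"

values = ["8589934592", "4320", "true"]  # the values across the two files
assert reduce(widen, map(infer_value_type, values)) == "string"
```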
[jira] [Commented] (SPARK-32050) GBTClassifier not working with OnevsRest
[ https://issues.apache.org/jira/browse/SPARK-32050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142387#comment-17142387 ] L. C. Hsieh commented on SPARK-32050: - I think this was fixed at SPARK-27007. > GBTClassifier not working with OnevsRest > > > Key: SPARK-32050 > URL: https://issues.apache.org/jira/browse/SPARK-32050 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 > Environment: spark 2.4.0 >Reporter: Raghuvarran V H >Priority: Minor > > I am trying to use GBT classifier for multi class classification using > OnevsRest > > {code:java} > from pyspark.ml.classification import > MultilayerPerceptronClassifier,OneVsRest,GBTClassifier > from pyspark.ml import Pipeline,PipelineModel > lr = GBTClassifier(featuresCol='features', labelCol='label', > predictionCol='prediction', maxDepth=5, > > maxBins=32,minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, > cacheNodeIds=False,checkpointInterval=10, lossType='logistic', > maxIter=20,stepSize=0.1, seed=None,subsamplingRate=1.0, > featureSubsetStrategy='auto') > classifier = OneVsRest(featuresCol='features', labelCol='label', > predictionCol='prediction', classifier=lr, weightCol=None,parallelism=1) > pipeline = Pipeline(stages=[str_indxr,ohe,vecAssembler,normalizer,classifier]) > model = pipeline.fit(train_data) > {code} > > > When I try this I get this error: > /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/ml/classification.py > in _fit(self, dataset) > 1800 classifier = self.getClassifier() > 1801 assert isinstance(classifier, HasRawPredictionCol),\ > -> 1802 "Classifier %s doesn't extend from HasRawPredictionCol." % > type(classifier) > 1803 > 1804 numClasses = int(dataset.agg(\{labelCol: > "max"}).head()["max("+labelCol+")"]) + 1 > AssertionError: Classifier > doesn't extend from HasRawPredictionCol. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
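The failing assertion above is a capability check: OneVsRest requires the wrapped classifier to expose a raw-prediction column, which SPARK-27007 added to GBTClassifier. A minimal sketch with stand-in classes (not pyspark's actual classes) shows the shape of that check:

```python
# Stand-in classes modeling the OneVsRest capability check from the report.
# These are NOT pyspark's classes; they only mirror the isinstance pattern.
class HasRawPredictionCol:
    pass

class LogisticRegressionStandIn(HasRawPredictionCol):
    pass

class OldGBTClassifierStandIn:  # pre-SPARK-27007: no raw-prediction support
    pass

def validate(classifier):
    assert isinstance(classifier, HasRawPredictionCol), \
        "Classifier %s doesn't extend from HasRawPredictionCol." % type(classifier)

validate(LogisticRegressionStandIn())    # accepted
try:
    validate(OldGBTClassifierStandIn())  # rejected, as in the reported error
    raised = False
except AssertionError:
    raised = True
assert raised
```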
[jira] [Updated] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-32056: -- Priority: Minor (was: Major) > Repartition by key should support partition coalesce for AQE > > > Key: SPARK-32056 > URL: https://issues.apache.org/jira/browse/SPARK-32056 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark release 3.0.0 >Reporter: koert kuipers >Priority: Minor > > when adaptive query execution is enabled the following expression should > support coalescing of partitions: > {code:java} > dataframe.repartition(col("somecolumn")) {code} > currently it does not because it simply calls the repartition implementation > where number of partitions is specified: > {code:java} > def repartition(partitionExprs: Column*): Dataset[T] = { > repartition(sparkSession.sessionState.conf.numShufflePartitions, > partitionExprs: _*) > }{code} > and repartition with the number of partitions specified does not allow for > coalescing of partitions (since this breaks the user's expectation that it > will have the number of partitions specified). 
> for more context see the discussion here: > [https://github.com/apache/spark/pull/27986] > a simple test to confirm that repartition by key does not support coalescing > of partitions can be added in AdaptiveQueryExecSuite like this (it currently > fails): > {code:java} > test("SPARK-32056 repartition has less partitions for small data when > adaptiveExecutionEnabled") { > Seq(true, false).foreach { enableAQE => > withSQLConf( > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> enableAQE.toString, > SQLConf.SHUFFLE_PARTITIONS.key -> "50", > SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "50", > SQLConf.SHUFFLE_PARTITIONS.key -> "50") { > val partitionsNum = (1 to 10).toDF.repartition($"value") > .rdd.collectPartitions().length > if (enableAQE) { > assert(partitionsNum < 50) > } else { > assert(partitionsNum === 50) > } > } > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-32056: -- Description: when adaptive query execution is enabled the following expression should support coalescing of partitions: {code:java} dataframe.repartition(col("somecolumn")) {code} currently it does not because it simply calls the repartition implementation where number of partitions is specified: {code:java} def repartition(partitionExprs: Column*): Dataset[T] = { repartition(sparkSession.sessionState.conf.numShufflePartitions, partitionExprs: _*) }{code} and repartition with the number of partitions specified does now allow for coalescing of partitions (since this breaks the user's expectation that it will have the number of partitions specified). for more context see the discussion here: [https://github.com/apache/spark/pull/27986] a simple test to confirm that repartition by key does not support coalescing of partitions can be added in AdaptiveQueryExecSuite like this (it currently fails): {code:java} test("SPARK-32056 repartition has less partitions for small data when adaptiveExecutionEnabled") { Seq(true, false).foreach { enableAQE => withSQLConf( SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> enableAQE.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "50", SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "50", SQLConf.SHUFFLE_PARTITIONS.key -> "50") { val partitionsNum = (1 to 10).toDF.repartition($"value") .rdd.collectPartitions().length if (enableAQE) { assert(partitionsNum < 50) } else { assert(partitionsNum === 50) } } } } {code} was: when adaptive query execution is enabled the following expression should support coalescing of partitions: {code:java} dataframe.repartition(col("somecolumn")) {code} currently it does not because it simply calls the repartition implementation where number of partitions is specified: {code:java} def repartition(partitionExprs: Column*): Dataset[T] = { 
repartition(sparkSession.sessionState.conf.numShufflePartitions, partitionExprs: _*) }{code} and repartition with the number of partitions specified does now allow for coalescing of partitions (since this breaks the user's expectation that it will have the number of partitions specified). for more context see the discussion here: [https://github.com/apache/spark/pull/27986] a simple test to confirm that repartition by key does not support coalescing of partitions can be added in AdaptiveQueryExecSuite like this (it currently fails): {code:java} test("SPARK-? repartition has less partitions for small data when adaptiveExecutionEnabled") { Seq(true, false).foreach { enableAQE => withSQLConf( SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> enableAQE.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "50", SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "50", SQLConf.SHUFFLE_PARTITIONS.key -> "50") { val partitionsNum = (1 to 10).toDF.repartition($"value") .rdd.collectPartitions().length if (enableAQE) { assert(partitionsNum < 50) } else { assert(partitionsNum === 50) } } } } {code} > Repartition by key should support partition coalesce for AQE > > > Key: SPARK-32056 > URL: https://issues.apache.org/jira/browse/SPARK-32056 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark release 3.0.0 >Reporter: koert kuipers >Priority: Major > > when adaptive query execution is enabled the following expression should > support coalescing of partitions: > {code:java} > dataframe.repartition(col("somecolumn")) {code} > currently it does not because it simply calls the repartition implementation > where number of partitions is specified: > {code:java} > def repartition(partitionExprs: Column*): Dataset[T] = { > repartition(sparkSession.sessionState.conf.numShufflePartitions, > partitionExprs: _*) > }{code} > and repartition with the number of partitions specified does now allow for > coalescing of partitions (since this breaks the user's 
expectation that it > will have the number of partitions specified). > for more context see the discussion here: > [https://github.com/apache/spark/pull/27986] > a simple test to confirm that repartition by key does not support coalescing > of partitions can be added in AdaptiveQueryExecSuite like this (it currently > fails): > {code:java} > test("SPARK-32056 repartition has less partitions for small data when > adaptiveExecutionEnabled") { > Seq(true, false).foreach { enableAQE => > withSQLConf(
[jira] [Created] (SPARK-32056) Repartition by key should support partition coalesce for AQE
koert kuipers created SPARK-32056: - Summary: Repartition by key should support partition coalesce for AQE Key: SPARK-32056 URL: https://issues.apache.org/jira/browse/SPARK-32056 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: spark release 3.0.0 Reporter: koert kuipers when adaptive query execution is enabled the following expression should support coalescing of partitions: {code:java} dataframe.repartition(col("somecolumn")) {code} currently it does not because it simply calls the repartition implementation where number of partitions is specified: {code:java} def repartition(partitionExprs: Column*): Dataset[T] = { repartition(sparkSession.sessionState.conf.numShufflePartitions, partitionExprs: _*) }{code} and repartition with the number of partitions specified does not allow for coalescing of partitions (since this breaks the user's expectation that it will have the number of partitions specified). for more context see the discussion here: [https://github.com/apache/spark/pull/27986] a simple test to confirm that repartition by key does not support coalescing of partitions can be added in AdaptiveQueryExecSuite like this (it currently fails): {code:java} test("SPARK-? repartition has less partitions for small data when adaptiveExecutionEnabled") { Seq(true, false).foreach { enableAQE => withSQLConf( SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> enableAQE.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "50", SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "50", SQLConf.SHUFFLE_PARTITIONS.key -> "50") { val partitionsNum = (1 to 10).toDF.repartition($"value") .rdd.collectPartitions().length if (enableAQE) { assert(partitionsNum < 50) } else { assert(partitionsNum === 50) } } } } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
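The distinction the report draws can be summarized in a toy model (illustrative only, not Spark's planner): forwarding `repartition(col)` to the numPartitions overload makes the count look user-specified, which pins it and disables AQE coalescing; the requested behavior is to leave the count open when the user never specified one.

```python
# Toy model of the requested AQE behavior for repartition-by-key.
# Names and the returned dict are illustrative, not Spark internals.
def plan_shuffle(num_partitions=None, default_shuffle_partitions=50):
    user_specified = num_partitions is not None
    return {
        "partitions": num_partitions if user_specified else default_shuffle_partitions,
        # Requested behavior: AQE may coalesce only when the user did not
        # pin an explicit partition count.
        "aqe_may_coalesce": not user_specified,
    }

assert plan_shuffle() == {"partitions": 50, "aqe_may_coalesce": True}
assert plan_shuffle(200) == {"partitions": 200, "aqe_may_coalesce": False}
```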
[jira] [Updated] (SPARK-32044) [SS] 2.4 Kafka continuous processing print mislead initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-32044: - Summary: [SS] 2.4 Kafka continuous processing print mislead initial offsets log (was: [SS] 2.4 Kakfa continuous processing print mislead initial offsets log ) > [SS] 2.4 Kafka continuous processing print mislead initial offsets log > --- > > Key: SPARK-32044 > URL: https://issues.apache.org/jira/browse/SPARK-32044 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.6 >Reporter: Zhongwei Zhu >Priority: Trivial > Original Estimate: 24h > Remaining Estimate: 24h > > When using structured streaming in continuous processing mode, after restart > spark job, spark job can correctly pick up offsets in checkpoint location > from last epoch. But it always print out below log: > 20/06/12 00:58:09 INFO [stream execution thread for [id = > 34e5b909-f9fe-422a-89c0-081251a68693, runId = > 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: > Initial offsets: > \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}} > This log is misleading as spark didn't use this one as initial offsets. Also, > it results in unnecessary kafka offset fetch. This is caused by below code in > KafkaContinuousReader > {code:java} > offset = start.orElse { > val offsets = initialOffsets match { > case EarliestOffsetRangeLimit => > KafkaSourceOffset(offsetReader.fetchEarliestOffsets()) > case LatestOffsetRangeLimit => > KafkaSourceOffset(offsetReader.fetchLatestOffsets(None)) > case SpecificOffsetRangeLimit(p) => > offsetReader.fetchSpecificOffsets(p, reportDataLoss) > } > logInfo(s"Initial offsets: $offsets") > offsets > } > {code} > The code inside orElse block is always executed even when start has value. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
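A plausible mechanism for "the code inside orElse block is always executed": assuming `start` in the Scala snippet is a `java.util.Optional`, its `orElse` argument is evaluated eagerly by the caller, unlike the supplier passed to `orElseGet`. The same eager-vs-lazy distinction sketched in plain Python:

```python
# Eager vs. lazy fallback evaluation, mirroring Optional.orElse vs.
# Optional.orElseGet. Function names and offsets are illustrative.
fetch_calls = 0

def fetch_initial_offsets():
    global fetch_calls
    fetch_calls += 1          # stands in for the offset fetch and log line
    return {"kafka_topic": {"0": 0}}

def or_else(value, fallback):      # eager: fallback computed before the call
    return value if value is not None else fallback

def or_else_get(value, supplier):  # lazy: supplier runs only when needed
    return value if value is not None else supplier()

start = {"kafka_topic": {"0": 51629106}}  # offsets recovered from checkpoint

assert or_else(start, fetch_initial_offsets()) is start
assert fetch_calls == 1   # fetched (and logged) despite start being present

assert or_else_get(start, fetch_initial_offsets) is start
assert fetch_calls == 1   # no extra fetch with the lazy form
```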
[jira] [Updated] (SPARK-32044) [SS] 2.4 Kakfa continuous processing print mislead initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-32044: - Summary: [SS] 2.4 Kakfa continuous processing print mislead initial offsets log (was: [SS] Kakfa continuous processing print mislead initial offsets log ) > [SS] 2.4 Kakfa continuous processing print mislead initial offsets log > --- > > Key: SPARK-32044 > URL: https://issues.apache.org/jira/browse/SPARK-32044 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.6 >Reporter: Zhongwei Zhu >Priority: Trivial > Original Estimate: 24h > Remaining Estimate: 24h > > When using structured streaming in continuous processing mode, after restart > spark job, spark job can correctly pick up offsets in checkpoint location > from last epoch. But it always print out below log: > 20/06/12 00:58:09 INFO [stream execution thread for [id = > 34e5b909-f9fe-422a-89c0-081251a68693, runId = > 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: > Initial offsets: > \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}} > This log is misleading as spark didn't use this one as initial offsets. Also, > it results in unnecessary kafka offset fetch. This is caused by below code in > KafkaContinuousReader > {code:java} > offset = start.orElse { > val offsets = initialOffsets match { > case EarliestOffsetRangeLimit => > KafkaSourceOffset(offsetReader.fetchEarliestOffsets()) > case LatestOffsetRangeLimit => > KafkaSourceOffset(offsetReader.fetchLatestOffsets(None)) > case SpecificOffsetRangeLimit(p) => > offsetReader.fetchSpecificOffsets(p, reportDataLoss) > } > logInfo(s"Initial offsets: $offsets") > offsets > } > {code} > The code inside orElse block is always executed even when start has value. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32055) Unify getReader and getReaderForRange in ShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-32055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142276#comment-17142276 ] Apache Spark commented on SPARK-32055: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28895 > Unify getReader and getReaderForRange in ShuffleManager > --- > > Key: SPARK-32055 > URL: https://issues.apache.org/jira/browse/SPARK-32055 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Unify getReader and getReaderForRange in ShuffleManager in order to simplify > the implementation and ease the code maintenance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32055) Unify getReader and getReaderForRange in ShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-32055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32055: Assignee: (was: Apache Spark) > Unify getReader and getReaderForRange in ShuffleManager > --- > > Key: SPARK-32055 > URL: https://issues.apache.org/jira/browse/SPARK-32055 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Unify getReader and getReaderForRange in ShuffleManager in order to simplify > the implementation and ease the code maintenance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32055) Unify getReader and getReaderForRange in ShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-32055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32055: Assignee: Apache Spark > Unify getReader and getReaderForRange in ShuffleManager > --- > > Key: SPARK-32055 > URL: https://issues.apache.org/jira/browse/SPARK-32055 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > Unify getReader and getReaderForRange in ShuffleManager in order to simplify > the implementation and ease the code maintenance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32038) Regression in handling NaN values in COUNT(DISTINCT)
[ https://issues.apache.org/jira/browse/SPARK-32038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142265#comment-17142265 ] Mithun Radhakrishnan commented on SPARK-32038: -- [~viirya], [~dongjoon], thank you. I'm amazed at the quick resolution of this bug. > Regression in handling NaN values in COUNT(DISTINCT) > > > Key: SPARK-32038 > URL: https://issues.apache.org/jira/browse/SPARK-32038 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mithun Radhakrishnan >Assignee: L. C. Hsieh >Priority: Blocker > Labels: correctness > Fix For: 3.0.1, 3.1.0 > > > There seems to be a regression in Spark 3.0.0, with regard to how {{NaN}} > values are normalized/handled in {{COUNT(DISTINCT ...)}}. Here is an > illustration: > {code:scala} > case class Test( uid:String, score:Float) > val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f81) > val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fff) > val rows = Seq( > Test("mithunr", Float.NaN), > Test("mithunr", POS_NAN_1), > Test("mithunr", POS_NAN_2), > Test("abellina", 1.0f), > Test("abellina", 2.0f) > ).toDF.createOrReplaceTempView("mytable") > spark.sql(" select uid, count(distinct score) from mytable group by 1 order > by 1 asc ").show > {code} > Here are the results under Spark 3.0.0: > {code:java|title=Spark 3.0.0 (single aggregation)} > ++-+ > | uid|count(DISTINCT score)| > ++-+ > |abellina|2| > | mithunr|3| > ++-+ > {code} > Note that the count against {{mithunr}} is {{3}}, accounting for each > distinct value for {{NaN}}. 
> The right results are returned when another aggregation is added to the GBY: > {code:scala|title=Spark 3.0.0 (multiple aggregations)} > scala> spark.sql(" select uid, count(distinct score), max(score) from mytable > group by 1 order by 1 asc ").show > ++-+--+ > | uid|count(DISTINCT score)|max(score)| > ++-+--+ > |abellina|2| 2.0| > | mithunr|1| NaN| > ++-+--+ > {code} > Also, note that Spark 2.4.6 normalizes the {{DISTINCT}} expression correctly: > {code:scala|title=Spark 2.4.6} > scala> spark.sql(" select uid, count(distinct score) from mytable group by 1 > order by 1 asc ").show > ++-+ > | uid|count(DISTINCT score)| > ++-+ > |abellina|2| > | mithunr|1| > ++-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
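Why distinct NaN bit patterns inflate the count can be shown without Spark: the three inputs are all NaN yet differ bitwise, so a distinct keyed on raw bits sees several values, while normalizing every NaN to one canonical pattern restores a count of 1. The payloads below are illustrative, not the (truncated) constants from the report.

```python
import math
import struct

# Sketch (not Spark's code) of bitwise-distinct NaNs vs. normalized NaNs.
def f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

bit_patterns = [0x7FC00000, 0x7F800001]   # two different NaN encodings
assert all(math.isnan(f32(b)) for b in bit_patterns)

# Keyed on raw bits, the two NaNs count as two distinct values.
assert len(set(bit_patterns)) == 2

# Normalizing every NaN to one canonical pattern yields a single value.
CANONICAL_NAN = 0x7FC00000
normalized = {CANONICAL_NAN if math.isnan(f32(b)) else b for b in bit_patterns}
assert len(normalized) == 1
```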
[jira] [Created] (SPARK-32055) Unify getReader and getReaderForRange in ShuffleManager
wuyi created SPARK-32055: Summary: Unify getReader and getReaderForRange in ShuffleManager Key: SPARK-32055 URL: https://issues.apache.org/jira/browse/SPARK-32055 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: wuyi Unify getReader and getReaderForRange in ShuffleManager to simplify the implementation and ease code maintenance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have
[ https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142234#comment-17142234 ] Gabor Somogyi commented on SPARK-31995: --- At first glance I don't think this is a Spark issue but an HDFS one. Adding "sleep_for_a_while" on the Spark side would just hide the original problem. Please search for "Unable to close file because the last block does not have enough number of replicas"; there are a couple of hits suggesting possible workarounds. I've taken a look at the JIRAs on the Hadoop side and, as far as I've seen, this has been resolved in 2.7.4+. Could you reproduce the issue with 3.0? > Spark Structured Streaming checkpointFileManager ERROR when > HDFS.DFSOutputStream.completeFile with IOException unable to close file > because the last block does not have enough number of replicas > - > > Key: SPARK-31995 > URL: https://issues.apache.org/jira/browse/SPARK-31995 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop > Hadoop 2.7.3 - YARN cluster > delta-core_ 2.11:0.6.1 > >Reporter: Jim Huang >Priority: Major > > I am using Spark 2.4.5's Spark Structured Streaming with a Delta table (0.6.1) > as the sink, running in a YARN cluster on Hadoop 2.7.3. I had been using > Spark Structured Streaming for several months in this runtime environment > until this new corner case left my Spark Structured Streaming job in a > partially working state. > > I have included the ERROR message and stack trace. I did a quick search > using the string "MicroBatchExecution: Query terminated with error" but did > not find any existing Jira that looks like my stack trace. > > Based on a naive look at this error message and stack trace, is it possible > that Spark's CheckpointFileManager could handle this HDFS exception better > by simply waiting a little longer for HDFS's pipeline to complete the > replicas? 
> > Being new to this code, where can I find the configuration parameter that > sets the replica counts for the `streaming.HDFSMetadataLog`? I am just > trying to understand if there are already some holistic configuration tuning > variable(s) the current code provides to handle this IOException > more gracefully? Hopefully experts can provide some pointers or directions. > > {code:java} > 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = > yarn-job-id-redacted, runId = run-id-redacted] terminated with error > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. > at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511) > at > org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472) > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) > at > org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557) > at >
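Regarding the workaround question above: the search hits Gabor mentions typically point at the HDFS client's retry setting for exactly this close-time replica check, not at anything in Spark's `HDFSMetadataLog`. A hedged sketch of the commonly suggested hdfs-site.xml tuning (the property name is a real HDFS client setting; the value shown is illustrative, so verify it against your Hadoop version's hdfs-default.xml):

```xml
<!-- hdfs-site.xml (client side): retry longer before giving up with
     "Unable to close file because the last block does not have enough
     number of replicas". The default retry count is 5. -->
<property>
  <name>dfs.client.block.write.locateFollowingBlock.retries</name>
  <value>10</value>
</property>
```

Note that bumping retries only papers over a slow write pipeline; upgrading the HDFS cluster past 2.7.4, as suggested in the comment, addresses the underlying issue.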
[jira] [Created] (SPARK-32054) Flaky test: org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.Fallback Parquet V2 to V1
Gabor Somogyi created SPARK-32054: - Summary: Flaky test: org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.Fallback Parquet V2 to V1 Key: SPARK-32054 URL: https://issues.apache.org/jira/browse/SPARK-32054 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.1.0 Reporter: Gabor Somogyi https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124364/testReport/org.apache.spark.sql.connector/FileDataSourceV2FallBackSuite/Fallback_Parquet_V2_to_V1/ {code:java} Error Message org.scalatest.exceptions.TestFailedException: ArrayBuffer((collect,Relation[id#387495L] parquet ), (save,InsertIntoHadoopFsRelationCommand file:/home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776, false, Parquet, Map(path -> /home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776), ErrorIfExists, [id] +- Range (0, 10, step=1, splits=Some(2)) )) had length 2 instead of expected length 1 Stacktrace sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: ArrayBuffer((collect,Relation[id#387495L] parquet ), (save,InsertIntoHadoopFsRelationCommand file:/home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776, false, Parquet, Map(path -> /home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776), ErrorIfExists, [id] +- Range (0, 10, step=1, splits=Some(2)) )) had length 2 instead of expected length 1 at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$22(FileDataSourceV2FallBackSuite.scala:180) at 
org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$22$adapted(FileDataSourceV2FallBackSuite.scala:176) at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69) at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66) at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:34) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$21(FileDataSourceV2FallBackSuite.scala:176) at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(FileDataSourceV2FallBackSuite.scala:85) at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246) at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.withSQLConf(FileDataSourceV2FallBackSuite.scala:85) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$20(FileDataSourceV2FallBackSuite.scala:158) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$20$adapted(FileDataSourceV2FallBackSuite.scala:157) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$19(FileDataSourceV2FallBackSuite.scala:157) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at 
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157) at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59) at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) at
[jira] [Updated] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31918: - Priority: Blocker (was: Major) > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29773) Unable to process empty ORC files in Hive Table using Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ermakov resolved SPARK-29773. --- Fix Version/s: 2.4.4 Resolution: Fixed > Unable to process empty ORC files in Hive Table using Spark SQL > --- > > Key: SPARK-29773 > URL: https://issues.apache.org/jira/browse/SPARK-29773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: Centos 7, Spark 2.3.1, Hive 2.3.0 >Reporter: Alexander Ermakov >Priority: Major > Fix For: 2.4.4 > > > Unable to process empty ORC files in a Hive table using Spark SQL. It seems > to be a problem with the class > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits() > Stack trace: > {code:java} > 19/10/30 22:29:54 ERROR SparkSQLDriver: Failed in [select distinct > _tech_load_dt from dl_raw.tpaccsieee_ut_data_address] > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange hashpartitioning(_tech_load_dt#1374, 200) > +- *(1) HashAggregate(keys=[_tech_load_dt#1374], functions=[], > output=[_tech_load_dt#1374]) >+- HiveTableScan [_tech_load_dt#1374], HiveTableRelation > `dl_raw`.`tpaccsieee_ut_data_address`, > org.apache.hadoop.hive.ql.io.orc.OrcSerde, [address#1307, address_9zp#1308, > address_adm#1309, address_md#1310, adress_doc#1311, building#1312, > change_date_addr_el#1313, change_date_okato#1314, change_date_окато#1315, > city#1316, city_id#1317, cnv_cont_id#1318, code_intercity#1319, > code_kladr#1320, code_plan1#1321, date_act#1322, date_change#1323, > date_prz_incorrect_code_kladr#1324, date_record#1325, district#1326, > district_id#1327, etaj#1328, e_plan#1329, fax#1330, ... 
44 more fields] > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:324) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:122) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364) > at > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
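A commonly suggested mitigation for empty ORC splits on the Hive reader path is to route the scan through Spark's native ORC reader instead of org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, which tolerates zero-byte files. This is hedged advice: the configuration keys below are real Spark SQL settings, but whether they resolve this particular table's failure depends on the table layout and Spark version.

```sql
-- Read Hive ORC tables through Spark's native ORC reader
-- (settings shown are the commonly suggested values).
SET spark.sql.hive.convertMetastoreOrc=true;
SET spark.sql.orc.impl=native;
SELECT DISTINCT _tech_load_dt FROM dl_raw.tpaccsieee_ut_data_address;
```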
[jira] [Commented] (SPARK-29773) Unable to process empty ORC files in Hive Table using Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142177#comment-17142177 ] Alexander Ermakov commented on SPARK-29773: --- This issue has been resolved for Spark 2.4.4 > Unable to process empty ORC files in Hive Table using Spark SQL > --- > > Key: SPARK-29773 > URL: https://issues.apache.org/jira/browse/SPARK-29773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: Centos 7, Spark 2.3.1, Hive 2.3.0 >Reporter: Alexander Ermakov >Priority: Major >
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142127#comment-17142127 ] Hyukjin Kwon commented on SPARK-31918: -- Just to share what I investigated: it seems the problem relates to {{processClosure}} via {{cleanClosure}} in SparkR. From my observation, there looks to be a problem [when the new environment is set to a function|https://github.com/apache/spark/blob/master/R/pkg/R/utils.R#L601], especially one that includes generic S4 functions. So, for example, if you skip it with the fix below, the CRAN check passes with the current master branch in my local environment: {code:java} diff --git a/R/pkg/R/utils.R b/R/pkg/R/utils.R index 65db9c21d9d..60cad588f5e 100644 --- a/R/pkg/R/utils.R +++ b/R/pkg/R/utils.R @@ -529,7 +529,9 @@ processClosure <- function(node, oldEnv, defVars, checkedFuncs, newEnv) { # Namespaces other than "SparkR" will not be searched. if (!isNamespace(func.env) || (getNamespaceName(func.env) == "SparkR" && - !(nodeChar %in% getNamespaceExports("SparkR")))) { + !(nodeChar %in% getNamespaceExports("SparkR")) && + # Skip all generics under SparkR - R 4.0.0 looks to have an issue. + !isGeneric(nodeChar, func.env))) { {code} {code:java} * checking re-building of vignette outputs ... OK {code} For a minimal reproducer, with this diff: {code:java} diff --git a/R/pkg/R/RDD.R b/R/pkg/R/RDD.R index 7a1d157bb8a..89250c37319 100644 --- a/R/pkg/R/RDD.R +++ b/R/pkg/R/RDD.R @@ -487,6 +487,7 @@ setMethod("lapply", func <- function(partIndex, part) { lapply(part, FUN) } +print(SparkR:::cleanClosure(func)(1, 2)) lapplyPartitionsWithIndex(X, func) }) {code} run: {code:java} createDataFrame(lapply(seq(100), function (e) list(value=e))) {code} When {{lapply}} is called against the RDD at {{createDataFrame}}, the cleaned closure's environment has SparkR's lapply as an S4 method, and it leads to an error such as {{attempt to bind a variable to R_UnboundValue}}. 
Hopefully this is the cause of the issue happening here, and not an issue in my env. cc [~felixcheung], [~dongjoon] FYI. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Major > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142073#comment-17142073 ] Hyukjin Kwon edited comment on SPARK-31918 at 6/22/20, 2:10 PM: It affects Spark 3.0 too, and seems failing with a different message in my local: {code} * creating vignettes ... ERROR --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown Warning in engine$weave(file, quiet = quiet, encoding = enc) : Pandoc (>= 1.12.3) and/or pandoc-citeproc not available. Falling back to R Markdown v1. Attaching package: 'SparkR' The following objects are masked from 'package:stats': cov, filter, lag, na.omit, predict, sd, var, window The following objects are masked from 'package:base': as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind, sample, startsWith, subset, summary, transform, union Picked up _JAVA_OPTIONS: -XX:-UsePerfData Picked up _JAVA_OPTIONS: -XX:-UsePerfData 20/06/22 15:07:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). [Stage 0:> (0 + 1) / 1] 20/06/22 15:07:43 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: R unexpectedly exited. R worker produced errors: Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue {code} Assuming the errors from R execution itself, the root cause might be same. was (Author: hyukjin.kwon): It affects Spark 3.0 too, and seems failing with a different message in my local: {code} * creating vignettes ... ERROR --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown Warning in engine$weave(file, quiet = quiet, encoding = enc) : Pandoc (>= 1.12.3) and/or pandoc-citeproc not available. Falling back to R Markdown v1. 
Attaching package: 'SparkR' The following objects are masked from 'package:stats': cov, filter, lag, na.omit, predict, sd, var, window The following objects are masked from 'package:base': as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind, sample, startsWith, subset, summary, transform, union Picked up _JAVA_OPTIONS: -XX:-UsePerfData Picked up _JAVA_OPTIONS: -XX:-UsePerfData 20/06/22 15:07:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). [Stage 0:> (0 + 1) / 1] 20/06/22 15:07:43 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: R unexpectedly exited. {code} Assuming the errors from R execution itself, the root cause might be same. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Major > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142073#comment-17142073 ] Hyukjin Kwon commented on SPARK-31918: -- It affects Spark 3.0 too, and seems to be failing with a different message in my local environment: {code} * creating vignettes ... ERROR --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown Warning in engine$weave(file, quiet = quiet, encoding = enc) : Pandoc (>= 1.12.3) and/or pandoc-citeproc not available. Falling back to R Markdown v1. Attaching package: 'SparkR' The following objects are masked from 'package:stats': cov, filter, lag, na.omit, predict, sd, var, window The following objects are masked from 'package:base': as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind, sample, startsWith, subset, summary, transform, union Picked up _JAVA_OPTIONS: -XX:-UsePerfData Picked up _JAVA_OPTIONS: -XX:-UsePerfData 20/06/22 15:07:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). [Stage 0:> (0 + 1) / 1] 20/06/22 15:07:43 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: R unexpectedly exited. {code} Assuming the errors come from the R execution itself, the root cause might be the same. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Major > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. 
> However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31918: - Affects Version/s: 3.0.0 > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Major > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org