[jira] [Created] (SPARK-32068) Spark 3 UI task launch time shown in wrong time zone
Smith Cruise created SPARK-32068: Summary: Spark 3 UI task launch time shown in wrong time zone Key: SPARK-32068 URL: https://issues.apache.org/jira/browse/SPARK-32068 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.0 Reporter: Smith Cruise !image-2020-06-23-13-53-24-417.png|width=2965,height=603! Here the time is shown correctly (in UTC), but if I enter the stage to see the task list: !image-2020-06-23-13-55-29-991.png! the task launch time is different from before (no longer in UTC). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
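The reported mismatch is consistent with one UI page formatting the task launch timestamp in UTC while another formats the same epoch value in a different zone. A minimal Java sketch of that failure mode (the epoch value, zone names, and class are illustrative, not taken from Spark's UI code):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class LaunchTimeZoneDemo {
    // Render the same epoch-millis launch time in a given zone, the way a
    // UI page might before printing it in a table cell.
    static String render(long epochMillis, ZoneId zone) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .format(Instant.ofEpochMilli(epochMillis).atZone(zone));
    }

    public static void main(String[] args) {
        long launchTime = 1592884800000L; // 2020-06-23 04:00:00 UTC
        // The same instant renders differently depending on the zone used:
        System.out.println(render(launchTime, ZoneId.of("UTC")));           // 2020-06-23 04:00:00
        System.out.println(render(launchTime, ZoneId.of("Asia/Shanghai"))); // 2020-06-23 12:00:00
    }
}
```

If two pages pick different zones (say, one uses UTC explicitly and the other falls back to the JVM default), the displayed launch times diverge exactly as the two screenshots show.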
[jira] [Updated] (SPARK-26905) Revisit reserved/non-reserved keywords based on the ANSI SQL standard
[ https://issues.apache.org/jira/browse/SPARK-26905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26905: Fix Version/s: 3.1.0 > Revisit reserved/non-reserved keywords based on the ANSI SQL standard > - > > Key: SPARK-26905 > URL: https://issues.apache.org/jira/browse/SPARK-26905 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.1, 3.1.0 > > Attachments: spark-ansiNonReserved.txt, spark-keywords-list.txt, > spark-nonReserved.txt, spark-strictNonReserved.txt, > sql2016-02-nonreserved.txt, sql2016-02-reserved.txt, > sql2016-09-nonreserved.txt, sql2016-09-reserved.txt, > sql2016-14-nonreserved.txt, sql2016-14-reserved.txt
[jira] [Updated] (SPARK-31950) Extract SQL keywords from the generated parser class in TableIdentifierParserSuite
[ https://issues.apache.org/jira/browse/SPARK-31950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31950: Fix Version/s: 3.1.0 > Extract SQL keywords from the generated parser class in > TableIdentifierParserSuite > -- > > Key: SPARK-31950 > URL: https://issues.apache.org/jira/browse/SPARK-31950 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.0.1, 3.1.0
[jira] [Updated] (SPARK-31230) use statement plans in DataFrameWriter(V2)
[ https://issues.apache.org/jira/browse/SPARK-31230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31230: Fix Version/s: 3.1.0 > use statement plans in DataFrameWriter(V2) > -- > > Key: SPARK-31230 > URL: https://issues.apache.org/jira/browse/SPARK-31230 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.1, 3.1.0
[jira] [Updated] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-31584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31584: Fix Version/s: 3.1.0 > NullPointerException when parsing event log with InMemoryStore > -- > > Key: SPARK-31584 > URL: https://issues.apache.org/jira/browse/SPARK-31584 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > Attachments: errorstack.txt > > > I compiled the current branch-3.0 source and tested it on macOS. A > java.lang.NullPointerException is thrown when the conditions below are met: > # InMemoryStore is used as the kvstore when parsing the event log file (e.g., > when spark.history.store.path is unset). > # At least one stage in the event log has more tasks than > spark.ui.retainedTasks (100000 by default), so the kvstore needs to delete > extra task records. > # The job has more than one stage, so parentToChildrenMap in > InMemoryStore.java has more than one key. > The java.lang.NullPointerException is thrown at InMemoryStore.java:296, in > the method deleteParentIndex(): > {code:java} > private void deleteParentIndex(Object key) { > if (hasNaturalParentIndex) { > for (NaturalKeys v : parentToChildrenMap.values()) { > if (v.remove(asKey(key))) { > // `v` can be empty after removing the natural key and we can > remove it from > // `parentToChildrenMap`. However, `parentToChildrenMap` is a > ConcurrentMap and such > // checking and deleting can be slow. > // This method is to delete one object with certain key, let's > make it simple here. > break; > } > } > } > }{code} > In "if (v.remove(asKey(key)))", if the key is not contained in v, > "v.remove(asKey(key))" returns null, and auto-unboxing that null Boolean in > the if condition throws a NullPointerException. > An exception stack trace is attached. 
> This issue can be fixed by updating the if statement to > {code:java} > if (v.remove(asKey(key)) != null){code}
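As I understand InMemoryStore, NaturalKeys maps keys to Boolean placeholder values, so the `if` condition auto-unboxes the result of `remove()`; for an absent key, `remove()` returns null and the unboxing throws. A minimal standalone sketch of the failure mode and the fix (the map below is a stand-in, not Spark's actual class):

```java
import java.util.concurrent.ConcurrentHashMap;

public class RemoveUnboxDemo {
    public static void main(String[] args) {
        // Stand-in for InMemoryStore's NaturalKeys: a map with Boolean values.
        ConcurrentHashMap<String, Boolean> v = new ConcurrentHashMap<>();

        // Buggy pattern: "if (v.remove(key))" auto-unboxes the returned
        // Boolean; remove() returns null for an absent key, so this throws.
        try {
            if (v.remove("absent-key")) {
                System.out.println("removed");
            }
        } catch (NullPointerException e) {
            System.out.println("NPE from unboxing null");
        }

        // Fixed pattern: compare the result against null instead of unboxing.
        if (v.remove("absent-key") != null) {
            System.out.println("removed");
        } else {
            System.out.println("key was absent; no exception");
        }
    }
}
```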
[jira] [Updated] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31918: Target Version/s: 3.0.1 > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung]
[jira] [Commented] (SPARK-29226) Upgrade jackson-databind to 2.9.10 and fix vulnerabilities.
[ https://issues.apache.org/jira/browse/SPARK-29226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142631#comment-17142631 ] ThimmeGowda commented on SPARK-29226: - Hi, I am using Spark 2.4.5; can I upgrade jackson-databind from 2.6.7.3 to 2.9.10? I tried changing it as above in all the files mentioned in the PR, but got a compilation error for spark-core: maven-dependency-plugin:3.0.2:build-classpath (default-cli) @ spark-core_2.11 The dependency classpath does not have scala-reflect-2.11.12.jar Thanks > Upgrade jackson-databind to 2.9.10 and fix vulnerabilities. > --- > > Key: SPARK-29226 > URL: https://issues.apache.org/jira/browse/SPARK-29226 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.0.0 > > > The current code uses com.fasterxml.jackson.core:jackson-databind:jar:2.9.9.3, > which has known security vulnerabilities. See > https://www.tenable.com/cve/CVE-2019-16335 > This reference recommends upgrading `jackson-databind` to 2.9.10 or later.
[jira] [Comment Edited] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142620#comment-17142620 ] Hyukjin Kwon edited comment on SPARK-31918 at 6/23/20, 4:35 AM: I tested it manually with the fix I mentioned [here|https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142127&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17142127] .. let me test that case too. BTW, I just roughly tested instead of running the full tests. Some corner cases might not work when running SparkR built by R 4.0.1 on R 3.6.3. Let me test a bit more closely and share the results later. was (Author: hyukjin.kwon): I tested it manually with the fix I mentioned [here|https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142127&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17142127] .. let me test that case too. BTW, I just roughly tested instead of running the full tests. Some corner cases might not work when running SparkR built by R 4.0.1 on R 3.6.3.
[jira] [Comment Edited] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142620#comment-17142620 ] Hyukjin Kwon edited comment on SPARK-31918 at 6/23/20, 4:35 AM: I tested it manually with the fix I mentioned [here|https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142127&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17142127] .. let me test that case too. BTW, I just roughly tested instead of running the full tests. Some corner cases might not work when running SparkR built by R 4.0.1 on R 3.6.3. was (Author: hyukjin.kwon): I tested it manually with the fix I mentioned [here|https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142127&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17142127] .. let me test that case too.
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142620#comment-17142620 ] Hyukjin Kwon commented on SPARK-31918: -- I tested it manually with the fix I mentioned [here|https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142127&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17142127] .. let me test that case too.
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142618#comment-17142618 ] Shivaram Venkataraman commented on SPARK-31918: --- That's great! [~hyukjin.kwon] -- so we can get around the installation issue if we can build on R 4.0.0. However I guess we will still have the serialization issue. BTW does the serialization issue go away if we build in R 4.0.0 and run with R 3.6.3?
[jira] [Commented] (SPARK-31801) Register shuffle map output metadata with a shuffle output tracker
[ https://issues.apache.org/jira/browse/SPARK-31801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142616#comment-17142616 ] Apache Spark commented on SPARK-31801: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/28902 > Register shuffle map output metadata with a shuffle output tracker > -- > > Key: SPARK-31801 > URL: https://issues.apache.org/jira/browse/SPARK-31801 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > > Part of the design as discussed in [this > document|https://docs.google.com/document/d/1Aj6IyMsbS2sdIfHxLvIbHUNjHIWHTabfknIPoxOrTjk/edit#]. > Establish a {{ShuffleOutputTracker}} API that resides on the driver, and > handle accepting map output metadata returned by the map output writers and > send them to the output tracker module accordingly. > Requires https://issues.apache.org/jira/browse/SPARK-31798.
[jira] [Comment Edited] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142605#comment-17142605 ] Hyukjin Kwon edited comment on SPARK-31918 at 6/23/20, 4:08 AM: Okay, [~shivaram], the first option seems to be working, although it shows a warning such as the one below. I built Spark 3.0.0 with R 4.0.1, and manually downgraded to R 3.6.3. {code:java} During startup - Warning message: package ‘SparkR’ was built under R version 4.0.1 {code} I removed unrelated comments I left above. was (Author: hyukjin.kwon): Okay, [~shivaram], the first option seems working although it shows a warning such as below. I build Spark 3.0.0 with 4.0.1, and manually downgraded to R 3.6.3. {code:java} During startup - Warning message: package ‘SparkR’ was built under R version 4.0.1 {code} I removed unrelated comments I left above.
[jira] [Issue Comment Deleted] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31918: - Comment: was deleted (was: Nice, [~shivaram]. I just quickly tested, and the first option is not working. 1. Build Spark 3.0.0 in R 4.0.1 and install it from source with R 3.4.0 on another machine: {code} install.packages("SparkR_3.0.0.tar.gz", repos = NULL, type = "source") {code} {code} df <- createDataFrame(lapply(seq(100), function (e) list(value=e))) count(dapply(df, function(x) as.data.frame(x[x$value < 50,]), schema(df))) {code} It shows the same error as shown in https://cran.r-project.org/web/checks/check_results_SparkR.html 2. Build Spark 3.0.0 in R 4.0.1, load the library directly with R 3.4.0 on another machine: {code} library(SparkR, lib.loc = c(file.path("~/spark-3.0.0-bin-hadoop2.7", "R", "lib"))) {code} {code} # this error message is translated from another language. My R on Mac is in Korean Error listing packages, Error in readRDS(pfile): cannot read workspace version 3 written by R 4.0.1. R version should be 3.5+ {code} 3. Download the Spark 3.0.0 release, load the library directly with R 3.4.0 on another machine: {code} library(SparkR, lib.loc = c(file.path("~/spark-3.0.0-bin-hadoop2.7", "R", "lib"))) {code} {code} # this error message is translated from another language. My R on Mac is in Korean Error listing packages, Error in readRDS(pfile): cannot read workspace version 3 written by R 3.6.3. R version should be 3.5+ {code} )
[jira] [Issue Comment Deleted] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31918: - Comment: was deleted (was: Oh, wait, the worker should test SparkR built with R 4.0.1. In the first case, I guess R worker loaded the one from 3.0.0 download (which is R 3.6.3). Let me test it via overwriting it.)
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142605#comment-17142605 ] Hyukjin Kwon commented on SPARK-31918: -- Okay, [~shivaram], the first option seems working although it shows a warning such as below. I build Spark 3.0.0 with 4.0.1, and manually downgraded to R 3.6.3. {code:java} During startup - Warning message: package ‘SparkR’ was built under R version 4.0.1 {code} I removed unrelated comments I left above.
[jira] [Assigned] (SPARK-32064) Supporting create temporary table
[ https://issues.apache.org/jira/browse/SPARK-32064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32064: Assignee: Apache Spark > Supporting create temporary table > - > > Key: SPARK-32064 > URL: https://issues.apache.org/jira/browse/SPARK-32064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Major > > The basic code to implement the Spark native temporary table. See SPARK-32063
[jira] [Commented] (SPARK-32064) Supporting create temporary table
[ https://issues.apache.org/jira/browse/SPARK-32064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142603#comment-17142603 ] Apache Spark commented on SPARK-32064: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/28901
[jira] [Assigned] (SPARK-32064) Supporting create temporary table
[ https://issues.apache.org/jira/browse/SPARK-32064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32064: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142591#comment-17142591 ] Hyukjin Kwon commented on SPARK-31918: -- Oh, wait, the worker should test SparkR built with R 4.0.1. In the first case, I guess R worker loaded the one from 3.0.0 download (which is R 3.6.3). Let me test it via overwriting it.
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Summary: [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission (was: [K8s] Pod template from subsequent submission inadvertently applies to the ongoing submission) > [K8s] Pod template from subsequent submission inadvertently applies to > ongoing submission > --- > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submitting two different apps (app1 and app2) > with different pod templates to K8s sequentially, where app2 launches while > app1 is still ramping up all its executor pods. The unwanted result is that > some launched executor pods of app1 appear to have app2's pod template > applied. > The root cause is that app1's podspec-configmap gets overwritten by app2 > during the launch period because the configmap names of the two apps are > the same. This causes some of app1's executor pods that ramp up after app2 > launches to be inadvertently created with app2's pod template. > First, submit app1 > {code:java} > NAMESPACE NAME DATA AGE > default app1--driver-conf-map 1 9m46s > default podspec-configmap 1 12m{code} > Then submit app2 while app1 is still ramping up its executors > {code:java} > NAMESPACE NAME DATA AGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default podspec-configmap 1 13m57s{code} > > PROPOSED SOLUTION: > Properly prefix the podspec-configmap for each submitted app. 
> {code:java} > NAMESPACE NAME DATA AGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default app1--podspec-configmap 1 13m57s > default app2--podspec-configmap 1 13m57s{code}
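The proposed fix amounts to deriving the configmap name from the app's resource prefix instead of using one shared constant. A sketch of the idea (the class, helper name, and prefix scheme are illustrative, not Spark's actual code):

```java
public class PodSpecConfigMapNames {
    // Shared-name scheme (the buggy behavior): every submission writes to
    // the same configmap, so a later app overwrites an earlier one's.
    static final String SHARED_NAME = "podspec-configmap";

    // Per-app scheme (the proposed fix): prefix with the app's resource
    // prefix so concurrent submissions get distinct configmaps.
    static String perAppName(String appResourcePrefix) {
        return appResourcePrefix + "--podspec-configmap";
    }

    public static void main(String[] args) {
        System.out.println(perAppName("app1")); // app1--podspec-configmap
        System.out.println(perAppName("app2")); // app2--podspec-configmap
        // With distinct names, app2's submission can no longer clobber the
        // pod template that app1's still-launching executors read.
    }
}
```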
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to the ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Description: THE BUG: The bug is reproducible by spark-submitting two different apps (app1 and app2) with different pod templates to K8s sequentially, where app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap gets overwritten by app2 during the launch period because the configmap names of the two apps are the same. This causes some of app1's executor pods that ramp up after app2 launches to be inadvertently created with app2's pod template. First, submit app1 {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap 1 13m57s default app2--podspec-configmap 1 13m57s{code} was: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different pod templates to K8s sequentially, and app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the launching period because the configmap names of the two apps are the same. 
This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. # Launch app1 {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} # Then launch app2 while app1 is still ramping up its executors {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap 1 13m57s default app2--podspec-configmap 1 13m57s{code}
[jira] [Comment Edited] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142417#comment-17142417 ] YoungGyu Chun edited comment on SPARK-31998 at 6/23/20, 3:14 AM: - [~fan_li_ya] [~kou] I assume that this change will be applied to v1.0. Let us know when v1.0 will be released. was (Author: younggyuchun): [~fan_li_ya] [~kou] let us know when v1.0 will be released. > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we moved the class ArrowBuf from the package io.netty.buffer to > org.apache.arrow.memory. So after upgrading the Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32067) [K8s] Pod template from a subsequent submission inadvertently applies to the ongoing submission
James Yu created SPARK-32067: Summary: [K8s] Pod template from a subsequent submission inadvertently applies to the ongoing submission Key: SPARK-32067 URL: https://issues.apache.org/jira/browse/SPARK-32067 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0, 2.4.6 Reporter: James Yu THE BUG: The bug is reproducible by spark-submitting two different apps (app1 and app2) with different pod templates to K8s sequentially, such that app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 end up with app2's pod template applied. The root cause is that app1's podspec-configmap gets overwritten by app2 during the launch period, because the configmap names of the two apps are identical. As a result, any app1 executor pods ramped up after app2 is launched are inadvertently created with app2's pod template. # Launch app1 {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} # Then launch app2 while app1 is still ramping up its executors {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACE NAME DATA AGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap 1 13m57s default app2--podspec-configmap 1 13m57s{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
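The proposed fix above amounts to deriving the podspec configmap name from a per-application resource prefix, the way the driver conf map already is. A minimal sketch in Python (the helper name and prefix scheme are illustrative assumptions, not Spark's actual Kubernetes submission code):

```python
# Hypothetical helper: derive per-app Kubernetes configmap names so that two
# concurrent spark-submits never share a podspec configmap.

def configmap_names(app_resource_prefix):
    """Return the driver conf map and podspec configmap names for one app."""
    return {
        "driver_conf_map": f"{app_resource_prefix}-driver-conf-map",
        # Before the fix, the podspec configmap was the constant name
        # "podspec-configmap", shared by every submission in the namespace.
        "podspec_configmap": f"{app_resource_prefix}-podspec-configmap",
    }

app1 = configmap_names("app1-")
app2 = configmap_names("app2-")
# With per-app prefixes, app2 can no longer overwrite app1's podspec configmap.
assert app1["podspec_configmap"] != app2["podspec_configmap"]
```

Since ConfigMap names must be unique only within a namespace, any collision-free per-submission prefix (such as the one already used for the driver conf map) would suffice.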
[jira] [Comment Edited] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142417#comment-17142417 ] YoungGyu Chun edited comment on SPARK-31998 at 6/23/20, 3:12 AM: - [~fan_li_ya] [~kou] let us know when v1.0 will be released. was (Author: younggyuchun): I will be working on this when v1.0 is out > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32059: Issue Type: Improvement (was: Bug) > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Yin >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet 
[id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
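The two plans above differ only in the FileScan's ReadSchema: the desired plan pushes the `a.name.first` access through the Window operator so that only that nested leaf is read from Parquet. As a rough illustration of what nested schema pruning computes (a hypothetical sketch, not Spark's actual SchemaPruning rule):

```python
# Hypothetical illustration of nested schema pruning: given a schema and the
# dotted field paths a query actually touches, keep only those leaves.

def prune(schema, paths):
    """schema: dict mapping field name -> nested dict (struct) or type string.
    paths: iterable of dotted paths, e.g. {"name.first", "address"}."""
    pruned = {}
    for path in paths:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue
        if rest:
            # Recurse into the struct and merge the surviving sub-fields.
            pruned.setdefault(head, {}).update(prune(schema[head], {rest}))
        else:
            pruned[head] = schema[head]
    return pruned

contact = {"id": "int",
           "name": {"first": "string", "middle": "string", "last": "string"},
           "address": "string"}
# The window query needs only name.first plus the partition/order keys.
needed = {"name.first", "address", "id"}
assert prune(contact, needed) == \
    {"name": {"first": "string"}, "address": "string", "id": "int"}
```

In the current plan the scan reads the whole `name` struct; in the desired plan the struct is pruned to `first` before the scan, which is exactly the reduction the ticket asks for.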
[jira] [Commented] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142571#comment-17142571 ] Apache Spark commented on SPARK-32056: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/28900 > Repartition by key should support partition coalesce for AQE > > > Key: SPARK-32056 > URL: https://issues.apache.org/jira/browse/SPARK-32056 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark release 3.0.0 >Reporter: koert kuipers >Priority: Minor > > when adaptive query execution is enabled the following expression should > support coalescing of partitions: > {code:java} > dataframe.repartition(col("somecolumn")) {code} > currently it does not because it simply calls the repartition implementation > where the number of partitions is specified: > {code:java} > def repartition(partitionExprs: Column*): Dataset[T] = { > repartition(sparkSession.sessionState.conf.numShufflePartitions, > partitionExprs: _*) > }{code} > and repartition with the number of partitions specified does not allow for > coalescing of partitions (since this would break the user's expectation that it > will have the number of partitions specified). 
> for more context see the discussion here: > [https://github.com/apache/spark/pull/27986] > a simple test to confirm that repartition by key does not support coalescing > of partitions can be added in AdaptiveQueryExecSuite like this (it currently > fails): > {code:java} > test("SPARK-32056 repartition has less partitions for small data when > adaptiveExecutionEnabled") { > Seq(true, false).foreach { enableAQE => > withSQLConf( > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> enableAQE.toString, > SQLConf.SHUFFLE_PARTITIONS.key -> "50", > SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "50", > SQLConf.SHUFFLE_PARTITIONS.key -> "50") { > val partitionsNum = (1 to 10).toDF.repartition($"value") > .rdd.collectPartitions().length > if (enableAQE) { > assert(partitionsNum < 50) > } else { > assert(partitionsNum === 50) > } > } > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
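For intuition, AQE's partition coalescing merges adjacent small shuffle partitions until a size target is met, which is the behaviour a plain `repartition(col(...))` cannot benefit from today. A toy model of that behaviour (assumed greedy merging for illustration; Spark's actual algorithm lives in its coalesce-partitions rule and may differ in detail):

```python
# Toy model of AQE-style partition coalescing: greedily merge adjacent
# shuffle partitions until each merged partition reaches a target size.

def coalesce_partitions(sizes, target):
    merged, current = [], 0
    for size in sizes:
        current += size
        if current >= target:
            merged.append(current)
            current = 0
    if current:
        merged.append(current)  # trailing partial partition
    return merged

# 50 tiny shuffle partitions (the static spark.sql.shuffle.partitions value),
# each holding ~2 bytes, coalesced toward a 16-byte target:
sizes = [2] * 50
out = coalesce_partitions(sizes, 16)
assert len(out) < 50           # far fewer partitions than the static 50
assert sum(out) == sum(sizes)  # no data lost by merging
```

This is why the test in the description expects `partitionsNum < 50` when AQE is enabled: for small data, most of the 50 shuffle partitions are nearly empty and can be merged.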
[jira] [Assigned] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32056: Assignee: Apache Spark > Repartition by key should support partition coalesce for AQE > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32056: Assignee: (was: Apache Spark) > Repartition by key should support partition coalesce for AQE > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142570#comment-17142570 ] Apache Spark commented on SPARK-32056: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/28900 > Repartition by key should support partition coalesce for AQE > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32065) Supporting analyze temporary table
Lantao Jin created SPARK-32065: -- Summary: Supporting analyze temporary table Key: SPARK-32065 URL: https://issues.apache.org/jira/browse/SPARK-32065 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Lantao Jin Supporting analyze temporary table -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142566#comment-17142566 ] Hyukjin Kwon commented on SPARK-31918: -- Nice, [~shivaram]. I just quickly tested, and the first option is not working. 1. Build Spark 3.0.0 in R 4.0.1 and install it from source with R 3.4.0 in another machine: {code} install.packages("SparkR_3.0.0.tar.gz", repos = NULL, type = "source") {code} {code} df <- createDataFrame(lapply(seq(100), function (e) list(value=e))) count(dapply(df, function(x) as.data.frame(x[x$value < 50,]), schema(df))) {code} It shows the same error as shown in https://cran.r-project.org/web/checks/check_results_SparkR.html 2. Build Spark 3.0.0 in R 4.0.1, loads the library directly with R 3.4.0 in another machine: {code} library(SparkR, lib.loc = c(file.path("~/spark-3.0.0-bin-hadoop2.7", "R", "lib"))) {code} {code} # this error message is translated from another language. My R in Mac is in Korean Error listing packages, Error in readRDS(pfile): cannot read workspace version 3 written by R 4.0.1. R version should be 3.5+ {code} 3. Download Spark 3.0.0 release, loads the library directly with R 3.4.0 in another machine: {code} library(SparkR, lib.loc = c(file.path("~/spark-3.0.0-bin-hadoop2.7", "R", "lib"))) {code} {code} # this error message is translated from another language. My R in Mac is in Korean Error listing packages, Error in readRDS(pfile): cannot read workspace version 3 written by R 3.6.3. R version should be 3.5+ {code} > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. 
with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32066) Supporting create temporary table LIKE
Lantao Jin created SPARK-32066: -- Summary: Supporting create temporary table LIKE Key: SPARK-32066 URL: https://issues.apache.org/jira/browse/SPARK-32066 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Lantao Jin Supporting create temporary table LIKE -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32064) Supporting create temporary table
Lantao Jin created SPARK-32064: -- Summary: Supporting create temporary table Key: SPARK-32064 URL: https://issues.apache.org/jira/browse/SPARK-32064 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Lantao Jin The basic code to implement the Spark native temporary table. See SPARK-32063 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142558#comment-17142558 ] Shivaram Venkataraman commented on SPARK-31918: --- I can confirm that with build from source of Spark 3.0.0 and R 4.0.2, I see the following error while building vignettes. {{R worker produced errors: Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue}} > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32062) Reset listenerRegistered in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32062: Assignee: Apache Spark > Reset listenerRegistered in SparkSession > > > Key: SPARK-32062 > URL: https://issues.apache.org/jira/browse/SPARK-32062 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32062) Reset listenerRegistered in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32062: Assignee: (was: Apache Spark) > Reset listenerRegistered in SparkSession > > > Key: SPARK-32062 > URL: https://issues.apache.org/jira/browse/SPARK-32062 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32062) Reset listenerRegistered in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142552#comment-17142552 ] Apache Spark commented on SPARK-32062: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/28899 > Reset listenerRegistered in SparkSession > > > Key: SPARK-32062 > URL: https://issues.apache.org/jira/browse/SPARK-32062 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32062) Reset listenerRegistered in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142553#comment-17142553 ] Apache Spark commented on SPARK-32062: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/28899 > Reset listenerRegistered in SparkSession > > > Key: SPARK-32062 > URL: https://issues.apache.org/jira/browse/SPARK-32062 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32063) Spark native temporary table
Lantao Jin created SPARK-32063: -- Summary: Spark native temporary table Key: SPARK-32063 URL: https://issues.apache.org/jira/browse/SPARK-32063 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Lantao Jin Many databases and data warehouse SQL engines support temporary tables. A temporary table, as its name implies, is a short-lived table whose lifetime is limited to the current session. In Spark, there is no temporary table: the DDL “CREATE TEMPORARY TABLE AS SELECT” creates a temporary view instead. A temporary view is totally different from a temporary table. A temporary view is just a VIEW; it doesn’t materialize data in storage. This has the following shortcomings: # A view does not improve performance. Materializing intermediate data in temporary tables accelerates complex queries, especially in an ETL pipeline. # A view that calls other views can cause severe performance issues. Executing a very complex view may even fail in Spark. # A temporary view has no database namespace. In some complex ETL pipelines or data warehouse applications, working without a database prefix is inconvenient, and some tables are only needed in the current session. More details are described in the [Design Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
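The view-versus-table distinction argued above boils down to whether the query result is materialized once or recomputed on every read. A toy contrast (conceptual sketch only; these classes are not Spark APIs):

```python
# Conceptual contrast between a temporary view and a temporary table.
# These classes are illustrative only, not part of any Spark API.

class TempView:
    """Stores only the query; every read re-runs it."""
    def __init__(self, query):
        self.query = query
        self.runs = 0
    def read(self):
        self.runs += 1
        return self.query()

class TempTable:
    """Materializes the query result once; later reads hit storage."""
    def __init__(self, query):
        self.data = query()  # computed once, at CREATE TEMPORARY TABLE time
    def read(self):
        return self.data

expensive_calls = []
def expensive_query():
    expensive_calls.append(1)  # stand-in for a costly ETL stage
    return [1, 2, 3]

view = TempView(expensive_query)
view.read(); view.read()            # recomputes twice
table = TempTable(expensive_query)  # computes once
table.read(); table.read()          # reads are free
assert len(expensive_calls) == 3
```

This is the first shortcoming in the list above: a pipeline that reads an intermediate result N times pays for the query N times through a view, but once through a materialized temporary table.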
[jira] [Created] (SPARK-32062) Reset listenerRegistered in SparkSession
ulysses you created SPARK-32062: --- Summary: Reset listenerRegistered in SparkSession Key: SPARK-32062 URL: https://issues.apache.org/jira/browse/SPARK-32062 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: ulysses you -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142533#comment-17142533 ] Toby Harradine edited comment on SPARK-25244 at 6/23/20, 2:09 AM: -- Hi, I've just come across this issue in PySpark 2.4.6 (Spark 2.4.4), quite a difficult bug to work around when trying to validate datetimes in unit tests, which run on different machines with different timezones (and I'd prefer not to require use of Pandas to run unit tests). Was this issue closed without resolution? _Edit: Just tested on PySpark 3.0.0 with same outcome_. Regards, Toby was (Author: toby.harradine): Hi, I've just come across this issue in PySpark 2.4.6 (Spark 2.4.4), quite a difficult bug to work around when trying to validate datetimes in unit tests, which run on different machines with different timezones (and I'd prefer not to require use of Pandas to run unit tests). Was this issue closed without resolution? Regards, Toby > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-25244 > URL: https://issues.apache.org/jira/browse/SPARK-25244 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Anton Daitche >Priority: Major > Labels: bulk-closed > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. 
> This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. > The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySparks `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would try to come up > with a patch. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142533#comment-17142533 ] Toby Harradine commented on SPARK-25244: Hi, I've just come across this issue in PySpark 2.4.6 (Spark 2.4.4), quite a difficult bug to work around when trying to validate datetimes in unit tests, which run on different machines with different timezones (and I'd prefer not to require use of Pandas to run unit tests). Was this issue closed without resolution? Regards, Toby > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-25244 > URL: https://issues.apache.org/jira/browse/SPARK-25244 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Anton Daitche >Priority: Major > Labels: bulk-closed > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. > This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. 
> The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySpark's `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would try to come up > with a patch. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
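As a stopgap for unit tests like the ones the commenter describes (this is not Spark's fix; `to_utc_naive` is a hypothetical helper), one can pin the process timezone and undo the system-zone rendering by round-tripping through the epoch. A minimal sketch, assuming a POSIX system where `time.tzset()` is available:

```python
import os
import time
from datetime import datetime

# Reproduce the reporter's environment (Europe/Berlin); tzset() is POSIX-only.
os.environ["TZ"] = "Europe/Berlin"
time.tzset()

def to_utc_naive(local_naive: datetime) -> datetime:
    """Reinterpret a naive datetime rendered in the system zone as UTC."""
    # mktime() interprets the naive struct_time in the system zone
    # (resolving DST), so the epoch round-trip removes the local offset.
    epoch = time.mktime(local_naive.timetuple())
    return datetime.utcfromtimestamp(epoch)

# The value collect() returned in the report (03:00 CEST) maps back to 01:00 UTC:
print(to_utc_naive(datetime(2018, 6, 1, 3, 0, 0)))  # 2018-06-01 01:00:00
```

This only normalizes whole seconds (`timetuple()` drops microseconds), which is usually enough for test fixtures.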
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142532#comment-17142532 ] Shivaram Venkataraman commented on SPARK-31918: --- [~hyukjin.kwon] I have R 4.0.2 and will try to do a fresh build from source of Spark 3.0.0 > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32061) potential regression if use memoryUsage instead of numRows
zhengruifeng created SPARK-32061: Summary: potential regression if use memoryUsage instead of numRows Key: SPARK-32061 URL: https://issues.apache.org/jira/browse/SPARK-32061 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.1.0 Reporter: zhengruifeng 1, if the `memoryUsage` is improperly set, for example, too small to store an instance; 2, the blockify+GMM reuses two matrices whose shape is related to the current blockSize: {code:java} @transient private lazy val auxiliaryProbMat = DenseMatrix.zeros(blockSize, k) @transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, numFeatures) {code} When implementing blockify+GMM, I found that if I do not pre-allocate those matrices, there is a serious regression (maybe 3~4x slower; I forgot the exact numbers); 3, in MLP, three pre-allocated objects are also related to numRows: {code:java} if (ones == null || ones.length != delta.cols) ones = BDV.ones[Double](delta.cols) // TODO: allocate outputs as one big array and then create BDMs from it if (outputs == null || outputs(0).cols != currentBatchSize) { ... // TODO: allocate deltas as one big array and then create BDMs from it if (deltas == null || deltas(0).cols != currentBatchSize) { deltas = new Array[BDM[Double]](layerModels.length) ... {code} I am not very familiar with the implementation of MLP and could not find any documentation about this pre-allocation. But I guess there may be a regression if we disable this pre-allocation, since those objects look relatively big. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
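The pre-allocation pattern the ticket describes (allocate scratch matrices once, sized from the block size, and overwrite them per block) can be sketched in pure Python. All names here are illustrative analogues, not Spark's actual code:

```python
# Hypothetical analogue of auxiliaryProbMat: a scratch buffer sized once
# from the block size and overwritten in place for every block, instead of
# being reallocated on each call.
block_size, k, num_features = 4, 3, 5
aux_prob = [[0.0] * k for _ in range(block_size)]  # ~ auxiliaryProbMat

def process_block(block, weights):
    """Multiply block (block_size x num_features) by weights
    (num_features x k), writing the result into the pre-allocated buffer."""
    for i in range(block_size):
        for j in range(k):
            aux_prob[i][j] = sum(block[i][f] * weights[f][j]
                                 for f in range(num_features))
    return aux_prob  # the same object on every call: no per-block allocation

block = [[1.0] * num_features for _ in range(block_size)]
weights = [[1.0] * k for _ in range(num_features)]
out = process_block(block, weights)  # every entry is 5.0
```

The buffer shape depends on `block_size`, which is why switching the blocking criterion from `numRows` to `memoryUsage` (a variable row count per block) would defeat this reuse.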
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142520#comment-17142520 ] Dongjoon Hyun commented on SPARK-31918: --- Unfortunately, no~ I downgraded to R 3.5.2 on both my MacPro and MacBook. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Description: |performace test in https://issues.apache.org/jira/browse/SPARK-31783, Huber loss seems start to diverge since 50 iters. {code:java} for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { Thread.sleep(1) val hlir = new LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) val start = System.currentTimeMillis val model = hlir.setBlockSize(size).fit(df) val end = System.currentTimeMillis println((model.uid, size, iter, end - start, model.summary.objectiveHistory.last, model.summary.totalIterations, model.coefficients.toString.take(100))) }{code}| | | | | | | | | | | | | | | | | |result:| |blockSize=1| |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)| |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)| |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)| blockSize=4| |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)| |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)| |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)| blockSize=16| |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)| 
|(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)| |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)| blockSize=64| |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)| |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)| |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)| was: |Huber loss seems start to diverge since 50 iters. {code:java} for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { Thread.sleep(1) val hlir = new LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) val start = System.currentTimeMillis val model = hlir.setBlockSize(size).fit(df) val end = System.currentTimeMillis println((model.uid, size, iter, end - start, model.summary.objectiveHistory.last, model.summary.totalIterations, model.coefficients.toString.take(100))) }{code}| | | | | | | | | | | | | | | | | |result:| |blockSize=1| |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)| |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)| |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)| | blockSize=4| 
|(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)| |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)| |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)| | blockSize=16| |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)| |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)| |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)| | blockSize=64|
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Parent: SPARK-30641 Issue Type: Sub-task (was: Bug) > Huber loss Convergence > -- > > Key: SPARK-32060 > URL: https://issues.apache.org/jira/browse/SPARK-32060 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > |performace test in https://issues.apache.org/jira/browse/SPARK-31783, > Huber loss seems start to diverge since 50 iters. > {code:java} > for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { > Thread.sleep(1) > val hlir = new > LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) > val start = System.currentTimeMillis > val model = hlir.setBlockSize(size).fit(df) > val end = System.currentTimeMillis > println((model.uid, size, iter, end - start, > model.summary.objectiveHistory.last, model.summary.totalIterations, > model.coefficients.toString.take(100))) > }{code}| > | | > | | > | | > | | > | | > | | > | | > | | > |result:| > |blockSize=1| > |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)| > |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)| > |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)| > blockSize=4| > |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)| > |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)| > 
|(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)| > blockSize=16| > |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)| > |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)| > |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)| > blockSize=64| > |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)| > |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)| > |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32060) Huber loss Convergence
zhengruifeng created SPARK-32060: Summary: Huber loss Convergence Key: SPARK-32060 URL: https://issues.apache.org/jira/browse/SPARK-32060 Project: Spark Issue Type: Bug Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng |Huber loss seems start to diverge since 50 iters. {code:java} for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { Thread.sleep(1) val hlir = new LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) val start = System.currentTimeMillis val model = hlir.setBlockSize(size).fit(df) val end = System.currentTimeMillis println((model.uid, size, iter, end - start, model.summary.objectiveHistory.last, model.summary.totalIterations, model.coefficients.toString.take(100))) }{code}| | | | | | | | | | | | | | | | | |result:| |blockSize=1| |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)| |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)| |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)| | blockSize=4| |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)| |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)| |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)| | blockSize=16| |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)| 
|(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)| |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)| | blockSize=64| |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)| |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)| |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
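For reference on the objective whose history is diverging above, the per-residual Huber loss is quadratic near zero and linear in the tails. A minimal sketch; the 1.35 default for `epsilon` is an assumption matching Spark's documented `LinearRegression` default, not taken from this ticket:

```python
def huber_loss(residual: float, epsilon: float = 1.35) -> float:
    """Huber loss: quadratic for |r| <= epsilon, linear beyond it."""
    a = abs(residual)
    if a <= epsilon:
        return 0.5 * a * a
    # Linear tail, shifted so the two pieces join continuously at epsilon.
    return epsilon * (a - 0.5 * epsilon)
```

The linear tail makes large residuals contribute bounded gradients, which is why small numerical differences between block sizes can steer L-BFGS onto different paths once residuals cross the `epsilon` boundary.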
[jira] [Assigned] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32059: Assignee: (was: Apache Spark) > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Yin >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet 
[id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142515#comment-17142515 ] Apache Spark commented on SPARK-32059: -- User 'frankyin-factual' has created a pull request for this issue: https://github.com/apache/spark/pull/28898 > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Yin >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange 
hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32059: Assignee: Apache Spark > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Yin >Assignee: Apache Spark >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet 
[id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142514#comment-17142514 ] Apache Spark commented on SPARK-32059: -- User 'frankyin-factual' has created a pull request for this issue: https://github.com/apache/spark/pull/28898 > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Yin >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange 
hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27148) Support CURRENT_TIME and LOCALTIME when ANSI mode enabled
[ https://issues.apache.org/jira/browse/SPARK-27148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YoungGyu Chun resolved SPARK-27148. --- Resolution: Later > Support CURRENT_TIME and LOCALTIME when ANSI mode enabled > - > > Key: SPARK-27148 > URL: https://issues.apache.org/jira/browse/SPARK-27148 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Major > > CURRENT_TIME and LOCALTIME should be supported in the ANSI standard; > {code:java} > postgres=# select CURRENT_TIME; > timetz > > 16:45:43.398109+09 > (1 row) > postgres=# select LOCALTIME; > time > > 16:45:48.60969 > (1 row){code} > Before this, we need to support TIME types (java.sql.Time). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
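As background on the two semantics the description contrasts (a hedged Python sketch, not Spark's API): CURRENT_TIME is a time-of-day that keeps its zone offset (Postgres's `timetz` above), while LOCALTIME is a zone-less local time-of-day:

```python
from datetime import datetime, timezone

# Illustrative only: model CURRENT_TIME as an offset-aware time-of-day and
# LOCALTIME as a naive (zone-less) time-of-day.
current_time = datetime.now(timezone.utc).timetz()  # like Postgres timetz
local_time = datetime.now().time()                  # like Postgres time

assert current_time.tzinfo is not None  # CURRENT_TIME carries an offset
assert local_time.tzinfo is None        # LOCALTIME does not
```

Supporting either in Spark SQL would first require a TIME data type, which is the dependency the description notes.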
[jira] [Commented] (SPARK-27148) Support CURRENT_TIME and LOCALTIME when ANSI mode enabled
[ https://issues.apache.org/jira/browse/SPARK-27148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142511#comment-17142511 ] YoungGyu Chun commented on SPARK-27148: --- A similar ticket was concluded by the following comment by [~rxin]. Let's close this for now [#25678 (comment)|https://github.com/apache/spark/pull/25678#issuecomment-531585556] > Support CURRENT_TIME and LOCALTIME when ANSI mode enabled > - > > Key: SPARK-27148 > URL: https://issues.apache.org/jira/browse/SPARK-27148 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Major > > CURRENT_TIME and LOCALTIME should be supported in the ANSI standard; > {code:java} > postgres=# select CURRENT_TIME; > timetz > > 16:45:43.398109+09 > (1 row) > postgres=# select LOCALTIME; > time > > 16:45:48.60969 > (1 row){code} > Before this, we need to support TIME types (java.sql.Time). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7101) Spark SQL should support java.sql.Time
[ https://issues.apache.org/jira/browse/SPARK-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YoungGyu Chun resolved SPARK-7101. -- Resolution: Later > Spark SQL should support java.sql.Time > -- > > Key: SPARK-7101 > URL: https://issues.apache.org/jira/browse/SPARK-7101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 > Environment: All >Reporter: Peter Hagelund >Priority: Major > > Several RDBMSes support the TIME data type; for more exact mapping between > those and Spark SQL, support for java.sql.Time with an associated > DataType.TimeType would be helpful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7101) Spark SQL should support java.sql.Time
[ https://issues.apache.org/jira/browse/SPARK-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142510#comment-17142510 ] YoungGyu Chun commented on SPARK-7101: -- A similar ticket was concluded by the following comment by [~rxin]. Let's close this for now [#25678 (comment)|https://github.com/apache/spark/pull/25678#issuecomment-531585556] > Spark SQL should support java.sql.Time > -- > > Key: SPARK-7101 > URL: https://issues.apache.org/jira/browse/SPARK-7101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 > Environment: All >Reporter: Peter Hagelund >Priority: Major > > Several RDBMSes support the TIME data type; for more exact mapping between > those and Spark SQL, support for java.sql.Time with an associated > DataType.TimeType would be helpful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
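If Spark did add a TIME type backed by java.sql.Time as requested above, the JDBC-side mapping is already straightforward in the JDK: java.sql.Time round-trips through java.time.LocalTime. A sketch of that correspondence (illustrative helper names, not existing Spark code):

```java
import java.sql.Time;
import java.time.LocalTime;

public class TimeTypeSketch {
    // A TIME column value as a JDBC driver would surface it.
    // Note: Time.valueOf drops sub-second precision.
    public static Time toSqlTime(LocalTime t) {
        return Time.valueOf(t);
    }

    public static LocalTime fromSqlTime(Time t) {
        return t.toLocalTime();
    }

    public static void main(String[] args) {
        LocalTime t = LocalTime.of(16, 45, 48);
        System.out.println(toSqlTime(t));               // 16:45:48
        System.out.println(fromSqlTime(toSqlTime(t)));  // 16:45:48
    }
}
```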
[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC
[ https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32013:
--------------------------------------
    Description:
For ETL workloads, there is a common requirement to run SQL statements before/after reading/writing over JDBC. Here are examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently, the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. However, this query is only for reading data, and it does not support the common examples listed above.

On the other hand, the `sessionInitStatement` option is available before writing data from a DataFrame. This option runs custom SQL to implement session-initialization code. Since it runs per session, it cannot be used for non-idempotent operations.

If Spark could execute SQL statements against JDBC data sources before/after reading/writing over JDBC, it would cover a lot of common use cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]

  was:
For ETL workloads, there is a common requirement to run SQL statements before/after reading/writing over JDBC. Here are examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently, the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. However, this query is only for reading data, and it does not support the common examples listed above.

On the other hand, the `sessionInitStatement` option is available before writing data from a DataFrame. This option runs custom SQL to implement session-initialization code. Since it runs per session, it cannot be used for write operations.

If Spark could execute SQL statements against JDBC data sources before/after reading/writing over JDBC, it would cover a lot of common use cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]


> Support query execution before/after reading/writing over JDBC
> ---------------------------------------------------------------
>
>                 Key: SPARK-32013
>                 URL: https://issues.apache.org/jira/browse/SPARK-32013
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Noritaka Sekiyama
>            Priority: Major
>
> For ETL workloads, there is a common requirement to run SQL statements
> before/after reading/writing over JDBC. Here are examples:
> - Create a view with specific conditions
> - Delete/update some records
> - Truncate a table (already possible via the `truncate` option)
> - Execute a stored procedure (also requested in SPARK-32014)
>
> Currently, the `query` option is available to specify a SQL statement against
> a JDBC data source when loading data as a DataFrame. However, this query is
> only for reading data, and it does not support the common examples listed
> above.
>
> On the other hand, the `sessionInitStatement` option is available before
> writing data from a DataFrame. This option runs custom SQL to implement
> session-initialization code. Since it runs per session, it cannot be used for
> non-idempotent operations.
>
> If Spark could execute SQL statements against JDBC data sources before/after
> reading/writing over JDBC, it would cover a lot of common use cases.
>
> Note: Databricks' old Redshift connector has similar options, `preactions`
> and `postactions`. [https://github.com/databricks/spark-redshift]


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC
[ https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-32013:
--------------------------------------
    Description:
For ETL workloads, there is a common requirement to run SQL statements before/after reading/writing over JDBC. Here are examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently, the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. However, this query is only for reading data, and it does not support the common examples listed above.

On the other hand, the `sessionInitStatement` option is available before writing data from a DataFrame. This option runs custom SQL to implement session-initialization code. Since it runs per session, it cannot be used for write operations.

If Spark could execute SQL statements against JDBC data sources before/after reading/writing over JDBC, it would cover a lot of common use cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]

  was:
For ETL workloads, there is a common requirement to run SQL statements before/after reading/writing over JDBC. Here are examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure (also requested in SPARK-32014)

Currently, the `query` option is available to specify a SQL statement against a JDBC data source when loading data as a DataFrame. [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] However, this query is only for reading data, and it does not support the common examples listed above.

If Spark could execute SQL statements against JDBC data sources before/after reading/writing over JDBC, it would cover a lot of common use cases.

Note: Databricks' old Redshift connector has similar options, `preactions` and `postactions`. [https://github.com/databricks/spark-redshift]


> Support query execution before/after reading/writing over JDBC
> ---------------------------------------------------------------
>
>                 Key: SPARK-32013
>                 URL: https://issues.apache.org/jira/browse/SPARK-32013
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Noritaka Sekiyama
>            Priority: Major
>
> For ETL workloads, there is a common requirement to run SQL statements
> before/after reading/writing over JDBC. Here are examples:
> - Create a view with specific conditions
> - Delete/update some records
> - Truncate a table (already possible via the `truncate` option)
> - Execute a stored procedure (also requested in SPARK-32014)
>
> Currently, the `query` option is available to specify a SQL statement against
> a JDBC data source when loading data as a DataFrame. However, this query is
> only for reading data, and it does not support the common examples listed
> above.
>
> On the other hand, the `sessionInitStatement` option is available before
> writing data from a DataFrame. This option runs custom SQL to implement
> session-initialization code. Since it runs per session, it cannot be used for
> write operations.
>
> If Spark could execute SQL statements against JDBC data sources before/after
> reading/writing over JDBC, it would cover a lot of common use cases.
>
> Note: Databricks' old Redshift connector has similar options, `preactions`
> and `postactions`. [https://github.com/databricks/spark-redshift]


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
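The requested behaviour can be sketched generically: run one list of SQL statements, then the actual read or write, then another list. The `preactions`/`postactions` names below mirror the spark-redshift connector and are hypothetical, not an existing Spark option; `runSql` stands in for `Statement.execute` on a live JDBC connection:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class JdbcActionsSketch {
    // Runs preactions, then the actual JDBC read/write, then postactions.
    public static <T> T withActions(List<String> preactions,
                                    List<String> postactions,
                                    Consumer<String> runSql,
                                    Supplier<T> readOrWrite) {
        for (String sql : preactions) {
            runSql.accept(sql);           // e.g. CREATE VIEW, DELETE ...
        }
        T result = readOrWrite.get();     // the DataFrame load/save itself
        for (String sql : postactions) {
            runSql.accept(sql);           // e.g. DROP VIEW, CALL proc()
        }
        return result;
    }

    public static void main(String[] args) {
        // Record the order of operations instead of talking to a real database.
        List<String> executed = new ArrayList<>();
        String rows = withActions(
            List.of("CREATE VIEW v AS SELECT * FROM t WHERE ok"),
            List.of("DROP VIEW v"),
            executed::add,
            () -> { executed.add("<read v>"); return "42 rows"; });
        System.out.println(rows);
        System.out.println(executed);
    }
}
```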
[jira] [Comment Edited] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142495#comment-17142495 ] Hyukjin Kwon edited comment on SPARK-31918 at 6/23/20, 12:06 AM: - Ah, yeah. That one I read [it in the release notes|https://cran.r-project.org/doc/manuals/r-devel/NEWS.html] I was freshly building and testing the package with R 4.0.1 so that was why the error messages were different ... {quote} > Packages need to be (re-)installed under this version (4.0.0) of *R*. {quote} I have two environments in my local. One is R 4.0.1, the other one is R 3.4.0. Although it officially says R 3.1+, we deprecated R < 3.4 at SPARK-26014. I will test the first option out, and come back. BTW, would you be able to test it out with a fresh build with R 4.0.0? If the issue I faced isn't my env issue, it looks tricky to handle ... [~dongjoon] do you have an existing SparkR dev env to test with R 4.0? was (Author: hyukjin.kwon): Ah, yeah. That one I read [it in the release notes|[https://cran.r-project.org/doc/manuals/r-devel/NEWS.html]] I was freshly building and testing the package with R 4.0.1 so that was why the error messages were different ... {quote} > Packages need to be (re-)installed under this version (4.0.0) of *R*. {quote} I have two environments in my local. One is R 4.0.1, the other one is R 3.4.0. Although it officially says R 3.1+, we deprecated R < 3.4 at SPARK-26014. I will test the first option out, and come back. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. 
> However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142495#comment-17142495 ] Hyukjin Kwon commented on SPARK-31918: -- Ah, yeah. That one I read [it in the release notes|[https://cran.r-project.org/doc/manuals/r-devel/NEWS.html]] I was freshly building and testing the package with R 4.0.1 so that was why the error messages were different ... {quote} > Packages need to be (re-)installed under this version (4.0.0) of *R*. {quote} I have two environments in my local. One is R 4.0.1, the other one is R 3.4.0. Although it officially says R 3.1+, we deprecated R < 3.4 at SPARK-26014. I will test the first option out, and come back. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Yin updated SPARK-32059: -- Description: Using tables and data structures in `SchemaPruningSuite.scala` {code:java} // code placeholder case class FullName(first: String, middle: String, last: String) case class Company(name: String, address: String) case class Employer(id: Int, company: Company) case class Contact( id: Int, name: FullName, address: String, pets: Int, friends: Array[FullName] = Array.empty, relatives: Map[String, FullName] = Map.empty, employer: Employer = null, relations: Map[FullName, String] = Map.empty) case class Department( depId: Int, depName: String, contactId: Int, employer: Employer) {code} The query to run: {code:java} // code placeholder select a.name.first from (select row_number() over (partition by address order by id desc) as __rank, contacts.* from contacts) a where a.name.first = 'A' AND a.__rank = 1 {code} The current physical plan: {code:java} // code placeholder == Physical Plan == *(3) Project [name#46.first AS first#74] +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND (name#46.first = A)) AND (__rank#71 = 1)) +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank#71], [address#47], [id#45 DESC NULLS LAST] +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], false, 0 +- Exchange hashpartitioning(address#47, 5), true, [id=#52] +- *(1) Project [id#45, name#46, address#47] +- FileScan parquet [id#45,name#46,address#47,p#53] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string> {code} The desired physical plan: {code:java} // code placeholder == Physical Plan == *(3) Project [_gen_alias_77#77 AS 
first#74] +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank#71], [address#47], [id#45 DESC NULLS LAST] +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], false, 0 +- Exchange hashpartitioning(address#47, 5), true, [id=#52] +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, address#47] +- FileScan parquet [id#45,name#46,address#47,p#53] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string> {code} was: Using tables and data structures in `SchemaPruningSuite.scala` ``` case class FullName(first: String, middle: String, last: String) case class Company(name: String, address: String) case class Employer(id: Int, company: Company) case class Contact( id: Int, name: FullName, address: String, pets: Int, friends: Array[FullName] = Array.empty, relatives: Map[String, FullName] = Map.empty, employer: Employer = null, relations: Map[FullName, String] = Map.empty) case class Department( depId: Int, depName: String, contactId: Int, employer: Employer) ``` The query to run: ` select a.name.first from (select row_number() over (partition by address order by id desc) as __rank, contacts.* from contacts) a where a.name.first = 'A' AND a.__rank = 1` The current physical plan: ``` == Physical Plan == *(3) Project [name#46.first AS first#74] +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND (name#46.first = A)) AND (__rank#71 = 1)) +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank#71], [address#47], [id#45 DESC NULLS LAST] +- *(2) 
Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], false, 0 +- Exchange hashpartitioning(address#47, 5), true, [id=#52] +- *(1) Project [id#45, name#46, address#47] +- FileScan parquet [id#45,name#46,address#47,p#53] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string> ``` The desired physical plan: ``` == Physical Plan == *(3) Project [_gen_alias_77#77 AS first#74] +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS
[jira] [Created] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
Frank Yin created SPARK-32059:
---------------------------------

             Summary: Nested Schema Pruning not Working in Window Functions
                 Key: SPARK-32059
                 URL: https://issues.apache.org/jira/browse/SPARK-32059
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Frank Yin

Using tables and data structures in `SchemaPruningSuite.scala`:

```
case class FullName(first: String, middle: String, last: String)
case class Company(name: String, address: String)
case class Employer(id: Int, company: Company)
case class Contact(
  id: Int,
  name: FullName,
  address: String,
  pets: Int,
  friends: Array[FullName] = Array.empty,
  relatives: Map[String, FullName] = Map.empty,
  employer: Employer = null,
  relations: Map[FullName, String] = Map.empty)
case class Department(
  depId: Int,
  depName: String,
  contactId: Int,
  employer: Employer)
```

The query to run:

```
select a.name.first
from (select row_number() over (partition by address order by id desc) as __rank,
             contacts.*
      from contacts) a
where a.name.first = 'A' AND a.__rank = 1
```

The current physical plan:

```
== Physical Plan ==
*(3) Project [name#46.first AS first#74]
+- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND (name#46.first = A)) AND (__rank#71 = 1))
   +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank#71], [address#47], [id#45 DESC NULLS LAST]
      +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], false, 0
         +- Exchange hashpartitioning(address#47, 5), true, [id=#52]
            +- *(1) Project [id#45, name#46, address#47]
               +- FileScan parquet [id#45,name#46,address#47,p#53] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string>
```

The desired physical plan:

```
== Physical Plan ==
*(3) Project [_gen_alias_77#77 AS first#74]
+- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND (_gen_alias_77#77 = A)) AND (__rank#71 = 1))
   +- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank#71], [address#47], [id#45 DESC NULLS LAST]
      +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], false, 0
         +- Exchange hashpartitioning(address#47, 5), true, [id=#52]
            +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, address#47]
               +- FileScan parquet [id#45,name#46,address#47,p#53] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string>
```


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
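The rewrite the reporter wants, projecting `name#46.first` below the Window instead of carrying the whole `name` struct through the shuffle, is the usual nested-schema-pruning idea: extract the one needed field before the wide operation. A toy, non-Spark illustration of the same principle, with hypothetical class names:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class PruningSketch {
    public static final class FullName {
        public final String first, middle, last;
        public FullName(String first, String middle, String last) {
            this.first = first; this.middle = middle; this.last = last;
        }
    }
    public static final class Contact {
        public final int id; public final FullName name; public final String address;
        public Contact(int id, FullName name, String address) {
            this.id = id; this.name = name; this.address = address;
        }
    }

    // Project the single nested field before the expensive sort, mirroring how
    // the desired plan computes name#46.first below the Window/Exchange rather
    // than shuffling the whole struct.
    public static List<String> firstNamesByIdDesc(List<Contact> contacts) {
        return contacts.stream()
            .map(c -> new SimpleEntry<>(c.id, c.name.first))   // prune early
            .sorted(Comparator.comparing(
                (SimpleEntry<Integer, String> e) -> e.getKey()).reversed())
            .map(SimpleEntry::getValue)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Contact> contacts = List.of(
            new Contact(1, new FullName("A", null, "Z"), "addr1"),
            new Contact(2, new FullName("B", null, "Y"), "addr1"));
        System.out.println(firstNamesByIdDesc(contacts)); // [B, A]
    }
}
```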
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142466#comment-17142466 ] Shivaram Venkataraman commented on SPARK-31918: --- Thanks [~hyukjin.kwon]. It looks like there is another problem. From what I saw today, R 4.0.0 cannot load packages that were built with R 3.6.0. Thus when SparkR workers try to start up with the pre-built SparkR package we see a failure. I'm not really sure what is a good way to handle this. Options include - Building the SparkR package using 4.0.0 (need to check if that works with R 3.6) - Copy the package from the driver (where it is usually built) and make the SparkR workers use the package installed on the driver Any other ideas? > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32044) [SS] 2.4 Kafka continuous processing print mislead initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-32044.
-----------------------------------
    Fix Version/s: 2.4.7
       Resolution: Fixed

Issue resolved by pull request 28887
[https://github.com/apache/spark/pull/28887]

> [SS] 2.4 Kafka continuous processing print mislead initial offsets log
> ----------------------------------------------------------------------
>
>                 Key: SPARK-32044
>                 URL: https://issues.apache.org/jira/browse/SPARK-32044
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.6
>            Reporter: Zhongwei Zhu
>            Assignee: Zhongwei Zhu
>            Priority: Trivial
>             Fix For: 2.4.7
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using Structured Streaming in continuous processing mode, a restarted Spark job correctly picks up the offsets from the last epoch in the checkpoint location, but it always prints the log line below:
>
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 34e5b909-f9fe-422a-89c0-081251a68693, runId = 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial offsets: \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
>
> This log is misleading, as Spark did not use these as the initial offsets. It also results in an unnecessary Kafka offset fetch. This is caused by the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
>     case EarliestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
>     case LatestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
>     case SpecificOffsetRangeLimit(p) =>
>       offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
> The code inside the orElse block is always executed, even when start has a value.
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32044) [SS] 2.4 Kafka continuous processing print mislead initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-32044:
-------------------------------------
    Assignee: Zhongwei Zhu

> [SS] 2.4 Kafka continuous processing print mislead initial offsets log
> ----------------------------------------------------------------------
>
>                 Key: SPARK-32044
>                 URL: https://issues.apache.org/jira/browse/SPARK-32044
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.6
>            Reporter: Zhongwei Zhu
>            Assignee: Zhongwei Zhu
>            Priority: Trivial
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using Structured Streaming in continuous processing mode, a restarted Spark job correctly picks up the offsets from the last epoch in the checkpoint location, but it always prints the log line below:
>
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 34e5b909-f9fe-422a-89c0-081251a68693, runId = 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial offsets: \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
>
> This log is misleading, as Spark did not use these as the initial offsets. It also results in an unnecessary Kafka offset fetch. This is caused by the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
>     case EarliestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
>     case LatestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
>     case SpecificOffsetRangeLimit(p) =>
>       offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
> The code inside the orElse block is always executed, even when start has a value.
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
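The root cause the reporter points at is that `java.util.Optional.orElse` evaluates its argument eagerly even when the Optional is present: the `start.orElse { ... }` block above is an ordinary expression, computed before the call. A minimal demonstration; `orElseGet` is the lazy variant one would reach for, though the actual change in PR 28887 may differ:

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class OrElseDemo {
    static final AtomicInteger fetches = new AtomicInteger();

    // Stand-in for the offset fetch (and the "Initial offsets" log) above.
    static String fetchInitialOffsets() {
        fetches.incrementAndGet();
        return "earliest";
    }

    public static void main(String[] args) {
        Optional<String> start = Optional.of("checkpointed");

        // orElse: the argument is evaluated before the call, so the fetch
        // (and the misleading log line) happens even though start is present.
        String eager = start.orElse(fetchInitialOffsets());

        // orElseGet: the supplier runs only when the Optional is empty.
        String lazy = start.orElseGet(OrElseDemo::fetchInitialOffsets);

        System.out.println(eager + ", " + lazy + ", fetches=" + fetches.get());
        // checkpointed, checkpointed, fetches=1
    }
}
```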
[jira] [Issue Comment Deleted] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-32058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32058: -- Comment: was deleted (was: User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28897) > Use Apache Hadoop 3.2.0 dependency by default > - > > Key: SPARK-32058 > URL: https://issues.apache.org/jira/browse/SPARK-32058 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-32058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32058: Assignee: Apache Spark > Use Apache Hadoop 3.2.0 dependency by default > - > > Key: SPARK-32058 > URL: https://issues.apache.org/jira/browse/SPARK-32058 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-32058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142449#comment-17142449 ] Apache Spark commented on SPARK-32058: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28897 > Use Apache Hadoop 3.2.0 dependency by default > - > > Key: SPARK-32058 > URL: https://issues.apache.org/jira/browse/SPARK-32058 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-32058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32058: Assignee: (was: Apache Spark) > Use Apache Hadoop 3.2.0 dependency by default > - > > Key: SPARK-32058 > URL: https://issues.apache.org/jira/browse/SPARK-32058 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-32058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142447#comment-17142447 ] Apache Spark commented on SPARK-32058: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28897 > Use Apache Hadoop 3.2.0 dependency by default > - > > Key: SPARK-32058 > URL: https://issues.apache.org/jira/browse/SPARK-32058 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32058) Use Apache Hadoop 3.2.0 dependency by default
Dongjoon Hyun created SPARK-32058: - Summary: Use Apache Hadoop 3.2.0 dependency by default Key: SPARK-32058 URL: https://issues.apache.org/jira/browse/SPARK-32058 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.1.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation
[ https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142426#comment-17142426 ] Erik Krogen commented on SPARK-32037: - Thanks for the suggestions [~H4ml3t]! * *quarantined* to me indicates that we want to avoid something bad about this spreading to other places (e.g. quarantining some corrupt data to protect other places from consuming it and spreading the corruption), which isn't the case here. * *benched* is fun, but I think not very intuitive unless you're primed with the analogy. I also am a little concerned that it will make people think of benchmarks. "Benched? Did this node fail a benchmark?" * *exiled* is interesting, but I think unhealthy still does a better job of conveying that the node/executor/etc. is doing something wrong I'm not sure about other resource managers, but at least YARN also uses the concept of unhealthy vs. healthy to refer to nodes that are not performing well. One other thing that came to mind for me was "misbehaving", which I think is really what we are describing by "unhealthy", but I think it sounds a little less smooth. > Rename blacklisting feature to avoid language with racist connotation > - > > Key: SPARK-32037 > URL: https://issues.apache.org/jira/browse/SPARK-32037 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Minor > > As per [discussion on the Spark dev > list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], > it will be beneficial to remove references to problematic language that can > alienate potential community members. One such reference is "blacklist". > While it seems to me that there is some valid debate as to whether this term > has racist origins, the cultural connotations are inescapable in today's > world. 
> I've created a separate task, SPARK-32036, to remove references outside of > this feature. Given the large surface area of this feature and the > public-facing UI / configs / etc., more care will need to be taken here. > I'd like to start by opening up debate on what the best replacement name > would be. Reject-/deny-/ignore-/block-list are common replacements for > "blacklist", but I'm not sure that any of them work well for this situation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142417#comment-17142417 ] YoungGyu Chun edited comment on SPARK-31998 at 6/22/20, 9:16 PM: - I will be working on this when v1.0 is out was (Author: younggyuchun): I will be working on this > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142417#comment-17142417 ] YoungGyu Chun commented on SPARK-31998: --- I will be working on this > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED/CLOSED state correctly
Ali Smesseim created SPARK-32057: Summary: SparkExecuteStatementOperation does not set CANCELED/CLOSED state correctly Key: SPARK-32057 URL: https://issues.apache.org/jira/browse/SPARK-32057 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ali Smesseim https://github.com/apache/spark/pull/28671 changed the way cleanup is done in SparkExecuteStatementOperation. In cancel(), cleanup (killing jobs) used to be done after setting the state to CANCELED. Now the order is reversed: jobs are killed first, causing an exception to be thrown inside execute(), so the status of the operation becomes ERROR before being set to CANCELED. cc [~juliuszsompolski] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
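The ordering problem described above can be modeled as a toy state machine. Names and transitions below are illustrative only, not the actual SparkExecuteStatementOperation code: if jobs are killed before the state is set to CANCELED, the execute() thread observes the failure first and records ERROR.

```python
# Toy state machine for the cancel() ordering described in SPARK-32057.
# Illustrative names, not Spark's actual code.
class Operation:
    def __init__(self):
        self.history = ["RUNNING"]

    def _set_state(self, state):
        self.history.append(state)

    def _jobs_killed(self):
        # execute() turns an unexpected failure into ERROR, but only
        # while the operation still looks RUNNING.
        if self.history[-1] == "RUNNING":
            self._set_state("ERROR")

    def cancel(self, kill_jobs_first):
        if kill_jobs_first:   # the order reported as broken
            self._jobs_killed()
            self._set_state("CANCELED")
        else:                 # the original order: mark CANCELED, then clean up
            self._set_state("CANCELED")
            self._jobs_killed()
        return self.history

# The broken order passes through a spurious ERROR state first.
assert Operation().cancel(kill_jobs_first=True) == ["RUNNING", "ERROR", "CANCELED"]
assert Operation().cancel(kill_jobs_first=False) == ["RUNNING", "CANCELED"]
```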
[jira] [Commented] (SPARK-32025) CSV schema inference with boolean & integer
[ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142403#comment-17142403 ] Apache Spark commented on SPARK-32025: -- User 'planga82' has created a pull request for this issue: https://github.com/apache/spark/pull/28896 > CSV schema inference with boolean & integer > > > Key: SPARK-32025 > URL: https://issues.apache.org/jira/browse/SPARK-32025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Brian Wallace >Priority: Major > > I have a dataset consisting of two small files in CSV format. > {code:bash} > $ cat /example/f0.csv > col1 > 8589934592 > $ cat /example/f1.csv > col1 > 4320 > true > {code} > > When I try and load this in (py)spark and infer schema, my expectation is > that the column is inferred to be a string. However, it is inferred as a > boolean: > {code:python} > spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, > multiLine=True).show() > ++ > |col1| > ++ > |null| > |true| > |null| > ++ > {code} > Note that this seems to work correctly if multiLine is set to False (although > we need to set it to True as this column may indeed span multiple lines in > general). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32025) CSV schema inference with boolean & integer
[ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32025: Assignee: Apache Spark > CSV schema inference with boolean & integer > > > Key: SPARK-32025 > URL: https://issues.apache.org/jira/browse/SPARK-32025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Brian Wallace >Assignee: Apache Spark >Priority: Major > > I have a dataset consisting of two small files in CSV format. > {code:bash} > $ cat /example/f0.csv > col1 > 8589934592 > $ cat /example/f1.csv > col1 > 4320 > true > {code} > > When I try and load this in (py)spark and infer schema, my expectation is > that the column is inferred to be a string. However, it is inferred as a > boolean: > {code:python} > spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, > multiLine=True).show() > ++ > |col1| > ++ > |null| > |true| > |null| > ++ > {code} > Note that this seems to work correctly if multiLine is set to False (although > we need to set it to True as this column may indeed span multiple lines in > general). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32025) CSV schema inference with boolean & integer
[ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32025: Assignee: (was: Apache Spark) > CSV schema inference with boolean & integer > > > Key: SPARK-32025 > URL: https://issues.apache.org/jira/browse/SPARK-32025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Brian Wallace >Priority: Major > > I have a dataset consisting of two small files in CSV format. > {code:bash} > $ cat /example/f0.csv > col1 > 8589934592 > $ cat /example/f1.csv > col1 > 4320 > true > {code} > > When I try and load this in (py)spark and infer schema, my expectation is > that the column is inferred to be a string. However, it is inferred as a > boolean: > {code:python} > spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, > multiLine=True).show() > ++ > |col1| > ++ > |null| > |true| > |null| > ++ > {code} > Note that this seems to work correctly if multiLine is set to False (although > we need to set it to True as this column may indeed span multiple lines in > general). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32025) CSV schema inference with boolean & integer
[ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142402#comment-17142402 ] Apache Spark commented on SPARK-32025: -- User 'planga82' has created a pull request for this issue: https://github.com/apache/spark/pull/28896 > CSV schema inference with boolean & integer > > > Key: SPARK-32025 > URL: https://issues.apache.org/jira/browse/SPARK-32025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Brian Wallace >Priority: Major > > I have a dataset consisting of two small files in CSV format. > {code:bash} > $ cat /example/f0.csv > col1 > 8589934592 > $ cat /example/f1.csv > col1 > 4320 > true > {code} > > When I try and load this in (py)spark and infer schema, my expectation is > that the column is inferred to be a string. However, it is inferred as a > boolean: > {code:python} > spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, > multiLine=True).show() > ++ > |col1| > ++ > |null| > |true| > |null| > ++ > {code} > Note that this seems to work correctly if multiLine is set to False (although > we need to set it to True as this column may indeed span multiple lines in > general). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
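Spark's actual CSV inference is more involved (and the reported bug appears specific to multiLine parsing), but the widening rule the reporter expects can be sketched in plain Python with assumed helper names: per-value type inference followed by a reduce that falls back to string when types disagree.

```python
from functools import reduce

# Toy per-value inference, not Spark's code: incompatible inferred types
# should widen the column to string rather than stay boolean.
def infer_value_type(v: str) -> str:
    if v.lower() in ("true", "false"):
        return "boolean"
    try:
        int(v)
        return "integer"
    except ValueError:
        return "string"

def widen(t1: str, t2: str) -> str:
    # Identical types are kept; anything incompatible falls back to string.
    return t1 if t1 == t2 else "string"

values = ["8589934592", "4320", "true"]  # the values across the two files
assert reduce(widen, map(infer_value_type, values)) == "string"
```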
[jira] [Commented] (SPARK-32050) GBTClassifier not working with OnevsRest
[ https://issues.apache.org/jira/browse/SPARK-32050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142387#comment-17142387 ] L. C. Hsieh commented on SPARK-32050: - I think this was fixed at SPARK-27007. > GBTClassifier not working with OnevsRest > > > Key: SPARK-32050 > URL: https://issues.apache.org/jira/browse/SPARK-32050 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 > Environment: spark 2.4.0 >Reporter: Raghuvarran V H >Priority: Minor > > I am trying to use GBT classifier for multi class classification using > OnevsRest > > {code:java} > from pyspark.ml.classification import > MultilayerPerceptronClassifier,OneVsRest,GBTClassifier > from pyspark.ml import Pipeline,PipelineModel > lr = GBTClassifier(featuresCol='features', labelCol='label', > predictionCol='prediction', maxDepth=5, > > maxBins=32,minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, > cacheNodeIds=False,checkpointInterval=10, lossType='logistic', > maxIter=20,stepSize=0.1, seed=None,subsamplingRate=1.0, > featureSubsetStrategy='auto') > classifier = OneVsRest(featuresCol='features', labelCol='label', > predictionCol='prediction', classifier=lr, weightCol=None,parallelism=1) > pipeline = Pipeline(stages=[str_indxr,ohe,vecAssembler,normalizer,classifier]) > model = pipeline.fit(train_data) > {code} > > > When I try this I get this error: > /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/ml/classification.py > in _fit(self, dataset) > 1800 classifier = self.getClassifier() > 1801 assert isinstance(classifier, HasRawPredictionCol),\ > -> 1802 "Classifier %s doesn't extend from HasRawPredictionCol." % > type(classifier) > 1803 > 1804 numClasses = int(dataset.agg(\{labelCol: > "max"}).head()["max("+labelCol+")"]) + 1 > AssertionError: Classifier > doesn't extend from HasRawPredictionCol. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
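The failing assertion above is a capability check: OneVsRest requires the wrapped classifier to expose a raw-prediction column, which SPARK-27007 added to GBTClassifier. A minimal sketch with stand-in classes (not pyspark's actual classes) shows the shape of that check:

```python
# Stand-in classes modeling the OneVsRest capability check from the report.
# These are NOT pyspark's classes; they only mirror the isinstance pattern.
class HasRawPredictionCol:
    pass

class LogisticRegressionStandIn(HasRawPredictionCol):
    pass

class OldGBTClassifierStandIn:  # pre-SPARK-27007: no raw-prediction support
    pass

def validate(classifier):
    assert isinstance(classifier, HasRawPredictionCol), \
        "Classifier %s doesn't extend from HasRawPredictionCol." % type(classifier)

validate(LogisticRegressionStandIn())    # accepted
try:
    validate(OldGBTClassifierStandIn())  # rejected, as in the reported error
    raised = False
except AssertionError:
    raised = True
assert raised
```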
[jira] [Updated] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-32056: -- Priority: Minor (was: Major) > Repartition by key should support partition coalesce for AQE > > > Key: SPARK-32056 > URL: https://issues.apache.org/jira/browse/SPARK-32056 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark release 3.0.0 >Reporter: koert kuipers >Priority: Minor > > when adaptive query execution is enabled the following expression should > support coalescing of partitions: > {code:java} > dataframe.repartition(col("somecolumn")) {code} > currently it does not because it simply calls the repartition implementation > where number of partitions is specified: > {code:java} > def repartition(partitionExprs: Column*): Dataset[T] = { > repartition(sparkSession.sessionState.conf.numShufflePartitions, > partitionExprs: _*) > }{code} > and repartition with the number of partitions specified does not allow for > coalescing of partitions (since this breaks the user's expectation that it > will have the number of partitions specified). 
> for more context see the discussion here: > [https://github.com/apache/spark/pull/27986] > a simple test to confirm that repartition by key does not support coalescing > of partitions can be added in AdaptiveQueryExecSuite like this (it currently > fails): > {code:java} > test("SPARK-32056 repartition has less partitions for small data when > adaptiveExecutionEnabled") { > Seq(true, false).foreach { enableAQE => > withSQLConf( > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> enableAQE.toString, > SQLConf.SHUFFLE_PARTITIONS.key -> "50", > SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "50", > SQLConf.SHUFFLE_PARTITIONS.key -> "50") { > val partitionsNum = (1 to 10).toDF.repartition($"value") > .rdd.collectPartitions().length > if (enableAQE) { > assert(partitionsNum < 50) > } else { > assert(partitionsNum === 50) > } > } > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32056) Repartition by key should support partition coalesce for AQE
[ https://issues.apache.org/jira/browse/SPARK-32056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-32056: -- Description: when adaptive query execution is enabled the following expression should support coalescing of partitions: {code:java} dataframe.repartition(col("somecolumn")) {code} currently it does not because it simply calls the repartition implementation where number of partitions is specified: {code:java} def repartition(partitionExprs: Column*): Dataset[T] = { repartition(sparkSession.sessionState.conf.numShufflePartitions, partitionExprs: _*) }{code} and repartition with the number of partitions specified does now allow for coalescing of partitions (since this breaks the user's expectation that it will have the number of partitions specified). for more context see the discussion here: [https://github.com/apache/spark/pull/27986] a simple test to confirm that repartition by key does not support coalescing of partitions can be added in AdaptiveQueryExecSuite like this (it currently fails): {code:java} test("SPARK-32056 repartition has less partitions for small data when adaptiveExecutionEnabled") { Seq(true, false).foreach { enableAQE => withSQLConf( SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> enableAQE.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "50", SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "50", SQLConf.SHUFFLE_PARTITIONS.key -> "50") { val partitionsNum = (1 to 10).toDF.repartition($"value") .rdd.collectPartitions().length if (enableAQE) { assert(partitionsNum < 50) } else { assert(partitionsNum === 50) } } } } {code} was: when adaptive query execution is enabled the following expression should support coalescing of partitions: {code:java} dataframe.repartition(col("somecolumn")) {code} currently it does not because it simply calls the repartition implementation where number of partitions is specified: {code:java} def repartition(partitionExprs: Column*): Dataset[T] = { 
repartition(sparkSession.sessionState.conf.numShufflePartitions, partitionExprs: _*) }{code} and repartition with the number of partitions specified does now allow for coalescing of partitions (since this breaks the user's expectation that it will have the number of partitions specified). for more context see the discussion here: [https://github.com/apache/spark/pull/27986] a simple test to confirm that repartition by key does not support coalescing of partitions can be added in AdaptiveQueryExecSuite like this (it currently fails): {code:java} test("SPARK-? repartition has less partitions for small data when adaptiveExecutionEnabled") { Seq(true, false).foreach { enableAQE => withSQLConf( SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> enableAQE.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "50", SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "50", SQLConf.SHUFFLE_PARTITIONS.key -> "50") { val partitionsNum = (1 to 10).toDF.repartition($"value") .rdd.collectPartitions().length if (enableAQE) { assert(partitionsNum < 50) } else { assert(partitionsNum === 50) } } } } {code} > Repartition by key should support partition coalesce for AQE > > > Key: SPARK-32056 > URL: https://issues.apache.org/jira/browse/SPARK-32056 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark release 3.0.0 >Reporter: koert kuipers >Priority: Major > > when adaptive query execution is enabled the following expression should > support coalescing of partitions: > {code:java} > dataframe.repartition(col("somecolumn")) {code} > currently it does not because it simply calls the repartition implementation > where number of partitions is specified: > {code:java} > def repartition(partitionExprs: Column*): Dataset[T] = { > repartition(sparkSession.sessionState.conf.numShufflePartitions, > partitionExprs: _*) > }{code} > and repartition with the number of partitions specified does now allow for > coalescing of partitions (since this breaks the user's 
expectation that it > will have the number of partitions specified). > for more context see the discussion here: > [https://github.com/apache/spark/pull/27986] > a simple test to confirm that repartition by key does not support coalescing > of partitions can be added in AdaptiveQueryExecSuite like this (it currently > fails): > {code:java} > test("SPARK-32056 repartition has less partitions for small data when > adaptiveExecutionEnabled") { > Seq(true, false).foreach { enableAQE => > withSQLConf(
[jira] [Created] (SPARK-32056) Repartition by key should support partition coalesce for AQE
koert kuipers created SPARK-32056: - Summary: Repartition by key should support partition coalesce for AQE Key: SPARK-32056 URL: https://issues.apache.org/jira/browse/SPARK-32056 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: spark release 3.0.0 Reporter: koert kuipers when adaptive query execution is enabled the following expression should support coalescing of partitions: {code:java} dataframe.repartition(col("somecolumn")) {code} currently it does not because it simply calls the repartition implementation where number of partitions is specified: {code:java} def repartition(partitionExprs: Column*): Dataset[T] = { repartition(sparkSession.sessionState.conf.numShufflePartitions, partitionExprs: _*) }{code} and repartition with the number of partitions specified does not allow for coalescing of partitions (since this breaks the user's expectation that it will have the number of partitions specified). for more context see the discussion here: [https://github.com/apache/spark/pull/27986] a simple test to confirm that repartition by key does not support coalescing of partitions can be added in AdaptiveQueryExecSuite like this (it currently fails): {code:java} test("SPARK-? repartition has less partitions for small data when adaptiveExecutionEnabled") { Seq(true, false).foreach { enableAQE => withSQLConf( SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> enableAQE.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "50", SQLConf.COALESCE_PARTITIONS_INITIAL_PARTITION_NUM.key -> "50", SQLConf.SHUFFLE_PARTITIONS.key -> "50") { val partitionsNum = (1 to 10).toDF.repartition($"value") .rdd.collectPartitions().length if (enableAQE) { assert(partitionsNum < 50) } else { assert(partitionsNum === 50) } } } } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
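The distinction the report draws can be summarized in a toy model (illustrative only, not Spark's planner): forwarding `repartition(col)` to the numPartitions overload makes the count look user-specified, which pins it and disables AQE coalescing; the requested behavior is to leave the count open when the user never specified one.

```python
# Toy model of the requested AQE behavior for repartition-by-key.
# Names and the returned dict are illustrative, not Spark internals.
def plan_shuffle(num_partitions=None, default_shuffle_partitions=50):
    user_specified = num_partitions is not None
    return {
        "partitions": num_partitions if user_specified else default_shuffle_partitions,
        # Requested behavior: AQE may coalesce only when the user did not
        # pin an explicit partition count.
        "aqe_may_coalesce": not user_specified,
    }

assert plan_shuffle() == {"partitions": 50, "aqe_may_coalesce": True}
assert plan_shuffle(200) == {"partitions": 200, "aqe_may_coalesce": False}
```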
[jira] [Updated] (SPARK-32044) [SS] 2.4 Kafka continuous processing print mislead initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-32044: - Summary: [SS] 2.4 Kafka continuous processing print mislead initial offsets log (was: [SS] 2.4 Kakfa continuous processing print mislead initial offsets log ) > [SS] 2.4 Kafka continuous processing print mislead initial offsets log > --- > > Key: SPARK-32044 > URL: https://issues.apache.org/jira/browse/SPARK-32044 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.6 >Reporter: Zhongwei Zhu >Priority: Trivial > Original Estimate: 24h > Remaining Estimate: 24h > > When using structured streaming in continuous processing mode, after restart > spark job, spark job can correctly pick up offsets in checkpoint location > from last epoch. But it always print out below log: > 20/06/12 00:58:09 INFO [stream execution thread for [id = > 34e5b909-f9fe-422a-89c0-081251a68693, runId = > 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: > Initial offsets: > \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}} > This log is misleading as spark didn't use this one as initial offsets. Also, > it results in unnecessary kafka offset fetch. This is caused by below code in > KafkaContinuousReader > {code:java} > offset = start.orElse { > val offsets = initialOffsets match { > case EarliestOffsetRangeLimit => > KafkaSourceOffset(offsetReader.fetchEarliestOffsets()) > case LatestOffsetRangeLimit => > KafkaSourceOffset(offsetReader.fetchLatestOffsets(None)) > case SpecificOffsetRangeLimit(p) => > offsetReader.fetchSpecificOffsets(p, reportDataLoss) > } > logInfo(s"Initial offsets: $offsets") > offsets > } > {code} > The code inside orElse block is always executed even when start has value. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
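A plausible mechanism for "the code inside orElse block is always executed": assuming `start` in the Scala snippet is a `java.util.Optional`, its `orElse` argument is evaluated eagerly by the caller, unlike the supplier passed to `orElseGet`. The same eager-vs-lazy distinction sketched in plain Python:

```python
# Eager vs. lazy fallback evaluation, mirroring Optional.orElse vs.
# Optional.orElseGet. Function names and offsets are illustrative.
fetch_calls = 0

def fetch_initial_offsets():
    global fetch_calls
    fetch_calls += 1          # stands in for the offset fetch and log line
    return {"kafka_topic": {"0": 0}}

def or_else(value, fallback):      # eager: fallback computed before the call
    return value if value is not None else fallback

def or_else_get(value, supplier):  # lazy: supplier runs only when needed
    return value if value is not None else supplier()

start = {"kafka_topic": {"0": 51629106}}  # offsets recovered from checkpoint

assert or_else(start, fetch_initial_offsets()) is start
assert fetch_calls == 1   # fetched (and logged) despite start being present

assert or_else_get(start, fetch_initial_offsets) is start
assert fetch_calls == 1   # no extra fetch with the lazy form
```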
[jira] [Updated] (SPARK-32044) [SS] 2.4 Kakfa continuous processing print mislead initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-32044: - Summary: [SS] 2.4 Kakfa continuous processing print mislead initial offsets log (was: [SS] Kakfa continuous processing print mislead initial offsets log ) > [SS] 2.4 Kakfa continuous processing print mislead initial offsets log > --- > > Key: SPARK-32044 > URL: https://issues.apache.org/jira/browse/SPARK-32044 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.6 >Reporter: Zhongwei Zhu >Priority: Trivial > Original Estimate: 24h > Remaining Estimate: 24h > > When using structured streaming in continuous processing mode, after restart > spark job, spark job can correctly pick up offsets in checkpoint location > from last epoch. But it always print out below log: > 20/06/12 00:58:09 INFO [stream execution thread for [id = > 34e5b909-f9fe-422a-89c0-081251a68693, runId = > 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: > Initial offsets: > \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}} > This log is misleading as spark didn't use this one as initial offsets. Also, > it results in unnecessary kafka offset fetch. This is caused by below code in > KafkaContinuousReader > {code:java} > offset = start.orElse { > val offsets = initialOffsets match { > case EarliestOffsetRangeLimit => > KafkaSourceOffset(offsetReader.fetchEarliestOffsets()) > case LatestOffsetRangeLimit => > KafkaSourceOffset(offsetReader.fetchLatestOffsets(None)) > case SpecificOffsetRangeLimit(p) => > offsetReader.fetchSpecificOffsets(p, reportDataLoss) > } > logInfo(s"Initial offsets: $offsets") > offsets > } > {code} > The code inside orElse block is always executed even when start has value. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32055) Unify getReader and getReaderForRange in ShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-32055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142276#comment-17142276 ] Apache Spark commented on SPARK-32055: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28895 > Unify getReader and getReaderForRange in ShuffleManager > --- > > Key: SPARK-32055 > URL: https://issues.apache.org/jira/browse/SPARK-32055 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Unify getReader and getReaderForRange in ShuffleManager in order to simplify > the implementation and ease the code maintenance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32055) Unify getReader and getReaderForRange in ShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-32055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32055: Assignee: (was: Apache Spark) > Unify getReader and getReaderForRange in ShuffleManager > --- > > Key: SPARK-32055 > URL: https://issues.apache.org/jira/browse/SPARK-32055 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Unify getReader and getReaderForRange in ShuffleManager in order to simplify > the implementation and ease the code maintenance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32055) Unify getReader and getReaderForRange in ShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-32055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32055: Assignee: Apache Spark > Unify getReader and getReaderForRange in ShuffleManager > --- > > Key: SPARK-32055 > URL: https://issues.apache.org/jira/browse/SPARK-32055 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > Unify getReader and getReaderForRange in ShuffleManager in order to simplify > the implementation and ease the code maintenance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32038) Regression in handling NaN values in COUNT(DISTINCT)
[ https://issues.apache.org/jira/browse/SPARK-32038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142265#comment-17142265 ] Mithun Radhakrishnan commented on SPARK-32038: -- [~viirya], [~dongjoon], thank you. I'm amazed at the quick resolution of this bug. > Regression in handling NaN values in COUNT(DISTINCT) > > > Key: SPARK-32038 > URL: https://issues.apache.org/jira/browse/SPARK-32038 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mithun Radhakrishnan >Assignee: L. C. Hsieh >Priority: Blocker > Labels: correctness > Fix For: 3.0.1, 3.1.0 > > > There seems to be a regression in Spark 3.0.0, with regard to how {{NaN}} > values are normalized/handled in {{COUNT(DISTINCT ...)}}. Here is an > illustration: > {code:scala} > case class Test( uid:String, score:Float) > val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f81) > val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fff) > val rows = Seq( > Test("mithunr", Float.NaN), > Test("mithunr", POS_NAN_1), > Test("mithunr", POS_NAN_2), > Test("abellina", 1.0f), > Test("abellina", 2.0f) > ).toDF.createOrReplaceTempView("mytable") > spark.sql(" select uid, count(distinct score) from mytable group by 1 order > by 1 asc ").show > {code} > Here are the results under Spark 3.0.0: > {code:java|title=Spark 3.0.0 (single aggregation)} > ++-+ > | uid|count(DISTINCT score)| > ++-+ > |abellina|2| > | mithunr|3| > ++-+ > {code} > Note that the count against {{mithunr}} is {{3}}, accounting for each > distinct value for {{NaN}}. 
> The right results are returned when another aggregation is added to the GBY: > {code:scala|title=Spark 3.0.0 (multiple aggregations)} > scala> spark.sql(" select uid, count(distinct score), max(score) from mytable > group by 1 order by 1 asc ").show > ++-+--+ > | uid|count(DISTINCT score)|max(score)| > ++-+--+ > |abellina|2| 2.0| > | mithunr|1| NaN| > ++-+--+ > {code} > Also, note that Spark 2.4.6 normalizes the {{DISTINCT}} expression correctly: > {code:scala|title=Spark 2.4.6} > scala> spark.sql(" select uid, count(distinct score) from mytable group by 1 > order by 1 asc ").show > ++-+ > | uid|count(DISTINCT score)| > ++-+ > |abellina|2| > | mithunr|1| > ++-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
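Why distinct NaN bit patterns inflate the count can be shown without Spark: the three inputs are all NaN yet differ bitwise, so a distinct keyed on raw bits sees several values, while normalizing every NaN to one canonical pattern restores a count of 1. The payloads below are illustrative, not the (truncated) constants from the report.

```python
import math
import struct

# Sketch (not Spark's code) of bitwise-distinct NaNs vs. normalized NaNs.
def f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

bit_patterns = [0x7FC00000, 0x7F800001]   # two different NaN encodings
assert all(math.isnan(f32(b)) for b in bit_patterns)

# Keyed on raw bits, the two NaNs count as two distinct values.
assert len(set(bit_patterns)) == 2

# Normalizing every NaN to one canonical pattern yields a single value.
CANONICAL_NAN = 0x7FC00000
normalized = {CANONICAL_NAN if math.isnan(f32(b)) else b for b in bit_patterns}
assert len(normalized) == 1
```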
[jira] [Created] (SPARK-32055) Unify getReader and getReaderForRange in ShuffleManager
wuyi created SPARK-32055: Summary: Unify getReader and getReaderForRange in ShuffleManager Key: SPARK-32055 URL: https://issues.apache.org/jira/browse/SPARK-32055 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: wuyi Unify getReader and getReaderForRange in ShuffleManager to simplify the implementation and ease code maintenance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have
[ https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142234#comment-17142234 ] Gabor Somogyi commented on SPARK-31995: --- At first glance I don't think this is a Spark issue but an HDFS one. Adding "sleep_for_a_while" on the Spark side would just hide the original problem. Please search for "Unable to close file because the last block does not have enough number of replicas"; there are a couple of hits suggesting possible workarounds. I've taken a look at the JIRAs on the Hadoop side and, as far as I've seen, this has been resolved in 2.7.4+. Could you reproduce the issue with 3.0? > Spark Structured Streaming checkpointFileManager ERROR when > HDFS.DFSOutputStream.completeFile with IOException unable to close file > because the last block does not have enough number of replicas > - > > Key: SPARK-31995 > URL: https://issues.apache.org/jira/browse/SPARK-31995 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop > Hadoop 2.7.3 - YARN cluster > delta-core_ 2.11:0.6.1 > >Reporter: Jim Huang >Priority: Major > > I am using Spark 2.4.5's Spark Structured Streaming with a Delta table (0.6.1) > as the sink, running in a YARN cluster on Hadoop 2.7.3. I had been using > Spark Structured Streaming for several months in this runtime environment > until this new corner case left my Spark Structured Streaming job in a > partially working state. > > I have included the ERROR message and stack trace. I did a quick search > using the string "MicroBatchExecution: Query terminated with error" but did > not find any existing Jira that looks like my stack trace. > > Based on a naive look at this error message and stack trace, is it possible > that Spark's CheckpointFileManager could handle this HDFS exception better > by simply waiting a little longer for HDFS's pipeline to complete the > replicas? 
> > Being new to this code, where can I find the configuration parameter that > sets the replica counts for the `streaming.HDFSMetadataLog`? I am just > trying to understand if there are already some holistic configuration tuning > variable(s) the current code provides to handle this IOException > more gracefully? Hopefully experts can provide some pointers or directions. > > {code:java} > 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = > yarn-job-id-redacted, runId = run-id-redacted] terminated with error > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. > at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511) > at > org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472) > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) > at > org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557) > at >
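Regarding the workaround question above: the search hits Gabor mentions typically point at the HDFS client's retry setting for exactly this close-time replica check, not at anything in Spark's `HDFSMetadataLog`. A hedged sketch of the commonly suggested hdfs-site.xml tuning (the property name is a real HDFS client setting; the value shown is illustrative, so verify it against your Hadoop version's hdfs-default.xml):

```xml
<!-- hdfs-site.xml (client side): retry longer before giving up with
     "Unable to close file because the last block does not have enough
     number of replicas". The default retry count is 5. -->
<property>
  <name>dfs.client.block.write.locateFollowingBlock.retries</name>
  <value>10</value>
</property>
```

Note that bumping retries only papers over a slow write pipeline; upgrading the HDFS cluster past 2.7.4, as suggested in the comment, addresses the underlying issue.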
[jira] [Created] (SPARK-32054) Flaky test: org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.Fallback Parquet V2 to V1
Gabor Somogyi created SPARK-32054: - Summary: Flaky test: org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.Fallback Parquet V2 to V1 Key: SPARK-32054 URL: https://issues.apache.org/jira/browse/SPARK-32054 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.1.0 Reporter: Gabor Somogyi https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124364/testReport/org.apache.spark.sql.connector/FileDataSourceV2FallBackSuite/Fallback_Parquet_V2_to_V1/ {code:java} Error Message org.scalatest.exceptions.TestFailedException: ArrayBuffer((collect,Relation[id#387495L] parquet ), (save,InsertIntoHadoopFsRelationCommand file:/home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776, false, Parquet, Map(path -> /home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776), ErrorIfExists, [id] +- Range (0, 10, step=1, splits=Some(2)) )) had length 2 instead of expected length 1 Stacktrace sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: ArrayBuffer((collect,Relation[id#387495L] parquet ), (save,InsertIntoHadoopFsRelationCommand file:/home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776, false, Parquet, Map(path -> /home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776), ErrorIfExists, [id] +- Range (0, 10, step=1, splits=Some(2)) )) had length 2 instead of expected length 1 at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$22(FileDataSourceV2FallBackSuite.scala:180) at 
org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$22$adapted(FileDataSourceV2FallBackSuite.scala:176) at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69) at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66) at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:34) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$21(FileDataSourceV2FallBackSuite.scala:176) at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(FileDataSourceV2FallBackSuite.scala:85) at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246) at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.withSQLConf(FileDataSourceV2FallBackSuite.scala:85) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$20(FileDataSourceV2FallBackSuite.scala:158) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$20$adapted(FileDataSourceV2FallBackSuite.scala:157) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$19(FileDataSourceV2FallBackSuite.scala:157) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at 
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157) at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59) at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) at
[jira] [Updated] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31918: - Priority: Blocker (was: Major) > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29773) Unable to process empty ORC files in Hive Table using Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ermakov resolved SPARK-29773. --- Fix Version/s: 2.4.4 Resolution: Fixed > Unable to process empty ORC files in Hive Table using Spark SQL > --- > > Key: SPARK-29773 > URL: https://issues.apache.org/jira/browse/SPARK-29773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: Centos 7, Spark 2.3.1, Hive 2.3.0 >Reporter: Alexander Ermakov >Priority: Major > Fix For: 2.4.4 > > > Unable to process empty ORC files in a Hive table using Spark SQL. It seems > to be a problem with the class > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits() > Stack trace: > {code:java} > 19/10/30 22:29:54 ERROR SparkSQLDriver: Failed in [select distinct > _tech_load_dt from dl_raw.tpaccsieee_ut_data_address] > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange hashpartitioning(_tech_load_dt#1374, 200) > +- *(1) HashAggregate(keys=[_tech_load_dt#1374], functions=[], > output=[_tech_load_dt#1374]) >+- HiveTableScan [_tech_load_dt#1374], HiveTableRelation > `dl_raw`.`tpaccsieee_ut_data_address`, > org.apache.hadoop.hive.ql.io.orc.OrcSerde, [address#1307, address_9zp#1308, > address_adm#1309, address_md#1310, adress_doc#1311, building#1312, > change_date_addr_el#1313, change_date_okato#1314, change_date_окато#1315, > city#1316, city_id#1317, cnv_cont_id#1318, code_intercity#1319, > code_kladr#1320, code_plan1#1321, date_act#1322, date_change#1323, > date_prz_incorrect_code_kladr#1324, date_record#1325, district#1326, > district_id#1327, etaj#1328, e_plan#1329, fax#1330, ... 
44 more fields] > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:324) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:122) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364) > at > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
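A commonly suggested mitigation for empty ORC splits on the Hive reader path is to route the scan through Spark's native ORC reader instead of org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, which tolerates zero-byte files. This is hedged advice: the configuration keys below are real Spark SQL settings, but whether they resolve this particular table's failure depends on the table layout and Spark version.

```sql
-- Read Hive ORC tables through Spark's native ORC reader
-- (settings shown are the commonly suggested values).
SET spark.sql.hive.convertMetastoreOrc=true;
SET spark.sql.orc.impl=native;
SELECT DISTINCT _tech_load_dt FROM dl_raw.tpaccsieee_ut_data_address;
```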
[jira] [Commented] (SPARK-29773) Unable to process empty ORC files in Hive Table using Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142177#comment-17142177 ] Alexander Ermakov commented on SPARK-29773: --- This issue has been resolved for Spark 2.4.4 > Unable to process empty ORC files in Hive Table using Spark SQL > --- > > Key: SPARK-29773 > URL: https://issues.apache.org/jira/browse/SPARK-29773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: Centos 7, Spark 2.3.1, Hive 2.3.0 >Reporter: Alexander Ermakov >Priority: Major >
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142127#comment-17142127 ] Hyukjin Kwon commented on SPARK-31918: -- Just to share what I investigated: it seems the problem relates to {{processClosure}} via {{cleanClosure}} in SparkR. From my observation, there looks to be a problem [when the new environment is set to a function|https://github.com/apache/spark/blob/master/R/pkg/R/utils.R#L601], especially one that includes generic S4 functions. So, for example, if you skip it with the fix below, the CRAN check passes with the current master branch in my local environment: {code:java} diff --git a/R/pkg/R/utils.R b/R/pkg/R/utils.R index 65db9c21d9d..60cad588f5e 100644 --- a/R/pkg/R/utils.R +++ b/R/pkg/R/utils.R @@ -529,7 +529,9 @@ processClosure <- function(node, oldEnv, defVars, checkedFuncs, newEnv) { # Namespaces other than "SparkR" will not be searched. if (!isNamespace(func.env) || (getNamespaceName(func.env) == "SparkR" && - !(nodeChar %in% getNamespaceExports("SparkR")))) { + !(nodeChar %in% getNamespaceExports("SparkR")) && + # Skip all generics under SparkR - R 4.0.0 looks to have an issue. + !isGeneric(nodeChar, func.env))) { {code} {code:java} * checking re-building of vignette outputs ... OK {code} For a minimal reproducer, with this diff: {code:java} diff --git a/R/pkg/R/RDD.R b/R/pkg/R/RDD.R index 7a1d157bb8a..89250c37319 100644 --- a/R/pkg/R/RDD.R +++ b/R/pkg/R/RDD.R @@ -487,6 +487,7 @@ setMethod("lapply", func <- function(partIndex, part) { lapply(part, FUN) } +print(SparkR:::cleanClosure(func)(1, 2)) lapplyPartitionsWithIndex(X, func) }) {code} run: {code:java} createDataFrame(lapply(seq(100), function (e) list(value=e))) {code} When {{lapply}} is called against the RDD at {{createDataFrame}}, the cleaned closure's environment has SparkR's lapply as an S4 method, and it leads to an error such as {{attempt to bind a variable to R_UnboundValue}}. 
Hopefully this is the cause of the issue happening here, and not an issue in my env. cc [~felixcheung], [~dongjoon] FYI. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Major > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142073#comment-17142073 ] Hyukjin Kwon edited comment on SPARK-31918 at 6/22/20, 2:10 PM: It affects Spark 3.0 too, and seems failing with a different message in my local: {code} * creating vignettes ... ERROR --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown Warning in engine$weave(file, quiet = quiet, encoding = enc) : Pandoc (>= 1.12.3) and/or pandoc-citeproc not available. Falling back to R Markdown v1. Attaching package: 'SparkR' The following objects are masked from 'package:stats': cov, filter, lag, na.omit, predict, sd, var, window The following objects are masked from 'package:base': as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind, sample, startsWith, subset, summary, transform, union Picked up _JAVA_OPTIONS: -XX:-UsePerfData Picked up _JAVA_OPTIONS: -XX:-UsePerfData 20/06/22 15:07:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). [Stage 0:> (0 + 1) / 1] 20/06/22 15:07:43 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: R unexpectedly exited. R worker produced errors: Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue {code} Assuming the errors from R execution itself, the root cause might be same. was (Author: hyukjin.kwon): It affects Spark 3.0 too, and seems failing with a different message in my local: {code} * creating vignettes ... ERROR --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown Warning in engine$weave(file, quiet = quiet, encoding = enc) : Pandoc (>= 1.12.3) and/or pandoc-citeproc not available. Falling back to R Markdown v1. 
Attaching package: 'SparkR' The following objects are masked from 'package:stats': cov, filter, lag, na.omit, predict, sd, var, window The following objects are masked from 'package:base': as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind, sample, startsWith, subset, summary, transform, union Picked up _JAVA_OPTIONS: -XX:-UsePerfData Picked up _JAVA_OPTIONS: -XX:-UsePerfData 20/06/22 15:07:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). [Stage 0:> (0 + 1) / 1] 20/06/22 15:07:43 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: R unexpectedly exited. {code} Assuming the errors from R execution itself, the root cause might be same. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Major > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142073#comment-17142073 ] Hyukjin Kwon commented on SPARK-31918: -- It affects Spark 3.0 too, and seems to be failing with a different message in my local environment: {code} * creating vignettes ... ERROR --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown Warning in engine$weave(file, quiet = quiet, encoding = enc) : Pandoc (>= 1.12.3) and/or pandoc-citeproc not available. Falling back to R Markdown v1. Attaching package: 'SparkR' The following objects are masked from 'package:stats': cov, filter, lag, na.omit, predict, sd, var, window The following objects are masked from 'package:base': as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind, sample, startsWith, subset, summary, transform, union Picked up _JAVA_OPTIONS: -XX:-UsePerfData Picked up _JAVA_OPTIONS: -XX:-UsePerfData 20/06/22 15:07:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). [Stage 0:> (0 + 1) / 1] 20/06/22 15:07:43 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: R unexpectedly exited. {code} Assuming the errors come from the R execution itself, the root cause might be the same. > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Major > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. 
> However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31918: - Affects Version/s: 3.0.0 > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Priority: Major > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org