[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022719#comment-17022719 ] Tobias Hermann commented on SPARK-30421: [~dongjoon] No, that's different. To make it equivalent, you'd have to change your example to the following: {quote}import pandas as pd df = pd.DataFrame(data=\{'foo': [0, 1], 'bar': ["a", "b"]}) df2 = df.drop(columns=["bar"]) df2[df2["bar"] == "a"] {quote} And that correctly results in {quote}KeyError: 'bar' {quote} In Spark, however, the following code works without error: {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") val df2 = df.drop("bar") df2.where($"bar" === "a").show {quote} > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
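A quick way to see the asymmetry from the Scala side: resolving the column through the result DataFrame itself (instead of the unbound {{$"bar"}}) surfaces the missing column immediately. A minimal sketch, assuming a spark-shell session (so the implicits for {{toDF}} and {{$}} are in scope); the exact exception message may vary by version:
{code:scala}
val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
val df2 = df.drop("bar")

// Resolving against df2's own schema fails as expected, because "bar" is
// no longer among its output columns:
// df2("bar")  // throws org.apache.spark.sql.AnalysisException

// The unbound $"bar" is only resolved while analyzing the whole plan, where
// it can still bind to the attribute from before the drop, so this runs
// without an error (the behaviour reported in this ticket):
df2.where($"bar" === "a").show()
{code}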
[jira] [Issue Comment Deleted] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tobias Hermann updated SPARK-30421: --- Comment: was deleted (was: [~dongjoon] Thanks, I think that's not good. So I just opened a Pandas issue too. :D [https://github.com/pandas-dev/pandas/issues/31272]) > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022715#comment-17022715 ] Tobias Hermann commented on SPARK-30421: [~dongjoon] Thanks, I think that's not good. So I just opened a Pandas issue too. :D [https://github.com/pandas-dev/pandas/issues/31272] > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30631) Mitigate SQL injections - can't parameterize query parameters for JDBC connectors
Jorge created SPARK-30631: - Summary: Mitigate SQL injections - can't parameterize query parameters for JDBC connectors Key: SPARK-30631 URL: https://issues.apache.org/jira/browse/SPARK-30631 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4 Reporter: Jorge One of the options to read from a JDBC connection is a query. Sometimes, this query is parameterized (e.g. column names, values, etc.). Spark's JDBC read options do not support parameterizing the SQL query, which puts the burden of escaping SQL on the developer. This burden is unnecessary and a security risk. Very often, drivers provide a specific API to securely parameterize SQL statements. This issue proposes allowing developers to pass "query" and "parameters" to the JDBC options, so that it is the driver, not the developer, that escapes the parameters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
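To make the proposal concrete, here is a purely illustrative sketch of how such an API might be used. The {{parameters}} option, its value format, and the connection details are all hypothetical (nothing like this exists in Spark's JDBC options today); the point is only that the driver, via a prepared statement, would bind the values instead of the developer splicing them into the query string:
{code:scala}
// Hypothetical usage sketch only; the "parameters" option is not a real Spark option.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/sales")   // placeholder connection details
  .option("query", "SELECT * FROM orders WHERE country = ? AND amount > ?")
  .option("parameters", "US,100")   // hypothetical: values bound by the driver, not string-spliced
  .load()
orders.show()
{code}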
[jira] [Created] (SPARK-30630) Deprecate numTrees in GBT
Huaxin Gao created SPARK-30630: -- Summary: Deprecate numTrees in GBT Key: SPARK-30630 URL: https://issues.apache.org/jira/browse/SPARK-30630 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.4.5, 3.0.0 Reporter: Huaxin Gao Currently, GBT has {code:java} /** * Number of trees in ensemble */ @Since("2.0.0") val getNumTrees: Int = trees.length{code} and {code:java} /** Number of trees in ensemble */ val numTrees: Int = trees.length{code} I will deprecate numTrees in 2.4.5 and remove it in 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
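A minimal sketch of what the deprecation could look like, shown on a stand-in class ({{GBTModelSketch}} is hypothetical, and the exact deprecation message and version strings are assumptions, not the final change):
{code:scala}
import org.apache.spark.annotation.Since

class GBTModelSketch(trees: Array[AnyRef]) {

  /**
   * Number of trees in ensemble
   */
  @Since("2.0.0")
  val getNumTrees: Int = trees.length

  /** Number of trees in ensemble (deprecated alias of getNumTrees) */
  @deprecated("Use getNumTrees instead; this will be removed in 3.0.0.", "2.4.5")
  val numTrees: Int = trees.length
}
{code}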
[jira] [Resolved] (SPARK-30627) Disable all the V2 file sources in Spark 3.0 by default
[ https://issues.apache.org/jira/browse/SPARK-30627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30627. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27348 [https://github.com/apache/spark/pull/27348] > Disable all the V2 file sources in Spark 3.0 by default > --- > > Key: SPARK-30627 > URL: https://issues.apache.org/jira/browse/SPARK-30627 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > There are still some missing parts in the file source V2 framework: > 1. It doesn't support reporting file scan metrics such as > "numOutputRows"/"numFiles"/"fileSize" like `FileSourceScanExec`. > 2. It doesn't support partition pruning with subqueries or dynamic partition > pruning. > As we are going to code freeze on Jan 31st, I suggest disabling all the V2 > file sources in Spark 3.0 by default. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-14643: - Assignee: (was: Josh Rosen) > Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Task > Components: Build, Project Infra >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is build against Scala 2.12). > This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. > We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14643: Priority: Blocker (was: Major) > Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Task > Components: Build, Project Infra >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Blocker > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is build against Scala 2.12). > This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. > We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14643: Target Version/s: 3.0.0 > Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Task > Components: Build, Project Infra >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Blocker > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is build against Scala 2.12). > This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. > We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022694#comment-17022694 ] Dongjoon Hyun commented on SPARK-30421: --- Technically, Python `pandas` follows the same lazy manner, [~tobias_hermann]. {code} >>> df col1 col2 0 1 3 1 2 4 >>> df.drop(columns=["col1"]).loc[df["col1"] == 1] col2 0 3 {code} > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables
[ https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022678#comment-17022678 ] Burak Yavuz commented on SPARK-30612: - I prefer SubqueryAlias. We need to support all degrees of the user provided identifier I believe: SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT tbl.foo FROM testcat.ns1.ns2.tbl should all work. However I'm not sure if SELECT spark_catalog.default.tbl.foo FROM tbl should work. Are my assumptions correct? > can't resolve qualified column name with v2 tables > -- > > Key: SPARK-30612 > URL: https://issues.apache.org/jira/browse/SPARK-30612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > When running queries with qualified columns like `SELECT t.a FROM t`, it > fails to resolve for v2 tables. > v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We > should do the same for v2 tables. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation
[ https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022676#comment-17022676 ] Dongjoon Hyun commented on SPARK-30617: --- Thanks, [~994184...@qq.com]. I added the existing JIRAs. I'd like to recommend the followings according to the community guide. - https://spark.apache.org/contributing.html 1. Please don't set `Fix Versions`. That is used by the committer when the PR is merged finally. 2. For `Affected Version`, please set the master branch version number for the new feature JIRA. (For now, it's 3.0.0.) Since Apache Spark allows bug-fix backporting only, there is no way to affect released versions. 3. If possible, please search before creating a JIRA. Usually, people think in the similar ways. > Is there any possible that spark no longer restrict enumerate types of > spark.sql.catalogImplementation > -- > > Key: SPARK-30617 > URL: https://issues.apache.org/jira/browse/SPARK-30617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: weiwenda >Priority: Minor > > # We have implemented a complex ExternalCatalog which is used for retrieving > multi isomerism database's metadata(sush as elasticsearch、postgresql), so > that we can make a mixture query between hive and our online data. > # But as spark require that value of spark.sql.catalogImplementation must be > one of in-memory/hive, we have to modify SparkSession and rebuild spark to > make our project work. > # Finally, we hope spark removing above restriction, so that it's will be > much easier to let us keep pace with new spark version. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables
[ https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022675#comment-17022675 ] Terry Kim commented on SPARK-30612: --- Thanks [~brkyvz] There are two approaches we can take. One is to wrap v2 table with `SubqueryAlias`. Another is to update `DataSourceV2Relation`'s output (Seq[AttributeReference]) to have qualifier directly (after SPARK-30314). Which route should I take? > can't resolve qualified column name with v2 tables > -- > > Key: SPARK-30612 > URL: https://issues.apache.org/jira/browse/SPARK-30612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > When running queries with qualified columns like `SELECT t.a FROM t`, it > fails to resolve for v2 tables. > v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We > should do the same for v2 tables. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation
[ https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30617: -- Affects Version/s: (was: 2.4.4) 3.0.0 > Is there any possible that spark no longer restrict enumerate types of > spark.sql.catalogImplementation > -- > > Key: SPARK-30617 > URL: https://issues.apache.org/jira/browse/SPARK-30617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: weiwenda >Priority: Minor > > # We have implemented a complex ExternalCatalog which is used for retrieving > multi isomerism database's metadata(sush as elasticsearch、postgresql), so > that we can make a mixture query between hive and our online data. > # But as spark require that value of spark.sql.catalogImplementation must be > one of in-memory/hive, we have to modify SparkSession and rebuild spark to > make our project work. > # Finally, we hope spark removing above restriction, so that it's will be > much easier to let us keep pace with new spark version. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation
[ https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30617: -- Fix Version/s: (was: 2.4.6) (was: 3.1.0) > Is there any possible that spark no longer restrict enumerate types of > spark.sql.catalogImplementation > -- > > Key: SPARK-30617 > URL: https://issues.apache.org/jira/browse/SPARK-30617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: weiwenda >Priority: Minor > > # We have implemented a complex ExternalCatalog which is used for retrieving > multi isomerism database's metadata(sush as elasticsearch、postgresql), so > that we can make a mixture query between hive and our online data. > # But as spark require that value of spark.sql.catalogImplementation must be > one of in-memory/hive, we have to modify SparkSession and rebuild spark to > make our project work. > # Finally, we hope spark removing above restriction, so that it's will be > much easier to let us keep pace with new spark version. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow
Maciej Szymkiewicz created SPARK-30629: -- Summary: cleanClosure on recursive call leads to node stack overflow Key: SPARK-30629 URL: https://issues.apache.org/jira/browse/SPARK-30629 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 3.0.0 Reporter: Maciej Szymkiewicz This problem surfaced while handling SPARK-22817. In theory there are tests that cover this problem, but they appear to have been dead (not actually exercised) for some reason. Reproducible example: {code:r} f <- function(x) { f(x) } newF <- cleanClosure(f) {code} Looking at the {{cleanClosure}} / {{processClosure}} pair, the function currently being processed is not added to {{checkedFuncs}}, so a recursive call keeps being re-processed until the node stack overflows. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022631#comment-17022631 ] Dongjoon Hyun commented on SPARK-28921: --- Thank you for updating, [~thesuperzapper]. What problem did you hit when you don't change the others? BTW, 2.4.5 RC2 vote is coming. > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022630#comment-17022630 ] Dongjoon Hyun commented on SPARK-30218: --- How do you disambiguate them? Could you describe your idea? > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.4, 2.4.4 >Reporter: Francesco Cavrini >Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. > {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A|0|id1| B|1| > |id1| A|0|id1| B|5| > |id1| A|1|id1| B|1| > |id1| A|1|id1| B|5| > |id2| A|2|id2| B| 10| > |id2| A|3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. > The plan shows the problem in the column resolution. > {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` >+- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022630#comment-17022630 ] Dongjoon Hyun edited comment on SPARK-30218 at 1/24/20 1:16 AM: How do you disambiguate them? Could you describe your idea, [~rkins]? was (Author: dongjoon): How do you disambiguate them? Could you describe your idea? > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.4, 2.4.4 >Reporter: Francesco Cavrini >Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. > {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A|0|id1| B|1| > |id1| A|0|id1| B|5| > |id1| A|1|id1| B|1| > |id1| A|1|id1| B|5| > |id2| A|2|id2| B| 10| > |id2| A|3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. > The plan shows the problem in the column resolution. > {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` >+- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30625) Add `escapeChar` parameter to the `like` function
[ https://issues.apache.org/jira/browse/SPARK-30625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022628#comment-17022628 ] Dongjoon Hyun commented on SPARK-30625: --- +1 > Add `escapeChar` parameter to the `like` function > - > > Key: SPARK-30625 > URL: https://issues.apache.org/jira/browse/SPARK-30625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > SPARK-28083 supported LIKE ... ESCAPE syntax > {code:sql} > spark-sql> SELECT '_Apache Spark_' like '__%Spark__' escape '_'; > true > {code} > but the `like` function can accept only 2 parameters. If we pass the third > one, it fails with: > {code:sql} > spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_'); > Error in query: Invalid number of arguments for function like. Expected: 2; > Found: 3; line 1 pos 7 > {code} > The ticket aims to support the third parameter in `like` as `escapeChar`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
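For reference, the ESCAPE syntax from SPARK-28083 can already be exercised from Scala through plain SQL; only the three-argument function form proposed here is missing. A minimal sketch, assuming a SparkSession {{spark}} on a build that includes SPARK-28083:
{code:scala}
// Works today: LIKE ... ESCAPE handled by the SQL parser (SPARK-28083).
spark.sql("SELECT '_Apache Spark_' LIKE '__%Spark__' ESCAPE '_'").show()  // true

// Proposed by this ticket, does not work yet: the escape character passed
// as a third argument to the like() function.
// spark.sql("SELECT like('_Apache Spark_', '__%Spark__', '_')").show()
{code}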
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022625#comment-17022625 ] Mathew Wicks commented on SPARK-28921: -- It is not enough to replace the kuberntes-client.jar in your $SPARK_HOME/jars, you must also replace: * $SPARK_HOME/jars/kubernetes-client-*.jar * $SPARK_HOME/jars/kubernetes-model-common-*jar * $SPARK_HOME/jars/kubernetes-model-*.jar * $SPARK_HOME/jars/okhttp-*.jar * $SPARK_HOME/jars/okio-*.jar With the versions specified in this PR: https://github.com/apache/spark/commit/65c0a7812b472147c615fb4fe779da9d0a11ff18 > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022625#comment-17022625 ] Mathew Wicks edited comment on SPARK-28921 at 1/24/20 1:03 AM: --- It is not enough to replace the kuberntes-client.jar in your $SPARK_HOME/jars, you must also replace: * $SPARK_HOME/jars/kubernetes-client-*.jar * $SPARK_HOME/jars/kubernetes-model-common-*jar * $SPARK_HOME/jars/kubernetes-model-*.jar * $SPARK_HOME/jars/okhttp-*.jar * $SPARK_HOME/jars/okio-*.jar With the versions specified in this PR: [https://github.com/apache/spark/commit/65c0a7812b472147c615fb4fe779da9d0a11ff18] was (Author: thesuperzapper): It is not enough to replace the kuberntes-client.jar in your $SPARK_HOME/jars, you must also replace: * $SPARK_HOME/jars/kubernetes-client-*.jar * $SPARK_HOME/jars/kubernetes-model-common-*jar * $SPARK_HOME/jars/kubernetes-model-*.jar * $SPARK_HOME/jars/okhttp-*.jar * $SPARK_HOME/jars/okio-*.jar With the versions specified in this PR: https://github.com/apache/spark/commit/65c0a7812b472147c615fb4fe779da9d0a11ff18 > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30615) normalize the column name in AlterTable
[ https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022617#comment-17022617 ] Terry Kim commented on SPARK-30615: --- Thanks [~brkyvz] for heads up. > normalize the column name in AlterTable > --- > > Key: SPARK-30615 > URL: https://issues.apache.org/jira/browse/SPARK-30615 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Because of case insensitive resolution, the column name in AlterTable may > match the table schema but not exactly the same. To ease DS v2 > implementations, Spark should normalize the column name before passing them > to v2 catalogs, so that users don't need to care about the case sensitive > config. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30628) File source V2: Support partition pruning with subqueries
Gengliang Wang created SPARK-30628: -- Summary: File source V2: Support partition pruning with subqueries Key: SPARK-30628 URL: https://issues.apache.org/jira/browse/SPARK-30628 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022596#comment-17022596 ] Rahul Kumar Challapalli commented on SPARK-30218: - We currently are detecting that there is a self-join, but the OP seems to be asking about why spark doesn't disambiguate the columns. So I am not sure if we can close this issue. Thoughts? > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.4, 2.4.4 >Reporter: Francesco Cavrini >Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. > {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A|0|id1| B|1| > |id1| A|0|id1| B|5| > |id1| A|1|id1| B|1| > |id1| A|1|id1| B|5| > |id2| A|2|id2| B| 10| > |id2| A|3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. > The plan shows the problem in the column resolution. > {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` >+- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
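Until the analyzer can disambiguate such columns, one common workaround is to rename the columns on one side of the self-join before building the conditions, so the inequality can no longer collapse onto a single attribute. A minimal Scala sketch of that idea, assuming a spark-shell session; the {{r_}} prefixed names are arbitrary:
{code:scala}
// Same data as in the report, but the right-hand side gets distinct column
// names before the join, so the BETWEEN refers to two different attributes.
val data = Seq(
  ("id1", "A", 0L), ("id1", "A", 1L), ("id2", "A", 2L), ("id2", "A", 3L),
  ("id1", "B", 1L), ("id1", "B", 5L), ("id2", "B", 10L)
).toDF("id", "kind", "timestamp")

val left = data.where($"kind" === "A")
val right = data.where($"kind" === "B")
  .withColumnRenamed("id", "r_id")
  .withColumnRenamed("timestamp", "r_timestamp")

val res = left.join(
  right,
  $"id" === $"r_id" && $"r_timestamp".between($"timestamp", $"timestamp" + 2),
  "left")
res.show()  // matched rows now honor the BETWEEN condition; unmatched left rows come back with nulls
{code}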
[jira] [Created] (SPARK-30627) Disable all the V2 file sources in Spark 3.0 by default
Gengliang Wang created SPARK-30627: -- Summary: Disable all the V2 file sources in Spark 3.0 by default Key: SPARK-30627 URL: https://issues.apache.org/jira/browse/SPARK-30627 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang Assignee: Gengliang Wang There are still some missing parts in the file source V2 framework: 1. It doesn't support reporting file scan metrics such as "numOutputRows"/"numFiles"/"fileSize" like `FileSourceScanExec`. 2. It doesn't support partition pruning with subqueries or dynamic partition pruning. As we are going to code freeze on Jan 31st, I suggest disabling all the V2 file sources in Spark 3.0 by default. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30298) bucket join cannot work for self-join with views
[ https://issues.apache.org/jira/browse/SPARK-30298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30298. -- Fix Version/s: 3.0.0 Assignee: Terry Kim Resolution: Fixed Resolved by https://github.com/apache/spark/pull/26943 > bucket join cannot work for self-join with views > > > Key: SPARK-30298 > URL: https://issues.apache.org/jira/browse/SPARK-30298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiaoju Wu >Assignee: Terry Kim >Priority: Minor > Fix For: 3.0.0 > > > This UT may fail at the last line: > {code:java} > test("bucket join cannot work for self-join with views") { > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "1") { > withTable("t1") { > val df = (0 until 20).map(i => (i, i)).toDF("i", "j").as("df") > df.write > .format("parquet") > .bucketBy(8, "i") > .saveAsTable("t1") > sql(s"create view v1 as select * from t1").collect() > val plan1 = sql("SELECT * FROM t1 a JOIN t1 b ON a.i = > b.i").queryExecution.executedPlan > assert(plan1.collect { case exchange : ShuffleExchangeExec => > exchange }.isEmpty) > val plan2 = sql("SELECT * FROM t1 a JOIN v1 b ON a.i = > b.i").queryExecution.executedPlan > assert(plan2.collect { case exchange : ShuffleExchangeExec => > exchange }.isEmpty) > } > } > } > {code} > It's because View will add Project with Alias, then Join's > requiredDistribution is based on Alias, but ProjectExec passes child's > outputPartition up without Alias. Then the satisfies check cannot meet in > EnsureRequirement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28396) Add PathCatalog for data source V2
[ https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-28396. Resolution: Won't Fix > Add PathCatalog for data source V2 > -- > > Key: SPARK-28396 > URL: https://issues.apache.org/jira/browse/SPARK-28396 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Add PathCatalog for data source V2, so that: > 1. We can convert SaveMode in DataFrameWriter into catalog table operations, > instead of supporting SaveMode in file source V2. > 2. Support create-table SQL statements like "CREATE TABLE orc.'path'" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30625) Add `escapeChar` parameter to the `like` function
[ https://issues.apache.org/jira/browse/SPARK-30625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022579#comment-17022579 ] Takeshi Yamamuro commented on SPARK-30625: -- Yea, supporting that looks fine to me. > Add `escapeChar` parameter to the `like` function > - > > Key: SPARK-30625 > URL: https://issues.apache.org/jira/browse/SPARK-30625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > SPARK-28083 supported LIKE ... ESCAPE syntax > {code:sql} > spark-sql> SELECT '_Apache Spark_' like '__%Spark__' escape '_'; > true > {code} > but the `like` function can accept only 2 parameters. If we pass the third > one, it fails with: > {code:sql} > spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_'); > Error in query: Invalid number of arguments for function like. Expected: 2; > Found: 3; line 1 pos 7 > {code} > The ticket aims to support the third parameter in `like` as `escapeChar`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30615) normalize the column name in AlterTable
[ https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022577#comment-17022577 ] Burak Yavuz commented on SPARK-30615: - I actually had a PR in progress on this. Let me push that > normalize the column name in AlterTable > --- > > Key: SPARK-30615 > URL: https://issues.apache.org/jira/browse/SPARK-30615 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Because of case insensitive resolution, the column name in AlterTable may > match the table schema but not exactly the same. To ease DS v2 > implementations, Spark should normalize the column name before passing them > to v2 catalogs, so that users don't need to care about the case sensitive > config. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022576#comment-17022576 ] Nicholas Chammas commented on SPARK-19248: -- Thanks for getting to the bottom of the issue, [~jeff.w.evans], and for providing a workaround. Would an appropriate solution be to make {{escapedStringLiterals}} default to {{True}}? Or does that cause other problems? > Regex_replace works in 1.6 but not in 2.0 > - > > Key: SPARK-19248 > URL: https://issues.apache.org/jira/browse/SPARK-19248 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.4.3 >Reporter: Lucas Tittmann >Priority: Major > Labels: correctness > > We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, > we get the following, expected behaviour: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'5')] > {noformat} > In Spark 2.0.2, with the same code, we get the following: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'')] > {noformat} > As you can see, the second regex shows different behaviour depending on the > Spark version. We checked the regex in Java, and both should be correct and > work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not > have the possibility to confirm in 2.1 at the moment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
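An alternative that avoids flipping the config is to escape the backslash itself inside the SQL string literal, so the regex engine still receives {{\.}}. A minimal Scala sketch, assuming the default parser behaviour (i.e. {{spark.sql.parser.escapedStringLiterals}} left unset/false) and a spark-shell session:
{code:scala}
val df = Seq((0, ".. 5.")).toDF("id", "col")

// With escapedStringLiterals=false (the default since 2.0), the SQL parser
// consumes one level of backslashes, so '\\.' in the SQL literal reaches the
// regex engine as '\.' and only literal dots and spaces are removed.
df.selectExpr("""regexp_replace(col, '( |\\.)*', '') AS col""").show()
// expected value in the "col" column: 5
{code}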
[jira] [Commented] (SPARK-30275) Add gitlab-ci.yml file for reproducible builds
[ https://issues.apache.org/jira/browse/SPARK-30275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022569#comment-17022569 ] Jim Kleckner commented on SPARK-30275: -- * Builds on my Mac use whatever I have installed on my machine whereas having a well-defined remote CI system eliminates variability. * The build process doesn't load my local system. * A push is just a git push rather than an image push which from home can take a long time since my ISP has very wimpy upload speeds. Obviously some CI/CD tooling exists for spark testing and release on the back end, but that isn't available to most people. > Add gitlab-ci.yml file for reproducible builds > -- > > Key: SPARK-30275 > URL: https://issues.apache.org/jira/browse/SPARK-30275 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jim Kleckner >Priority: Minor > > It would be desirable to have public reproducible builds such as provided by > gitlab or others. > > Here is a candidate patch set to build spark using gitlab-ci: > * https://gitlab.com/jkleckner/spark/tree/add-gitlab-ci-yml > Let me know if there is interest in a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28403) Executor Allocation Manager can add an extra executor when speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-28403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022527#comment-17022527 ] Thomas Graves commented on SPARK-28403: --- So after looking at the PR for this, this logic may have been an attempt to get executors on different hosts. The speculation logic in the scheduler is such that it will only run a speculative task on a different host than the currently running task. > Executor Allocation Manager can add an extra executor when speculative tasks > > > Key: SPARK-28403 > URL: https://issues.apache.org/jira/browse/SPARK-28403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Major > > It looks like SPARK-19326 added a bug in the executor allocation manager > where it adds an extra executor when it shouldn't, in the case where there are pending > speculative tasks but the target number didn't change. > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L377] > It doesn't look like this is necessary since the pending speculative tasks are already > added in. > See the questioning of this on the PR at: > https://github.com/apache/spark/pull/18492/files#diff-b096353602813e47074ace09a3890d56R379 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30603) Keep the reserved properties of namespaces and tables private
[ https://issues.apache.org/jira/browse/SPARK-30603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30603: - Assignee: Kent Yao > Keep the reserved properties of namespaces and tables private > - > > Key: SPARK-30603 > URL: https://issues.apache.org/jira/browse/SPARK-30603 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > the reserved properties of namespaces and tables should be private -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30603) Keep the reserved properties of namespaces and tables private
[ https://issues.apache.org/jira/browse/SPARK-30603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30603. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27318 [https://github.com/apache/spark/pull/27318] > Keep the reserved properties of namespaces and tables private > - > > Key: SPARK-30603 > URL: https://issues.apache.org/jira/browse/SPARK-30603 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > the reserved properties of namespaces and tables should be private -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30570) Update scalafmt to 1.0.3 with onlyChangedFiles feature
[ https://issues.apache.org/jira/browse/SPARK-30570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30570. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27279 [https://github.com/apache/spark/pull/27279] > Update scalafmt to 1.0.3 with onlyChangedFiles feature > -- > > Key: SPARK-30570 > URL: https://issues.apache.org/jira/browse/SPARK-30570 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Cody Koeninger >Assignee: Cody Koeninger >Priority: Minor > Fix For: 3.0.0 > > > [https://github.com/SimonJPegg/mvn_scalafmt/releases/tag/v1.0.3] > added an option onlyChangedFiles which was one of the things holding back the > upgrade in SPARK-29293 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30570) Update scalafmt to 1.0.3 with onlyChangedFiles feature
[ https://issues.apache.org/jira/browse/SPARK-30570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30570: - Assignee: Cody Koeninger > Update scalafmt to 1.0.3 with onlyChangedFiles feature > -- > > Key: SPARK-30570 > URL: https://issues.apache.org/jira/browse/SPARK-30570 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Cody Koeninger >Assignee: Cody Koeninger >Priority: Minor > > [https://github.com/SimonJPegg/mvn_scalafmt/releases/tag/v1.0.3] > added an option onlyChangedFiles which was one of the things holding back the > upgrade in SPARK-29293 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430 ] Jeff Evans edited comment on SPARK-19248 at 1/23/20 8:06 PM: - After some debugging, I figured out what's going on here. The crux of this is the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under SPARK-20399. This behavior changed in 2.0 (see [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]). If you start your PySpark sessions described above with this line: {{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}} then you should see the 1.6 behavior. Otherwise, you need to escape the literal backslash before the dot character (and of course, in a string literal, the backslashes themselves also need escaped), so you would need the pattern to be {{'( |\\.)*'}} By the way, this isn't Python-specific behavior. Even if you use a Scala session, and use the {{expr}} expression (which I don't see in the sample sessions above), you will notice the same thing happening.
{code}
val df = spark.createDataFrame(Seq((0, ".. 5."))).toDF("id","col")

df.selectExpr("""regexp_replace(col, "( |\.)*", "")""").show()
+-----------------------------+
|regexp_replace(col, ( |.)*, )|
+-----------------------------+
|                             |
+-----------------------------+

spark.conf.set("spark.sql.parser.escapedStringLiterals", true)

df.selectExpr("""regexp_replace(col, "( |\.)*", "")""").show()
+------------------------------+
|regexp_replace(col, ( |\.)*, )|
+------------------------------+
|                             5|
+------------------------------+
{code}
was (Author: jeff.w.evans): After some debugging, I figured out what's going on here. The crux of this is the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under SPARK-20399. This behavior changed in 2.0 (see [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]). If you start your PySpark sessions described above with this line: {{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}} then you should see the 1.6 behavior. Otherwise, you need to escape the literal backslash before the dot character (and of course, in a string literal, the backslashes themselves also need escaped), so you would need the pattern to be {{'( |\\.)*'}} > Regex_replace works in 1.6 but not in 2.0 > - > > Key: SPARK-19248 > URL: https://issues.apache.org/jira/browse/SPARK-19248 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.4.3 >Reporter: Lucas Tittmann >Priority: Major > Labels: correctness > > We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, > we get the following, expected behaviour: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'5')] > {noformat} > In Spark 2.0.2, with the same code, we get the following: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'')] > {noformat} > As you can see, the second regex shows different behaviour depending on the > Spark version. We checked the regex in Java, and both should be correct and > work.
Therefore, regex execution in 2.0.2 seems to be erroneous. I do not > have the possibility to confirm in 2.1 at the moment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430 ] Jeff Evans edited comment on SPARK-19248 at 1/23/20 7:53 PM: - After some debugging, I figured out what's going on here. The crux of this is the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under SPARK-20399. This behavior changed in 2.0 (see [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]). If you start your PySpark sessions described above with this line: {{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}} then you should see the 1.6 behavior. Otherwise, you need to escape the literal backslash before the dot character, so you would need the pattern to be {{'( |\\.)*'}} was (Author: jeff.w.evans): After some debugging, I figured out what's going on here. The crux of this is the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under SPARK-20399. This behavior changed in 2.0 (see [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]). If you start your PySpark sessions described above with this line: {{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}} then you should see the 1.6 behavior. > Regex_replace works in 1.6 but not in 2.0 > - > > Key: SPARK-19248 > URL: https://issues.apache.org/jira/browse/SPARK-19248 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.4.3 >Reporter: Lucas Tittmann >Priority: Major > Labels: correctness > > We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, > we get the following, expected behaviour: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'5')] > {noformat} > In Spark 2.0.2, with the same code, we get the following: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'')] > {noformat} > As you can see, the second regex shows different behaviour depending on the > Spark version. We checked the regex in Java, and both should be correct and > work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not > have the possibility to confirm in 2.1 at the moment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430 ] Jeff Evans edited comment on SPARK-19248 at 1/23/20 7:53 PM: - After some debugging, I figured out what's going on here. The crux of this is the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under SPARK-20399. This behavior changed in 2.0 (see [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]). If you start your PySpark sessions described above with this line: {{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}} then you should see the 1.6 behavior. Otherwise, you need to escape the literal backslash before the dot character (and of course, in a string literal, the backslashes themselves also need escaped), so you would need the pattern to be {{'( |\\.)*'}} was (Author: jeff.w.evans): After some debugging, I figured out what's going on here. The crux of this is the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under SPARK-20399. This behavior changed in 2.0 (see [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]). If you start your PySpark sessions described above with this line: {{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}} then you should see the 1.6 behavior. Otherwise, you need to escape the literal backslash before the dot character, so you would need the pattern to be {{'( |\\.)*'}} > Regex_replace works in 1.6 but not in 2.0 > - > > Key: SPARK-19248 > URL: https://issues.apache.org/jira/browse/SPARK-19248 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.4.3 >Reporter: Lucas Tittmann >Priority: Major > Labels: correctness > > We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, > we get the following, expected behaviour: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'5')] > {noformat} > In Spark 2.0.2, with the same code, we get the following: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'')] > {noformat} > As you can see, the second regex shows different behaviour depending on the > Spark version. We checked the regex in Java, and both should be correct and > work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not > have the possibility to confirm in 2.1 at the moment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430 ] Jeff Evans commented on SPARK-19248: After some debugging, I figured out what's going on here. The crux of this is the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under SPARK-20399. This behavior changed in 2.0 (see [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]). If you start your PySpark sessions described above with this line: {{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}} then you should see the 1.6 behavior. > Regex_replace works in 1.6 but not in 2.0 > - > > Key: SPARK-19248 > URL: https://issues.apache.org/jira/browse/SPARK-19248 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.4.3 >Reporter: Lucas Tittmann >Priority: Major > Labels: correctness > > We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, > we get the following, expected behaviour: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'5')] > {noformat} > In Spark 2.0.2, with the same code, we get the following: > {noformat} > df = sqlContext.createDataFrame([('.. 5.',)], ['col']) > dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect() > z.show(dfout) > >>> [Row(col=u'5')] > dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS > col"]).collect() > z.show(dfout2) > >>> [Row(col=u'')] > {noformat} > As you can see, the second regex shows different behaviour depending on the > Spark version. We checked the regex in Java, and both should be correct and > work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not > have the possibility to confirm in 2.1 at the moment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env
[ https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022426#comment-17022426 ] Jiaxin Shan commented on SPARK-30626: - I have an improvement change for this and let me create a PR > [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env > > > Key: SPARK-30626 > URL: https://issues.apache.org/jira/browse/SPARK-30626 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jiaxin Shan >Priority: Minor > > This should be a minor improvement. > The use case is we want to look up environment variables and create > application folder and redirect driver logs to application folder. Executors > has it and we want to make a change to driver as well. > > {code:java} > Limits: > cpu: 1024m > memory: 896Mi > Requests: > cpu: 1 > memory: 896Mi > Environment: > SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) > SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e > SPARK_CONF_DIR: /opt/spark/conf{code} > > [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79] > We need SPARK_APPLICATION_ID inside the pod to organize logs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env
[ https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaxin Shan updated SPARK-30626: Summary: [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env (was: [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID) > [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env > > > Key: SPARK-30626 > URL: https://issues.apache.org/jira/browse/SPARK-30626 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jiaxin Shan >Priority: Minor > > This should be a minor improvement. > The use case is we want to look up environment variables and create > application folder and redirect driver logs to application folder. Executors > has it and we want to make a change to driver as well. > > {code:java} > Limits: > cpu: 1024m > memory: 896Mi > Requests: > cpu: 1 > memory: 896Mi > Environment: > SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) > SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e > SPARK_CONF_DIR: /opt/spark/conf{code} > > [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79] > We need SPARK_APPLICATION_ID inside the pod to organize logs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID
[ https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaxin Shan updated SPARK-30626: Description: This should be a minor improvement. The use case is we want to look up environment variables and create application folder and redirect driver logs to application folder. Executors has it and we want to make a change to driver as well. {code:java} Limits: cpu: 1024m memory: 896Mi Requests: cpu: 1 memory: 896Mi Environment: SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e SPARK_CONF_DIR: /opt/spark/conf{code} [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79] We need SPARK_APPLICATION_ID inside the pod to organize logs was: This should be a minor improvement. The use case is we want to look up environment variables and create application folder and redirect driver logs to application folder. Executors has it and we want to make a change to driver as well. ``` Limits: cpu: 1024m memory: 896Mi Requests: cpu: 1 memory: 896Mi Environment: SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e SPARK_CONF_DIR: /opt/spark/conf ``` https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79 We need SPARK_APPLICATION_ID inside the pod to organize logs > [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID > > > Key: SPARK-30626 > URL: https://issues.apache.org/jira/browse/SPARK-30626 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jiaxin Shan >Priority: Minor > > This should be a minor improvement. > The use case is we want to look up environment variables and create > application folder and redirect driver logs to application folder. Executors > has it and we want to make a change to driver as well. > > {code:java} > Limits: > cpu: 1024m > memory: 896Mi > Requests: > cpu: 1 > memory: 896Mi > Environment: > SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) > SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e > SPARK_CONF_DIR: /opt/spark/conf{code} > > [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79] > We need SPARK_APPLICATION_ID inside the pod to organize logs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID
Jiaxin Shan created SPARK-30626: --- Summary: [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID Key: SPARK-30626 URL: https://issues.apache.org/jira/browse/SPARK-30626 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.4, 3.0.0 Reporter: Jiaxin Shan This should be a minor improvement. The use case is we want to look up environment variables and create application folder and redirect driver logs to application folder. Executors has it and we want to make a change to driver as well. ``` Limits: cpu: 1024m memory: 896Mi Requests: cpu: 1 memory: 896Mi Environment: SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e SPARK_CONF_DIR: /opt/spark/conf ``` https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79 We need SPARK_APPLICATION_ID inside the pod to organize logs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
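To make the proposal above concrete, here is a minimal sketch of the kind of change involved, using the fabric8 builder API that Spark's Kubernetes feature steps are built on. The constant and helper names below are assumptions for illustration, not the actual patch:
{code:scala}
import io.fabric8.kubernetes.api.model.{Container, ContainerBuilder}

// Assumed constant name for illustration; executors already receive this variable.
val ENV_APPLICATION_ID = "SPARK_APPLICATION_ID"

// Hypothetical helper: add the application ID to the driver container's
// environment, mirroring what the executor pods already get.
def withAppIdEnv(driverContainer: Container, appId: String): Container =
  new ContainerBuilder(driverContainer)
    .addNewEnv()
      .withName(ENV_APPLICATION_ID)
      .withValue(appId)
    .endEnv()
    .build()
{code}
With a change along these lines, the driver entrypoint could build per-application log directories from {{$SPARK_APPLICATION_ID}}, which is the use case described in the ticket.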
[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution
[ https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022354#comment-17022354 ] Ivelin Tchangalov commented on SPARK-27913: --- I'm curious if there's any progress or solution for this issue. > Spark SQL's native ORC reader implements its own schema evolution > - > > Key: SPARK-27913 > URL: https://issues.apache.org/jira/browse/SPARK-27913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Owen O'Malley >Priority: Major > > ORC's reader handles a wide range of schema evolution, but the Spark SQL > native ORC bindings do not provide the desired schema to the ORC reader. This > causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30625) Add `escapeChar` parameter to the `like` function
[ https://issues.apache.org/jira/browse/SPARK-30625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022398#comment-17022398 ] Maxim Gekk commented on SPARK-30625: I have implemented the feature but I am not sure it is useful. Should I submit a PR for that, WDYT [~dongjoon] [~Gengliang.Wang] [~cloud_fan] [~beliefer] [~maropu] [~hyukjin.kwon] ? > Add `escapeChar` parameter to the `like` function > - > > Key: SPARK-30625 > URL: https://issues.apache.org/jira/browse/SPARK-30625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > SPARK-28083 supported LIKE ... ESCAPE syntax > {code:sql} > spark-sql> SELECT '_Apache Spark_' like '__%Spark__' escape '_'; > true > {code} > but the `like` function can accept only 2 parameters. If we pass the third > one, it fails with: > {code:sql} > spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_'); > Error in query: Invalid number of arguments for function like. Expected: 2; > Found: 3; line 1 pos 7 > {code} > The ticket aims to support the third parameter in `like` as `escapeChar`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
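If the third parameter is added as this ticket proposes, the function form would presumably mirror the ESCAPE syntax; a sketch of the intended behavior, not a committed API:
{code:sql}
-- Works today (SPARK-28083):
SELECT '_Apache Spark_' like '__%Spark__' escape '_';
-- true

-- Proposed: a third argument acting as the escape character
SELECT like('_Apache Spark_', '__%Spark__', '_');
-- would presumably also return true
{code}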
[jira] [Commented] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause
[ https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022348#comment-17022348 ] Sofia commented on SPARK-27282: --- No idea [~tgraves], I'm still working with spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1) and scala ==> 2.11.8. When I did a first-level debugging pass using explain(true), I found that the main cause of this error is the misapplied *+ReuseExchange in the optimized plan+*. I used a workaround to handle this issue. > Spark incorrect results when using UNION with GROUP BY clause > - > > Key: SPARK-27282 > URL: https://issues.apache.org/jira/browse/SPARK-27282 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit, SQL >Affects Versions: 2.3.2 > Environment: I'm using : > IntelliJ IDEA ==> 2018.1.4 > spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1) > scala ==> 2.11.8 >Reporter: Sofia >Priority: Blocker > Labels: correctness > > When using UNION clause after a GROUP BY clause in spark, the results > obtained are wrong. > The following example explicit this issue: > {code:java} > CREATE TABLE test_un ( > col1 varchar(255), > col2 varchar(255), > col3 varchar(255), > col4 varchar(255) > ); > INSERT INTO test_un (col1, col2, col3, col4) > VALUES (1,1,2,4), > (1,1,2,4), > (1,1,3,5), > (2,2,2,null); > {code} > I used the following code : > {code:java} > val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un") > val y = x >.filter(col("col4")isNotNull) > .groupBy("col1", "col2","col3") > .agg(count(col("col3")).alias("cnt")) > .withColumn("col_name", lit("col3")) > .select(col("col1"), col("col2"), > col("col_name"),col("col3").alias("col_value"), col("cnt")) > val z = x > .filter(col("col4")isNotNull) > .groupBy("col1", "col2","col4") > .agg(count(col("col4")).alias("cnt")) > .withColumn("col_name", lit("col4")) > .select(col("col1"), col("col2"), > col("col_name"),col("col4").alias("col_value"), col("cnt")) > y.union(z).show() > {code} > And i obtained the following results: > ||col1||col2||col_name||col_value||cnt|| > |1|1|col3|5|1| > |1|1|col3|4|2| > |1|1|col4|5|1| > |1|1|col4|4|2| > Expected results: > ||col1||col2||col_name||col_value||cnt|| > |1|1|col3|3|1| > |1|1|col3|2|2| > |1|1|col4|4|2| > |1|1|col4|5|1| > But when i remove the last row of the table, i obtain the correct results. > {code:java} > (2,2,2,null){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
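The comment does not spell out which workaround was used, but one common mitigation for plans that go wrong under {{ReuseExchange}} is to disable exchange reuse and re-check the plan. A hedged sketch, assuming the {{y}} and {{z}} DataFrames from the example above:
{code:scala}
// Sketch only: the comment above does not say which workaround the author used.
// spark.sql.exchange.reuse is the (internal) flag behind the ReuseExchange rule.
spark.conf.set("spark.sql.exchange.reuse", "false")

y.union(z).explain(true) // ReusedExchange should no longer appear in the plan
y.union(z).show()
{code}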
[jira] [Created] (SPARK-30625) Add `escapeChar` parameter to the `like` function
Maxim Gekk created SPARK-30625: -- Summary: Add `escapeChar` parameter to the `like` function Key: SPARK-30625 URL: https://issues.apache.org/jira/browse/SPARK-30625 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk SPARK-28083 supported LIKE ... ESCAPE syntax {code:sql} spark-sql> SELECT '_Apache Spark_' like '__%Spark__' escape '_'; true {code} but the `like` function can accept only 2 parameters. If we pass the third one, it fails with: {code:sql} spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_'); Error in query: Invalid number of arguments for function like. Expected: 2; Found: 3; line 1 pos 7 {code} The ticket aims to support the third parameter in `like` as `escapeChar`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables
[ https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022343#comment-17022343 ] Burak Yavuz commented on SPARK-30612: - SPARK-30314 should help make this work easier > can't resolve qualified column name with v2 tables > -- > > Key: SPARK-30612 > URL: https://issues.apache.org/jira/browse/SPARK-30612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > When running queries with qualified columns like `SELECT t.a FROM t`, it > fails to resolve for v2 tables. > v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We > should do the same for v2 tables. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29206) Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads
[ https://issues.apache.org/jira/browse/SPARK-29206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022342#comment-17022342 ] Min Shen commented on SPARK-29206: -- With more investigation into the Netty side issues, we are addressing this with a different approach in https://issues.apache.org/jira/browse/SPARK-30512. > Number of shuffle Netty server threads should be a multiple of number of > chunk fetch handler threads > > > Key: SPARK-29206 > URL: https://issues.apache.org/jira/browse/SPARK-29206 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Min Shen >Priority: Major > > In SPARK-24355, we proposed to use a separate chunk fetch handler thread pool > to handle the slow-to-process chunk fetch requests in order to improve the > responsiveness of shuffle service for RPC requests. > Initially, we thought by making the number of Netty server threads larger > than the number of chunk fetch handler threads, it would reserve some threads > for RPC requests thus resolving the various RPC request timeout issues we > experienced previously. The solution worked in our cluster initially. > However, as the number of Spark applications in our cluster continues to > increase, we saw the RPC request (SASL authentication specifically) timeout > issue again: > {noformat} > java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout > waiting for task. > at > org.spark-project.guava.base.Throwables.propagate(Throwables.java:160) > at > org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278) > at > org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181) > at > org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141) > at > org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218) > {noformat} > After further investigation, we realized that as the number of concurrent > clients connecting to a shuffle service increases, it becomes _VERY_ > important to configure the number of Netty server threads and number of chunk > fetch handler threads correctly. Specifically, the number of Netty server > threads needs to be a multiple of the number of chunk fetch handler threads. > The reason is explained in details below: > When a channel is established on the Netty server, it is registered with both > the Netty server default EventLoopGroup and the chunk fetch handler > EventLoopGroup. Once registered, this channel sticks with a given thread in > both EventLoopGroups, i.e. all requests from this channel is going to be > handled by the same thread. Right now, Spark shuffle Netty server uses the > default Netty strategy to select a thread from a EventLoopGroup to be > associated with a new channel, which is simply round-robin (Netty's > DefaultEventExecutorChooserFactory). > In SPARK-24355, with the introduced chunk fetch handler thread pool, all > chunk fetch requests from a given channel will be first added to the task > queue of the chunk fetch handler thread associated with that channel. 
When > the requests get processed, the chunk fetch request handler thread will > submit a task to the task queue of the Netty server thread that's also > associated with this channel. If the number of Netty server threads is not a > multiple of the number of chunk fetch handler threads, it would become a > problem when the server has a large number of concurrent connections. > Assume we configure the number of Netty server threads as 40 and the > percentage of chunk fetch handler threads as 87, which leads to 35 chunk > fetch handler threads. Then according to the round-robin policy, channel 0, > 40, 80, 120, 160, 200, 240, and 280 will all be associated with the 1st Netty > server thread in the default EventLoopGroup. However, since the chunk fetch > handler thread pool only has 35 threads, out of these 8 channels, only > channel 0 and 280 will be associated with the same chunk fetch handler > thread. Thus, channel 0, 40, 80, 120, 160, 200, 240 will all be associated > with different chunk fetch handler threads but associated with the same Netty > server thread. This means, the 7 different chunk fetch handler threads > associated with these channels could potentially submit tasks to the task > queue of the same Netty server thread at
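The round-robin arithmetic in the explanation above is easy to check directly. With 40 server threads and 35 chunk fetch handler threads, the eight channels pinned to server thread 0 land on seven different handler threads:
{code:scala}
// Standalone illustration of the thread-assignment math (not Spark code).
// Round-robin: channel c -> server thread (c % nServer), handler thread (c % nHandler).
val nServer = 40
val nHandler = 35

(0 to 280 by 40).foreach { c =>
  println(s"channel $c -> server thread ${c % nServer}, handler thread ${c % nHandler}")
}
// Only channels 0 and 280 share a handler thread (thread 0); the remaining six
// channels sit on six distinct handler threads, all of which feed tasks back
// into the task queue of the same server thread.
{code}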
[jira] [Commented] (SPARK-30275) Add gitlab-ci.yml file for reproducible builds
[ https://issues.apache.org/jira/browse/SPARK-30275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022296#comment-17022296 ] Sean R. Owen commented on SPARK-30275: -- How is it different from just building the software normally? I get that maybe it pushes the buttons for you to run mvn package, but just weighing that against maintaining yet another integration. > Add gitlab-ci.yml file for reproducible builds > -- > > Key: SPARK-30275 > URL: https://issues.apache.org/jira/browse/SPARK-30275 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jim Kleckner >Priority: Minor > > It would be desirable to have public reproducible builds such as provided by > gitlab or others. > > Here is a candidate patch set to build spark using gitlab-ci: > * https://gitlab.com/jkleckner/spark/tree/add-gitlab-ci-yml > Let me know if there is interest in a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
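For readers unfamiliar with GitLab CI, the proposal amounts to checking in a pipeline definition that runs the normal Maven build on GitLab's runners. A hypothetical minimal version (the linked candidate branch may differ substantially):
{code}
# .gitlab-ci.yml -- hypothetical sketch, not the contents of the linked branch
image: maven:3.6-jdk-8

build:
  script:
    - ./build/mvn -DskipTests clean package
{code}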
[jira] [Assigned] (SPARK-28794) Document CREATE TABLE in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-28794: Assignee: pavithra ramachandran > Document CREATE TABLE in SQL Reference. > --- > > Key: SPARK-28794 > URL: https://issues.apache.org/jira/browse/SPARK-28794 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: pavithra ramachandran >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28794) Document CREATE TABLE in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-28794. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26759 [https://github.com/apache/spark/pull/26759] > Document CREATE TABLE in SQL Reference. > --- > > Key: SPARK-28794 > URL: https://issues.apache.org/jira/browse/SPARK-28794 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: pavithra ramachandran >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28794) Document CREATE TABLE in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-28794: - Priority: Minor (was: Major) > Document CREATE TABLE in SQL Reference. > --- > > Key: SPARK-28794 > URL: https://issues.apache.org/jira/browse/SPARK-28794 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: pavithra ramachandran >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30620) avoid unnecessary serialization in AggregateExpression
[ https://issues.apache.org/jira/browse/SPARK-30620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30620. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27342 [https://github.com/apache/spark/pull/27342] > avoid unnecessary serialization in AggregateExpression > -- > > Key: SPARK-30620 > URL: https://issues.apache.org/jira/browse/SPARK-30620 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30556) Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext
[ https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30556: -- Affects Version/s: 2.3.4 > Copy sparkContext.localproperties to child thread > inSubqueryExec.executionContext > - > > Key: SPARK-30556 > URL: https://issues.apache.org/jira/browse/SPARK-30556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Local properties set via sparkContext are not available as TaskContext > properties when executing jobs and threadpools have idle threads which are > reused > Explanation: > When SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. > The threads inherit the {{localProperties}} from sparkContext as they are the > child threads. > These threads are controlled via the executionContext (thread pools). Each > Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. > Scenarios where the thread pool has threads which are idle and reused for a > subsequent new query, the thread local properties will not be inherited from > spark context (thread properties are inherited only on thread creation) hence > end up having old or no properties set. This will cause taskset properties to > be missing when properties are transferred by child thread via > {{sparkContext.runJob/submitJob}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30556) Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext
[ https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022273#comment-17022273 ] Dongjoon Hyun commented on SPARK-30556: --- Thank you for confirming, [~ajithshetty]. This is backported to branch-2.4 via https://github.com/apache/spark/pull/27340 > Copy sparkContext.localproperties to child thread > inSubqueryExec.executionContext > - > > Key: SPARK-30556 > URL: https://issues.apache.org/jira/browse/SPARK-30556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Local properties set via sparkContext are not available as TaskContext > properties when executing jobs and threadpools have idle threads which are > reused > Explanation: > When SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. > The threads inherit the {{localProperties}} from sparkContext as they are the > child threads. > These threads are controlled via the executionContext (thread pools). Each > Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. > Scenarios where the thread pool has threads which are idle and reused for a > subsequent new query, the thread local properties will not be inherited from > spark context (thread properties are inherited only on thread creation) hence > end up having old or no properties set. This will cause taskset properties to > be missing when properties are transferred by child thread via > {{sparkContext.runJob/submitJob}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30556) Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext
[ https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30556: -- Fix Version/s: 2.4.5 > Copy sparkContext.localproperties to child thread > inSubqueryExec.executionContext > - > > Key: SPARK-30556 > URL: https://issues.apache.org/jira/browse/SPARK-30556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Local properties set via sparkContext are not available as TaskContext > properties when executing jobs and threadpools have idle threads which are > reused > Explanation: > When SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. > The threads inherit the {{localProperties}} from sparkContext as they are the > child threads. > These threads are controlled via the executionContext (thread pools). Each > Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. > Scenarios where the thread pool has threads which are idle and reused for a > subsequent new query, the thread local properties will not be inherited from > spark context (thread properties are inherited only on thread creation) hence > end up having old or no properties set. This will cause taskset properties to > be missing when properties are transferred by child thread via > {{sparkContext.runJob/submitJob}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30601) Add a Google Maven Central as a primary repository
[ https://issues.apache.org/jira/browse/SPARK-30601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30601: -- Fix Version/s: 2.4.5 > Add a Google Maven Central as a primary repository > -- > > Key: SPARK-30601 > URL: https://issues.apache.org/jira/browse/SPARK-30601 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.5, 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > See > [http://apache-spark-developers-list.1001551.n3.nabble.com/Adding-Maven-Central-mirror-from-Google-to-the-build-td28728.html] > This Jira targets to switch the main repo to Google Maven Central. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30624) JDBCV2 with catalog functionalities
Wenchen Fan created SPARK-30624: --- Summary: JDBCV2 with catalog functionalities Key: SPARK-30624 URL: https://issues.apache.org/jira/browse/SPARK-30624 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30623) Spark external shuffle allow disable of separate event loop group
Thomas Graves created SPARK-30623: - Summary: Spark external shuffle allow disable of separate event loop group Key: SPARK-30623 URL: https://issues.apache.org/jira/browse/SPARK-30623 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 2.4.4, 3.0.0 Reporter: Thomas Graves In SPARK-24355 changes were made to add a separate event loop group for processing ChunkFetchRequests; this allows the other threads to handle regular connection requests when the configuration value is set. This however seems to have added some latency (see PR 22173 comments at the end). To help with this we could make sure the secondary event loop group isn't used when the configuration of spark.shuffle.server.chunkFetchHandlerThreadsPercent isn't explicitly set. This should result in getting the same behavior as before. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
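The proposed guard can be sketched in a few lines: treat the secondary event loop group as opt-in, keyed on whether the percentage config was explicitly set. Illustrative only, not the actual patch:
{code:scala}
// Illustrative sketch: fall back to the single event loop group
// (pre-SPARK-24355 behavior) unless the user explicitly opted in.
def useSeparateChunkFetchGroup(explicitlySetConfs: Map[String, String]): Boolean =
  explicitlySetConfs.contains("spark.shuffle.server.chunkFetchHandlerThreadsPercent")
{code}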
[jira] [Resolved] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS
[ https://issues.apache.org/jira/browse/SPARK-30557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-30557. -- Resolution: Won't Fix > Add public documentation for SPARK_SUBMIT_OPTS > -- > > Key: SPARK-30557 > URL: https://issues.apache.org/jira/browse/SPARK-30557 > Project: Spark > Issue Type: Improvement > Components: Deploy, Documentation >Affects Versions: 2.4.4 >Reporter: Nicholas Chammas >Priority: Minor > > Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some > documentation. I cannot see it documented > [anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS] > in the docs. > How do you use it? What is it useful for? What's an example usage? etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30622) commands should return dummy statistics
Wenchen Fan created SPARK-30622: --- Summary: commands should return dummy statistics Key: SPARK-30622 URL: https://issues.apache.org/jira/browse/SPARK-30622 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30605) move defaultNamespace from SupportsNamespace to CatalogPlugin
[ https://issues.apache.org/jira/browse/SPARK-30605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30605. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27319 [https://github.com/apache/spark/pull/27319] > move defaultNamespace from SupportsNamespace to CatalogPlugin > - > > Key: SPARK-30605 > URL: https://issues.apache.org/jira/browse/SPARK-30605 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30621) Dynamic Pruning thread propagates the localProperties to task
Ajith S created SPARK-30621: --- Summary: Dynamic Pruning thread propagates the localProperties to task Key: SPARK-30621 URL: https://issues.apache.org/jira/browse/SPARK-30621 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Ajith S Local properties set via sparkContext are not available as TaskContext properties when executing parallel jobs and threadpools have idle threads Explanation: When executing parallel jobs via SubqueryBroadcastExec, the {{relationFuture}} is evaluated via a separate thread. The threads inherit the {{localProperties}} from sparkContext as they are the child threads. These threads are controlled via the executionContext (thread pools). Each Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. Scenarios where the thread pool has threads which are idle and reused for a subsequent new query, the thread local properties will not be inherited from spark context (thread properties are inherited only on thread creation) hence end up having old or no properties set. This will cause taskset properties to be missing when properties are transferred by child thread -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
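The JVM behavior behind both this ticket and SPARK-30556 is easy to reproduce outside Spark: an {{InheritableThreadLocal}} is copied only at thread creation, so a pooled worker that outlives its first task keeps stale values. A self-contained demonstration:
{code:scala}
import java.util.concurrent.Executors

val prop = new InheritableThreadLocal[String]
val pool = Executors.newFixedThreadPool(1) // a single, reusable worker thread

prop.set("query-1")
// The worker thread is created on the first submit, so it inherits "query-1".
pool.submit(new Runnable {
  def run(): Unit = println(s"first task sees: ${prop.get}") // prints query-1
}).get()

prop.set("query-2") // updated in the parent thread only
// The pooled thread is reused, not recreated, so it still sees the old value.
pool.submit(new Runnable {
  def run(): Unit = println(s"second task sees: ${prop.get}") // still query-1
}).get()

pool.shutdown()
{code}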
[jira] [Created] (SPARK-30620) avoid unnecessary serialization in AggregateExpression
Wenchen Fan created SPARK-30620: --- Summary: avoid unnecessary serialization in AggregateExpression Key: SPARK-30620 URL: https://issues.apache.org/jira/browse/SPARK-30620 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30592) Interval support for csv and json functions
[ https://issues.apache.org/jira/browse/SPARK-30592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30592: Description: to_json already supports intervals in Spark 2.4. To be consistent, we should support intervals in from_json, from_csv and to_csv as well. (was: to_csv from_csv to_json from_json) > Interval support for csv and json functions > --- > > Key: SPARK-30592 > URL: https://issues.apache.org/jira/browse/SPARK-30592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > to_json already supports intervals in Spark 2.4. To be consistent, we should > support intervals in from_json, from_csv and to_csv as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30546) Make interval type more future-proof
[ https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30546. - Fix Version/s: 3.0.0 Assignee: Kent Yao Resolution: Fixed > Make interval type more future-proof > > > Key: SPARK-30546 > URL: https://issues.apache.org/jira/browse/SPARK-30546 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > We've decided to not follow the SQL standard to define the interval type in > 3.0. We should try our best to hide intervals from data sources/external > catalogs as much as possible, to not leak internals to external systems. > In Spark 2.4, intervals are exposed in the following places: > 1. The `CalendarIntervalType` is public > 2. `Column.cast` accepts `CalendarIntervalType` and can cast string to > interval. > 3. `DataFrame.collect` can return `CalendarInterval` objects. > 4. UDF can take `CalendarInterval` as input. > 5. data sources can return InternalRow directly which may contain > `CalendarInterval`. > In Spark 3.0, we don't want to break Spark 2.4 applications, but we should > not expose intervals wider than 2.4. In general, we should avoid leaking > intervals to DS v2 and catalog plugins. We should also revert some > PostgreSQL specific interval features. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30546) Make interval type more future-proof
[ https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30546: Description: We've decided to not follow the SQL standard to define the interval type in 3.0. We should try our best to hide intervals from data sources/external catalogs as much as possible, to not leak internals to external systems. In Spark 2.4, intervals are exposed in the following places: 1. The `CalendarIntervalType` is public 2. `Column.cast` accepts `CalendarIntervalType` and can cast string to interval. 3. `DataFrame.collect` can return `CalendarInterval` objects. 4. UDF can take `CalendarInterval` as input. 5. data sources can return InternalRow directly which may contain `CalendarInterval`. In Spark 3.0, we don't want to break Spark 2.4 applications, but we should not expose intervals wider than 2.4. In general, we should avoid leaking intervals to DS v2 and catalog plugins. We should also revert some PostgreSQL specific interval features. was: Before 3.0 we may make some efforts for the current interval type to make it more future-proofing. e.g. 1. add unstable annotation to the CalendarInterval class. People already use it as UDF inputs so it’s better to make it clear it’s unstable. 2. Add a schema checker to prohibit create v2 custom catalog table with intervals, as same as what we do for the builtin catalog 3. Add a schema checker for DataFrameWriterV2 too 4. Make the interval type incomparable as version 2.4 for disambiguation of comparison between year-month and day-time fields 5. The 3.0 newly added to_csv should not support output intervals as same as using CSV file format or make it fully support as normal strings 6. The function to_json should not allow using interval as a key field as same as the value field and JSON datasource, with a legacy config to restore or make it fully support as normal strings 7. Revert interval ISO/ANSI SQL Standard output since we decide not to follow ANSI, so there is no round trip. > Make interval type more future-proof > > > Key: SPARK-30546 > URL: https://issues.apache.org/jira/browse/SPARK-30546 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > We've decided to not follow the SQL standard to define the interval type in > 3.0. We should try our best to hide intervals from data sources/external > catalogs as much as possible, to not leak internals to external systems. > In Spark 2.4, intervals are exposed in the following places: > 1. The `CalendarIntervalType` is public > 2. `Column.cast` accepts `CalendarIntervalType` and can cast string to > interval. > 3. `DataFrame.collect` can return `CalendarInterval` objects. > 4. UDF can take `CalendarInterval` as input. > 5. data sources can return InternalRow directly which may contain > `CalendarInterval`. > In Spark 3.0, we don't want to break Spark 2.4 applications, but we should > not expose intervals wider than 2.4. In general, we should avoid leaking > intervals to DS v2 and catalog plugins. We should also revert some > PostgreSQL specific interval features. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30546) Make interval type more future-proof
[ https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30546: Summary: Make interval type more future-proof (was: Make interval type more future-proofing) > Make interval type more future-proof > > > Key: SPARK-30546 > URL: https://issues.apache.org/jira/browse/SPARK-30546 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > Before 3.0 we may make some efforts for the current interval type to make it > more future-proofing. e.g. > 1. add unstable annotation to the CalendarInterval class. People already use > it as UDF inputs so it’s better to make it clear it’s unstable. > 2. Add a schema checker to prohibit create v2 custom catalog table with > intervals, as same as what we do for the builtin catalog > 3. Add a schema checker for DataFrameWriterV2 too > 4. Make the interval type incomparable as version 2.4 for disambiguation of > comparison between year-month and day-time fields > 5. The 3.0 newly added to_csv should not support output intervals as same as > using CSV file format or make it fully support as normal strings > 6. The function to_json should not allow using interval as a key field as > same as the value field and JSON datasource, with a legacy config to > restore or make it fully support as normal strings > 7. Revert interval ISO/ANSI SQL Standard output since we decide not to > follow ANSI, so there is no round trip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30556) Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext
[ https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021937#comment-17021937 ] Ajith S commented on SPARK-30556: - Yes, it exists in lower versions like 2.3.x too > Copy sparkContext.localproperties to child thread > inSubqueryExec.executionContext > - > > Key: SPARK-30556 > URL: https://issues.apache.org/jira/browse/SPARK-30556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 3.0.0 > > > Local properties set via sparkContext are not available as TaskContext > properties when executing jobs and threadpools have idle threads which are > reused > Explanation: > When SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. > The threads inherit the {{localProperties}} from sparkContext as they are the > child threads. > These threads are controlled via the executionContext (thread pools). Each > Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. > Scenarios where the thread pool has threads which are idle and reused for a > subsequent new query, the thread local properties will not be inherited from > spark context (thread properties are inherited only on thread creation) hence > end up having old or no properties set. This will cause taskset properties to > be missing when properties are transferred by child thread via > {{sparkContext.runJob/submitJob}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30556) Copy sparkContext.localProperties to child thread in SubqueryExec.executionContext
[ https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021936#comment-17021936 ] Ajith S commented on SPARK-30556: - Raised a backport PR for branch 2.4: [https://github.com/apache/spark/pull/27340] > Copy sparkContext.localProperties to child thread > in SubqueryExec.executionContext > - > > Key: SPARK-30556 > URL: https://issues.apache.org/jira/browse/SPARK-30556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 3.0.0 > > > Local properties set via sparkContext are not available as TaskContext > properties when executing jobs while the thread pools have idle threads that are > reused > Explanation: > In SubqueryExec, the {{relationFuture}} is evaluated on a separate thread. > The threads inherit the {{localProperties}} from sparkContext because they are > child threads. > These threads are controlled via the executionContext (thread pools). Each > thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. > In scenarios where the thread pool has idle threads that are reused for a > subsequent new query, the thread-local properties will not be inherited from the > spark context (thread properties are inherited only on thread creation), so the threads > end up with old or no properties set. This causes task-set properties to > be missing when properties are transferred by the child thread via > {{sparkContext.runJob/submitJob}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile
Abhishek Rao created SPARK-30619: Summary: org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile Key: SPARK-30619 URL: https://issues.apache.org/jira/browse/SPARK-30619 Project: Spark Issue Type: Question Components: Build Affects Versions: 2.4.4, 2.4.2 Environment: Spark on Kubernetes Reporter: Abhishek Rao We're using spark-2.4.4-bin-without-hadoop.tgz and executing the Java word count example (org.apache.spark.examples.JavaWordCount) on local files, but we're seeing that it expects the org.slf4j.Logger and org.apache.commons.collections classes to be available at runtime. We expected the binary to work as-is for local files. Is there anything we're missing? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29175) Make maven central repository in IsolatedClientLoader configurable
[ https://issues.apache.org/jira/browse/SPARK-29175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021909#comment-17021909 ] Yuanjian Li commented on SPARK-29175: - Thanks for the review; the config name is changed in the follow-up: [https://github.com/apache/spark/pull/27339]. > Make maven central repository in IsolatedClientLoader configurable > -- > > Key: SPARK-29175 > URL: https://issues.apache.org/jira/browse/SPARK-29175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > We need to connect to a central repository in IsolatedClientLoader for > downloading Hive jars. Here we added a new config > `spark.sql.additionalRemoteRepositories`, a comma-delimited string config of > optional additional remote Maven mirror repositories; it can be used to supply > additional remote repositories alongside the default Maven central repo. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
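A hedged sketch of how the config described above might be used (the key is the one named in this ticket before the follow-up rename, and the mirror URL is a made-up placeholder, not a real repository):
{quote}import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-client-extra-repo")
  // Comma-delimited list of extra Maven repositories consulted when IsolatedClientLoader
  // downloads Hive jars; the URL below is hypothetical.
  .config("spark.sql.additionalRemoteRepositories", "https://repo.example.com/maven2")
  .enableHiveSupport()
  .getOrCreate()
{quote}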
[jira] [Commented] (SPARK-30617) Is it possible that Spark no longer restricts the enumerated values of spark.sql.catalogImplementation
[ https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021899#comment-17021899 ] weiwenda commented on SPARK-30617: -- One possible solution is at [https://github.com/apache/spark/pull/27338] > Is it possible that Spark no longer restricts the enumerated values of > spark.sql.catalogImplementation > -- > > Key: SPARK-30617 > URL: https://issues.apache.org/jira/browse/SPARK-30617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: weiwenda >Priority: Minor > Fix For: 3.1.0, 2.4.6 > > > # We have implemented a fairly complex ExternalCatalog that retrieves metadata from multiple > heterogeneous databases (such as Elasticsearch and PostgreSQL), so that we can run mixed queries > across Hive and our online data. > # But as Spark requires the value of spark.sql.catalogImplementation to be > either in-memory or hive, we have to modify SparkSession and rebuild Spark to make our project work. > # Finally, we hope Spark removes the above restriction, so that it will be > much easier for us to keep pace with new Spark versions. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30618) Why does Spark SQL allow `WHERE` to be a table alias?
Chunjun Xiao created SPARK-30618: Summary: Why does Spark SQL allow `WHERE` to be a table alias? Key: SPARK-30618 URL: https://issues.apache.org/jira/browse/SPARK-30618 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.4.4 Reporter: Chunjun Xiao A `WHERE` with an empty expression is valid in Spark SQL, as in `SELECT * FROM XXX WHERE`. Here `WHERE` is parsed as the table alias. I think this surprises most SQL users, as it is an invalid statement in some SQL engines like MySQL. I checked the source code and found that many more keywords (reserved in most SQL systems) are treated as `nonReserved` and allowed to be table aliases. Could anyone please explain the rationale behind this decision? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
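A minimal sketch of the surprising parse described above, using range(3) as a stand-in for the report's elided XXX table; per the report, the trailing keyword is swallowed as a table alias instead of raising a parse error:
{quote}// Per the report, this parses: WHERE becomes the table alias, not a filter clause.
spark.sql("SELECT * FROM range(3) WHERE").show()
// Roughly the same statement with the alias written explicitly:
spark.sql("SELECT * FROM range(3) AS `WHERE`").show()
{quote}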
[jira] [Created] (SPARK-30617) Is it possible that Spark no longer restricts the enumerated values of spark.sql.catalogImplementation
weiwenda created SPARK-30617: Summary: Is it possible that Spark no longer restricts the enumerated values of spark.sql.catalogImplementation Key: SPARK-30617 URL: https://issues.apache.org/jira/browse/SPARK-30617 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: weiwenda Fix For: 3.1.0, 2.4.6 # We have implemented a fairly complex ExternalCatalog that retrieves metadata from multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so that we can run mixed queries across Hive and our online data. # But as Spark requires the value of spark.sql.catalogImplementation to be either in-memory or hive, we have to modify SparkSession and rebuild Spark to make our project work. # Finally, we hope Spark removes the above restriction, so that it will be much easier for us to keep pace with new Spark versions. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
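A hedged sketch of the restriction being discussed: only the two built-in values are accepted for this config today, so a third-party ExternalCatalog (the class name below is hypothetical) cannot be plugged in this way without patching Spark:
{quote}import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Accepted today: "in-memory" or "hive"; any other value fails the config check.
  .config("spark.sql.catalogImplementation", "hive")
  // What the ticket asks for (hypothetical): a fully qualified custom ExternalCatalog class.
  // .config("spark.sql.catalogImplementation", "com.example.MultiSourceExternalCatalog")
  .getOrCreate()
{quote}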
[jira] [Resolved] (SPARK-30543) RandomForest add Param bootstrap to control sampling method
[ https://issues.apache.org/jira/browse/SPARK-30543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30543. -- Resolution: Resolved > RandomForest add Param bootstrap to control sampling method > --- > > Key: SPARK-30543 > URL: https://issues.apache.org/jira/browse/SPARK-30543 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > Currently, RF with numTrees=1 will directly build a tree using the original > dataset, while with numTrees>1 it will use bootstrap samples to build trees. > This design allows a DecisionTreeModel to be trained via the RandomForest > implementation; however, it is somewhat strange. > In scikit-learn, there is a param bootstrap to control whether bootstrap samples are > used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
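A hedged sketch of the param this ticket adds (the setter name setBootstrap is inferred from the param name and is an assumption; check the merged API before relying on it):
{quote}import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setNumTrees(10)
  // Proposed param: when false, every tree is trained on the original dataset instead of
  // a bootstrap sample, mirroring scikit-learn's bootstrap flag.
  .setBootstrap(false)
{quote}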
[jira] [Assigned] (SPARK-30543) RandomForest add Param bootstrap to control sampling method
[ https://issues.apache.org/jira/browse/SPARK-30543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-30543: Assignee: zhengruifeng > RandomForest add Param bootstrap to control sampling method > --- > > Key: SPARK-30543 > URL: https://issues.apache.org/jira/browse/SPARK-30543 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > Currently, RF with numTrees=1 will directly build a tree using the original > dataset, while with numTrees>1 it will use bootstrap samples to build trees. > This design allows a DecisionTreeModel to be trained via the RandomForest > implementation; however, it is somewhat strange. > In scikit-learn, there is a param bootstrap to control whether bootstrap samples are > used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org