[jira] [Commented] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459690#comment-17459690 ] Apache Spark commented on SPARK-37651: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/34906 > Use existing active Spark session in all places of pandas API on Spark > -- > > Key: SPARK-37651 > URL: https://issues.apache.org/jira/browse/SPARK-37651 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Use existing active Spark session instead of SparkSession.getOrCreate in all > places in pandas API on Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37651: Assignee: (was: Apache Spark) > Use existing active Spark session in all places of pandas API on Spark > -- > > Key: SPARK-37651 > URL: https://issues.apache.org/jira/browse/SPARK-37651 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Use existing active Spark session instead of SparkSession.getOrCreate in all > places in pandas API on Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459688#comment-17459688 ] Apache Spark commented on SPARK-37651: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/34906 > Use existing active Spark session in all places of pandas API on Spark > -- > > Key: SPARK-37651 > URL: https://issues.apache.org/jira/browse/SPARK-37651 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Use existing active Spark session instead of SparkSession.getOrCreate in all > places in pandas API on Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37651: Assignee: Apache Spark > Use existing active Spark session in all places of pandas API on Spark > -- > > Key: SPARK-37651 > URL: https://issues.apache.org/jira/browse/SPARK-37651 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Use existing active Spark session instead of SparkSession.getOrCreate in all > places in pandas API on Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32294) GroupedData Pandas UDF 2Gb limit
[ https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32294. -- Fix Version/s: 3.3.0 Resolution: Cannot Reproduce This is fixed in the master branch. I can't reproduce it anymore: {code}
>>> df = spark.range(1024 * 1024 * 1024 * 1).selectExpr("1 as a", "1 as b", "1 as c").coalesce(1)  # More than 2GB to hit ARROW-4890.
>>> df.groupby("a").applyInPandas(lambda pdf: pdf, schema=df.schema).count()
1073741824
{code} > GroupedData Pandas UDF 2Gb limit > > > Key: SPARK-32294 > URL: https://issues.apache.org/jira/browse/SPARK-32294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Ruslan Dautkhanov >Priority: Major > Fix For: 3.3.0 > > > `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData; the whole group is passed to the Pandas UDF at once, which can cause various 2Gb limitations on the Arrow side (and, in current versions of Arrow, also a 2Gb limitation on the Netty allocator side) - https://issues.apache.org/jira/browse/ARROW-4890 > It would be great to consider feeding GroupedData into a pandas UDF in batches to solve this issue. > cc [~hyukjin.kwon] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
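The batching idea proposed in the SPARK-32294 description — feeding a grouped pandas UDF its group in bounded chunks instead of all at once — can be sketched with plain Python. The `iter_batches` helper below is hypothetical and only models the idea; Spark's actual fix lives in its Arrow serialization layer, not in user code:

```python
def iter_batches(rows, max_records_per_batch):
    """Yield a group's rows in chunks no larger than the configured
    spark.sql.execution.arrow.maxRecordsPerBatch, so no single Arrow
    batch can grow past the 2 GB limits described in ARROW-4890."""
    for start in range(0, len(rows), max_records_per_batch):
        yield rows[start:start + max_records_per_batch]

group = list(range(10))
batches = list(iter_batches(group, 4))
assert batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
assert sum(len(b) for b in batches) == len(group)  # nothing lost or duplicated
```

The key property is that every batch is bounded while the concatenation of batches is exactly the original group, so the UDF sees the same data.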
[jira] [Created] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark
Xinrong Meng created SPARK-37651: Summary: Use existing active Spark session in all places of pandas API on Spark Key: SPARK-37651 URL: https://issues.apache.org/jira/browse/SPARK-37651 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Xinrong Meng Use existing active Spark session instead of SparkSession.getOrCreate in all places in pandas API on Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
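The change described above — preferring an already-active session over unconditionally calling `SparkSession.getOrCreate` — can be sketched with a stdlib-only model. `MockSession` and `default_session` are hypothetical stand-ins for illustration; PySpark's real `SparkSession` API differs:

```python
# Minimal model of "use the active session if one exists".
class MockSession:
    _active = None

    def __init__(self, conf=None):
        self.conf = dict(conf or {})
        MockSession._active = self

    @classmethod
    def get_active_session(cls):
        return cls._active

    @classmethod
    def get_or_create(cls, conf=None):
        # getOrCreate silently reuses the active session, so configuration
        # passed here may not take effect -- the warning noise SPARK-37638
        # complains about.
        return cls._active if cls._active is not None else cls(conf)


def default_session():
    # The preferred lookup: take the active session as-is and only fall
    # back to creating one when none exists.
    s = MockSession.get_active_session()
    return s if s is not None else MockSession()

first = MockSession({"spark.sql.shuffle.partitions": "8"})
assert default_session() is first          # active session reused, no rebuild
assert MockSession.get_or_create() is first
```

The point of the sketch is that `default_session` never re-runs builder logic when a session is already active, which is what avoids the repeated "Using an existing SparkSession" warnings.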
[jira] [Comment Edited] (SPARK-37639) spark history server clean event log directory with out check status file
[ https://issues.apache.org/jira/browse/SPARK-37639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459673#comment-17459673 ] muhong edited comment on SPARK-37639 at 12/15/21, 6:31 AM: --- To solve the problem, I modified the logic in EventLogFileWriters: when rollEventLogFile is invoked, first check whether the status file exists and recreate it if it does not; this works. {code:java}
/** exposed for testing only */
private[history] def rollEventLogFile(): Unit = {
  // check whether the status file exists; if not, recreate it
  val isExist = checkAppStatusFileExist(true)
  if (!isExist) {
    createAppStatusFile(inProgress = true)
  }
  closeWriter()
  index += 1
  currentEventLogFilePath = getEventLogFilePath(logDirForAppPath, appId, appAttemptId, index, compressionCodecName)
  initLogFile(currentEventLogFilePath) { os =>
    countingOutputStream = Some(new CountingOutputStream(os))
    new PrintWriter(
      new OutputStreamWriter(countingOutputStream.get, StandardCharsets.UTF_8))
  }
}

// new check method
private def checkAppStatusFileExist(inProgress: Boolean): Boolean = {
  val appStatusPath = getAppStatusFilePath(logDirForAppPath, appId, appAttemptId, inProgress)
  fileSystem.exists(appStatusPath)
}{code} I chose to modify this rather than change the logic in the history server because a long-running Spark Thrift Server might exit abnormally without updating the status file, yet its directory still needs to be deleted. During testing I found another problem: if the Thrift Server exits normally without writing any event log, the history server cannot delete that kind of directory, because it throws IllegalArgumentException("directory must have at least one event log file"). was (Author: m-sir): To solve the problem, I modified the logic in EventLogFileWriters: when rollEventLogFile is invoked, first check whether the status file exists and recreate it if it does not; this works.
{code:java}
/** exposed for testing only */
private[history] def rollEventLogFile(): Unit = {
  // check whether the status file exists; if not, recreate it
  closeWriter()
  index += 1
  currentEventLogFilePath = getEventLogFilePath(logDirForAppPath, appId, appAttemptId, index, compressionCodecName)
  initLogFile(currentEventLogFilePath) { os =>
    countingOutputStream = Some(new CountingOutputStream(os))
    new PrintWriter(
      new OutputStreamWriter(countingOutputStream.get, StandardCharsets.UTF_8))
  }
}
{code} I chose to modify this rather than change the logic in the history server because a long-running Spark Thrift Server might exit abnormally without updating the status file, yet its directory still needs to be deleted. During testing I found another problem: if the Thrift Server exits normally without writing any event log, the history server cannot delete that kind of directory, because it throws IllegalArgumentException("directory must have at least one event log file"). > spark history server clean event log directory with out check status file > - > > Key: SPARK-37639 > URL: https://issues.apache.org/jira/browse/SPARK-37639 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: muhong >Priority: Major > > I found a problem: the Thrift Server creates the event log file (the .inprogress file is created at init), and the history server cleans application event log files according to size and modification time. So there is a potential problem in this situation: > *if the Thrift Server accepts no requests for a long time (longer than the time configured by spark.history.fs.cleaner.maxAge), the history server will clean the application log directory together with the inprogress file; after the cleanup, the Thrift Server may accept many requests and generate a new event log directory without an inprogress status file, and that directory will never be cleaned by the history server because it does not contain a status file.
This will lead to a space leak* > I think whenever a new log file is created, we need to check whether the status file exists, and create it if not. > Finally, I think an extra function is needed: like log4j, the compact file still needs to be cleaned after a period (configurable by the user), so that a long-running Spark service like the Thrift Server can have its event log space limited to a configured size -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
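The guard described in the comment — recreate the in-progress status marker if the cleaner removed it before rolling to the next event log file — can be sketched with the stdlib. The function and file names (`roll_event_log`, `appstatus_*.inprogress`, `events_*`) are hypothetical and only approximate Spark's rolling event log layout:

```python
import os
import tempfile

def roll_event_log(log_dir, app_id, index):
    """Before rolling to the next event log file, make sure the
    in-progress status marker still exists; the history server may
    have cleaned the directory's contents in the meantime."""
    status = os.path.join(log_dir, f"appstatus_{app_id}.inprogress")
    if not os.path.exists(status):   # the check the comment proposes
        open(status, "w").close()    # recreate the marker
    path = os.path.join(log_dir, f"events_{index}_{app_id}")
    open(path, "w").close()          # the newly rolled event log file
    return path

d = tempfile.mkdtemp()
roll_event_log(d, "app-1", 1)
os.remove(os.path.join(d, "appstatus_app-1.inprogress"))  # simulate cleaner
roll_event_log(d, "app-1", 2)
# The marker is back, so the directory can still be recognized and cleaned.
assert os.path.exists(os.path.join(d, "appstatus_app-1.inprogress"))
```

Without the re-check, the second roll would leave a directory containing only event files and no status marker, which is exactly the state the history server refuses to clean.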
[jira] [Commented] (SPARK-37639) spark history server clean event log directory with out check status file
[ https://issues.apache.org/jira/browse/SPARK-37639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459673#comment-17459673 ] muhong commented on SPARK-37639: To solve the problem, I modified the logic in EventLogFileWriters: when rollEventLogFile is invoked, first check whether the status file exists and recreate it if it does not; this works. {code:java}
/** exposed for testing only */
private[history] def rollEventLogFile(): Unit = {
  // check whether the status file exists; if not, recreate it
  closeWriter()
  index += 1
  currentEventLogFilePath = getEventLogFilePath(logDirForAppPath, appId, appAttemptId, index, compressionCodecName)
  initLogFile(currentEventLogFilePath) { os =>
    countingOutputStream = Some(new CountingOutputStream(os))
    new PrintWriter(
      new OutputStreamWriter(countingOutputStream.get, StandardCharsets.UTF_8))
  }
}
{code} I chose to modify this rather than change the logic in the history server because a long-running Spark Thrift Server might exit abnormally without updating the status file, yet its directory still needs to be deleted. During testing I found another problem: if the Thrift Server exits normally without writing any event log, the history server cannot delete that kind of directory, because it throws IllegalArgumentException("directory must have at least one event log file"). > spark history server clean event log directory with out check status file > - > > Key: SPARK-37639 > URL: https://issues.apache.org/jira/browse/SPARK-37639 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: muhong >Priority: Major > > I found a problem: the Thrift Server creates the event log file (the .inprogress file is created at init), and the history server cleans application event log files according to size and modification time.
So there is a potential problem in this situation: > *if the Thrift Server accepts no requests for a long time (longer than the time configured by spark.history.fs.cleaner.maxAge), the history server will clean the application log directory together with the inprogress file; after the cleanup, the Thrift Server may accept many requests and generate a new event log directory without an inprogress status file, and that directory will never be cleaned by the history server because it does not contain a status file. This will lead to a space leak* > I think whenever a new log file is created, we need to check whether the status file exists, and create it if not. > Finally, I think an extra function is needed: like log4j, the compact file still needs to be cleaned after a period (configurable by the user), so that a long-running Spark service like the Thrift Server can have its event log space limited to a configured size -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-36168) Support Scala 2.13 in `dev/test-dependencies.sh`
[ https://issues.apache.org/jira/browse/SPARK-36168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-36168. - > Support Scala 2.13 in `dev/test-dependencies.sh` > > > Key: SPARK-36168 > URL: https://issues.apache.org/jira/browse/SPARK-36168 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37628) Upgrade Netty from 4.1.68 to 4.1.72
[ https://issues.apache.org/jira/browse/SPARK-37628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-37628: - Description: After SPARK-36729, Netty released four new versions: * [https://netty.io/news/2021/10/11/4-1-69-Final.html] * [https://netty.io/news/2021/10/11/4-1-70-Final.html] * [https://netty.io/news/2021/12/09/4-1-71-Final.html] * [https://netty.io/news/2021/12/13/4-1-72-Final.html] We can avoid some potential bugs by upgrading to 4.1.72, and 4.1.72 also fixes a security issue related to log4j2 in Netty was: After SPARK-36729, Netty released four new versions: * [https://netty.io/news/2021/10/11/4-1-69-Final.html] * [https://netty.io/news/2021/10/11/4-1-70-Final.html] * [https://netty.io/news/2021/12/09/4-1-71-Final.html] * [https://netty.io/news/2021/12/13/4-1-72-Final.html] We can avoid some potential bugs by upgrading to 4.1.72 > Upgrade Netty from 4.1.68 to 4.1.72 > --- > > Key: SPARK-37628 > URL: https://issues.apache.org/jira/browse/SPARK-37628 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > After SPARK-36729, Netty released four new versions: > * [https://netty.io/news/2021/10/11/4-1-69-Final.html] > * [https://netty.io/news/2021/10/11/4-1-70-Final.html] > * [https://netty.io/news/2021/12/09/4-1-71-Final.html] > * [https://netty.io/news/2021/12/13/4-1-72-Final.html] > We can avoid some potential bugs by upgrading to 4.1.72, and 4.1.72 also fixes a security issue related to log4j2 in Netty > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37628) Upgrade Netty from 4.1.68 to 4.1.72
[ https://issues.apache.org/jira/browse/SPARK-37628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-37628: - Description: After SPARK-36729, Netty released four new versions: * [https://netty.io/news/2021/10/11/4-1-69-Final.html] * [https://netty.io/news/2021/10/11/4-1-70-Final.html] * [https://netty.io/news/2021/12/09/4-1-71-Final.html] * [https://netty.io/news/2021/12/13/4-1-72-Final.html] We can avoid some potential bugs by upgrading to 4.1.72 was: After SPARK-36729, Netty released 3 new versions: * [https://netty.io/news/2021/10/11/4-1-69-Final.html] * [https://netty.io/news/2021/10/11/4-1-70-Final.html] * [https://netty.io/news/2021/12/09/4-1-71-Final.html] We can avoid some potential bugs by upgrading to 4.1.71 > Upgrade Netty from 4.1.68 to 4.1.72 > --- > > Key: SPARK-37628 > URL: https://issues.apache.org/jira/browse/SPARK-37628 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > After SPARK-36729, Netty released four new versions: > * [https://netty.io/news/2021/10/11/4-1-69-Final.html] > * [https://netty.io/news/2021/10/11/4-1-70-Final.html] > * [https://netty.io/news/2021/12/09/4-1-71-Final.html] > * [https://netty.io/news/2021/12/13/4-1-72-Final.html] > We can avoid some potential bugs by upgrading to 4.1.72 > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37628) Upgrade Netty from 4.1.68 to 4.1.72
[ https://issues.apache.org/jira/browse/SPARK-37628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-37628: - Summary: Upgrade Netty from 4.1.68 to 4.1.72 (was: Upgrade Netty from 4.1.68 to 4.1.71) > Upgrade Netty from 4.1.68 to 4.1.72 > --- > > Key: SPARK-37628 > URL: https://issues.apache.org/jira/browse/SPARK-37628 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > After SPARK-36729, netty released 3 new versions: > * [https://netty.io/news/2021/10/11/4-1-69-Final.html] > * [https://netty.io/news/2021/10/11/4-1-70-Final.html] > * [https://netty.io/news/2021/12/09/4-1-71-Final.html] > We can avoid some potential bugs by upgrading to 4.1.71 > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37575) null values should be saved as nothing rather than quoted empty Strings "" with default settings
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459647#comment-17459647 ] Apache Spark commented on SPARK-37575: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/34905 > null values should be saved as nothing rather than quoted empty Strings "" > with default settings > > > Key: SPARK-37575 > URL: https://issues.apache.org/jira/browse/SPARK-37575 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Wei Guo >Assignee: Wei Guo >Priority: Major > Fix For: 3.3.0 > > > As mentioned in the sql migration guide ([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), > {noformat} > Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 2.3 and earlier, empty strings are equal to null values and do not reflect to any characters in saved CSV files. For example, the row of "a", null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. To restore the previous behavior, set the CSV option emptyValue to empty (not quoted) string.{noformat} > > But actually, both empty strings and null values are saved as quoted empty strings "" rather than "" (for empty strings) and nothing (for null values). > code: > {code:java} > val data = List("spark", null, "").toDF("name") > data.coalesce(1).write.csv("spark_csv_test") > {code} > actual result: > {noformat} > line1: spark > line2: "" > line3: ""{noformat} > expected result: > {noformat} > line1: spark > line2: > line3: "" > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
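The expected behavior in the SPARK-37575 description — `null` written as nothing, the empty string written as a quoted `""` — can be modeled with a tiny stdlib sketch. The `serialize` helper is hypothetical; it emulates what Spark's CSV options `nullValue` (empty) and `emptyValue` (`""`) should produce, and is not Spark's actual writer:

```python
def serialize(value):
    """Render one CSV field the way the issue's 'expected result' shows:
    null -> nothing at all, empty string -> a quoted empty string."""
    if value is None:
        return ""        # null value: write nothing
    if value == "":
        return '""'      # empty string: keep the Spark 2.4+ quoting
    return value

rows = ["spark", None, ""]
lines = [serialize(v) for v in rows]
assert lines == ["spark", "", '""']  # matches the expected result in the issue
```

The bug report says Spark instead produces `""` for both of the last two rows, collapsing the null/empty-string distinction on write.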
[jira] (SPARK-37644) Support datasource v2 complete aggregate pushdown
[ https://issues.apache.org/jira/browse/SPARK-37644 ] jiaan.geng deleted comment on SPARK-37644: was (Author: beliefer): I'm working on it. > Support datasource v2 complete aggregate pushdown > -- > > Key: SPARK-37644 > URL: https://issues.apache.org/jira/browse/SPARK-37644 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports pushing down aggregates with a partial-agg and a final-agg. For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg by running the aggregate completely on the database. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37644) Support datasource v2 complete aggregate pushdown
[ https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459639#comment-17459639 ] Apache Spark commented on SPARK-37644: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/34904 > Support datasource v2 complete aggregate pushdown > -- > > Key: SPARK-37644 > URL: https://issues.apache.org/jira/browse/SPARK-37644 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports pushing down aggregates with a partial-agg and a final-agg. For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg by running the aggregate completely on the database. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37644) Support datasource v2 complete aggregate pushdown
[ https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37644: Assignee: (was: Apache Spark) > Support datasource v2 complete aggregate pushdown > -- > > Key: SPARK-37644 > URL: https://issues.apache.org/jira/browse/SPARK-37644 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports pushing down aggregates with a partial-agg and a final-agg. For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg by running the aggregate completely on the database. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37644) Support datasource v2 complete aggregate pushdown
[ https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37644: Assignee: Apache Spark > Support datasource v2 complete aggregate pushdown > -- > > Key: SPARK-37644 > URL: https://issues.apache.org/jira/browse/SPARK-37644 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Currently, Spark supports pushing down aggregates with a partial-agg and a final-agg. For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg by running the aggregate completely on the database. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37644) Support datasource v2 complete aggregate pushdown
[ https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459638#comment-17459638 ] Apache Spark commented on SPARK-37644: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/34904 > Support datasource v2 complete aggregate pushdown > -- > > Key: SPARK-37644 > URL: https://issues.apache.org/jira/browse/SPARK-37644 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports pushing down aggregates with a partial-agg and a final-agg. For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg by running the aggregate completely on the database. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
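The partial/final split that the SPARK-37644 description says complete pushdown avoids can be sketched in plain Python. The partition lists and the `jdbc_sum` stand-in are illustrative assumptions, not Spark internals:

```python
# Two-phase aggregation, the way Spark normally plans a SUM:
partitions = [[1, 2, 3], [4, 5], [6]]

partial = [sum(p) for p in partitions]   # partial-agg: one result per partition
final = sum(partial)                     # final-agg: merge the partial results

# Complete pushdown: the data source (e.g. a JDBC database) computes the
# whole aggregate itself -- think "SELECT SUM(x) FROM t" -- so Spark can
# skip both the partial and the final phase.
def jdbc_sum(rows):                      # hypothetical stand-in for the database
    return sum(rows)

pushed = jdbc_sum([x for p in partitions for x in p])
assert final == pushed == 21             # both plans agree on the answer
```

The sketch shows why the optimization is safe for sources that can evaluate the whole aggregate: the two plans are algebraically equivalent, but the pushed-down plan moves no per-row data into Spark.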
[jira] [Updated] (SPARK-37575) null values should be saved as nothing rather than quoted empty Strings "" with default settings
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37575: - Fix Version/s: (was: 3.2.1) > null values should be saved as nothing rather than quoted empty Strings "" > with default settings > > > Key: SPARK-37575 > URL: https://issues.apache.org/jira/browse/SPARK-37575 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Wei Guo >Assignee: Wei Guo >Priority: Major > Fix For: 3.3.0 > > > As mentioned in the sql migration guide ([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), > {noformat} > Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 2.3 and earlier, empty strings are equal to null values and do not reflect to any characters in saved CSV files. For example, the row of "a", null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. To restore the previous behavior, set the CSV option emptyValue to empty (not quoted) string.{noformat} > > But actually, both empty strings and null values are saved as quoted empty strings "" rather than "" (for empty strings) and nothing (for null values). > code: > {code:java} > val data = List("spark", null, "").toDF("name") > data.coalesce(1).write.csv("spark_csv_test") > {code} > actual result: > {noformat} > line1: spark > line2: "" > line3: ""{noformat} > expected result: > {noformat} > line1: spark > line2: > line3: "" > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function
[ https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37646. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34901 [https://github.com/apache/spark/pull/34901] > Avoid touching Scala reflection APIs in the lit function > > > Key: SPARK-37646 > URL: https://issues.apache.org/jira/browse/SPARK-37646 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 3.3.0 > > > Currently, lit is slow when concurrency is high, as it needs to hit the Scala reflection code, which takes global locks. For example, running the following test locally using Spark 3.2 shows the difference:
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
>
> val parallelism = 50
>
> def testLiteral(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           new Column(Literal(0L))
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
>
> def testLit(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           lit(0L)
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
>
> println("warmup")
> testLiteral()
> testLit()
>
> println("lit: false")
> spark.time {
>   testLiteral()
> }
> println("lit: true")
> spark.time {
>   testLit()
> }
>
> // Exiting paste mode, now interpreting.
>
> warmup
> lit: false
> Time taken: 8 ms
> lit: true
> Time taken: 682 ms
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> parallelism: Int = 50
> testLiteral: ()Unit
> testLit: ()Unit {code} > -- This message was sent by
Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
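The contention pattern behind SPARK-37646 — every literal construction funneling through lock-protected runtime reflection — can be modeled in plain Python. Both functions below are hypothetical sketches of the idea, not Spark's implementation: the fast path dispatches common value types through a plain lookup table instead of taking the global lock:

```python
import threading

_reflect_lock = threading.Lock()

def literal_via_reflection(value):
    # Models the slow path: every call serializes on a global lock,
    # the way Scala runtime reflection does under concurrency.
    with _reflect_lock:
        return ("literal", type(value).__name__, value)

# Known primitive types handled without any reflection or locking.
_KNOWN = {bool: "boolean", int: "long", float: "double", str: "string"}

def literal_fast(value):
    # Models the fix: match common types directly; fall back to the
    # reflective path only for types the table does not cover.
    kind = _KNOWN.get(type(value))
    if kind is None:
        return literal_via_reflection(value)
    return ("literal", kind, value)

assert literal_fast(0) == ("literal", "long", 0)
assert literal_fast(1.5) == ("literal", "double", 1.5)
assert literal_fast("x") == ("literal", "string", "x")
```

Under 50 threads calling `literal_fast`, no thread ever waits on `_reflect_lock` for the common types, which is the source of the 682 ms vs 8 ms gap the issue measures.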
[jira] (SPARK-37154) Inline type hints for python/pyspark/rdd.py
[ https://issues.apache.org/jira/browse/SPARK-37154 ] Byron Hsu deleted comment on SPARK-37154: --- was (Author: byronhsu): i am working on this > Inline type hints for python/pyspark/rdd.py > --- > > Key: SPARK-37154 > URL: https://issues.apache.org/jira/browse/SPARK-37154 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37563) Implement days, seconds, microseconds properties of TimedeltaIndex
[ https://issues.apache.org/jira/browse/SPARK-37563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37563. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34825 [https://github.com/apache/spark/pull/34825] > Implement days, seconds, microseconds properties of TimedeltaIndex > -- > > Key: SPARK-37563 > URL: https://issues.apache.org/jira/browse/SPARK-37563 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Implement days, seconds, microseconds properties of TimedeltaIndex -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37563) Implement days, seconds, microseconds properties of TimedeltaIndex
[ https://issues.apache.org/jira/browse/SPARK-37563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37563: Assignee: Xinrong Meng > Implement days, seconds, microseconds properties of TimedeltaIndex > -- > > Key: SPARK-37563 > URL: https://issues.apache.org/jira/browse/SPARK-37563 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Implement days, seconds, microseconds properties of TimedeltaIndex -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37638) Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37638. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34893 [https://github.com/apache/spark/pull/34893] > Use existing active Spark session instead of SparkSession.getOrCreate in > pandas API on Spark > > > Key: SPARK-37638 > URL: https://issues.apache.org/jira/browse/SPARK-37638 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.3.0 > > > {code} > >>> ps.range(10) > 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the > static sql configurations will not take effect. > 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some > spark core configurations may not take effect. > 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the > static sql configurations will not take effect. > 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some > spark core configurations may not take effect. > 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the > static sql configurations will not take effect. > 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some > spark core configurations may not take effect. > 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the > static sql configurations will not take effect. > 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some > spark core configurations may not take effect. > ... >id > 0 0 > 1 1 > 2 2 > 3 3 > 4 4 > 5 5 > 6 6 > 7 7 > 8 8 > 9 9 > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37649: Assignee: Hyukjin Kwon > Switch default index to distributed-sequence by default in pandas API on Spark > -- > > Key: SPARK-37649 > URL: https://issues.apache.org/jira/browse/SPARK-37649 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > pandas API on Spark currently sets {{compute.default_index_type}} to > {{sequence}} which relies on sending all data to one executor that easily > causes OOM. > We should better switch to {{distributed-sequence}} type that truly > distributes the data. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37649. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34902 [https://github.com/apache/spark/pull/34902] > Switch default index to distributed-sequence by default in pandas API on Spark > -- > > Key: SPARK-37649 > URL: https://issues.apache.org/jira/browse/SPARK-37649 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > pandas API on Spark currently sets {{compute.default_index_type}} to > {{sequence}} which relies on sending all data to one executor that easily > causes OOM. > We should better switch to {{distributed-sequence}} type that truly > distributes the data. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
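The distributed-sequence idea can be sketched in plain Python (a toy model under my own assumptions, not the actual Spark implementation): instead of collecting all rows to one executor as the `sequence` type does, only one small count per partition is exchanged, and each row's global index is its partition's starting offset plus its local position.

```python
from itertools import accumulate

def distributed_sequence_index(partitions):
    """Sketch of a distributed-sequence index: number rows globally
    using per-partition counts only, with no single-node collect."""
    counts = [len(p) for p in partitions]          # one small count per partition
    offsets = [0] + list(accumulate(counts))[:-1]  # exclusive prefix sums
    return [
        (offset + i, row)
        for offset, part in zip(offsets, partitions)
        for i, row in enumerate(part)
    ]

parts = [["a", "b"], ["c"], ["d", "e", "f"]]
indexed = distributed_sequence_index(parts)
assert [i for i, _ in indexed] == [0, 1, 2, 3, 4, 5]
```

The indexing work stays local to each partition; only the tiny `counts` list crosses partition boundaries, which is why this default avoids the OOM risk described above.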
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459607#comment-17459607 ] Hyukjin Kwon commented on SPARK-24853: -- Oh I meant that it will allow something like: {{withColumnRenamed(from_json(col("a")), col("another_col"))}} where {{from_json(col("a"))}} is not actually an existing column. > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459601#comment-17459601 ] Nicholas Chammas commented on SPARK-24853: -- Assuming we are talking about the example I provided: Yes, {{col("count")}} would still be ambiguous. I don't know if Spark would know to catch that problem. But note that the current behavior of {{.withColumnRenamed('count', ...)}} renames all columns named "count", which is just incorrect. So allowing {{col("count")}} will either be just as incorrect as the current behavior, or it will be an improvement in that Spark may complain that the column reference is ambiguous. I'd have to try it to confirm the behavior. Of course, the main improvement offered by {{Column}} references is that users can do something like {{.withColumnRenamed(left_counts['count'], ...)}} and get the correct behavior. I didn't follow what you are getting at regarding {{{}from_json{}}}, but does that address your concern? > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. 
> > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
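The behavioral difference under discussion, a string-based rename touching every matching column versus a Column-like reference touching only the one it resolves to, can be sketched with a toy model (plain Python tuples standing in for resolved columns; none of this is the real pyspark API):

```python
# Toy model: after a join, columns are (origin, name) pairs and two inputs
# both expose a "count" column.
columns = [("left", "id"), ("left", "count"), ("right", "count")]

def rename_by_string(cols, existing, new):
    # Current string-based behavior: every column whose name matches is renamed.
    return [(o, new if n == existing else n) for o, n in cols]

def rename_by_reference(cols, ref, new):
    # With a Column-like reference, only the column it resolves to changes.
    return [(o, new if (o, n) == ref else n) for o, n in cols]

# String rename hits both "count" columns -- the incorrect behavior noted above.
assert rename_by_string(columns, "count", "left_count") == [
    ("left", "id"), ("left", "left_count"), ("right", "left_count")]

# A resolved reference renames only the intended column.
assert rename_by_reference(columns, ("left", "count"), "left_count") == [
    ("left", "id"), ("left", "left_count"), ("right", "count")]
```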
[jira] [Assigned] (SPARK-37629) speed up Expression.canonicalized
[ https://issues.apache.org/jira/browse/SPARK-37629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37629: Assignee: Wenchen Fan > speed up Expression.canonicalized > - > > Key: SPARK-37629 > URL: https://issues.apache.org/jira/browse/SPARK-37629 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37629) speed up Expression.canonicalized
[ https://issues.apache.org/jira/browse/SPARK-37629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37629. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34883 [https://github.com/apache/spark/pull/34883] > speed up Expression.canonicalized > - > > Key: SPARK-37629 > URL: https://issues.apache.org/jira/browse/SPARK-37629 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37650) Tell spark-env.sh the python interpreter
[ https://issues.apache.org/jira/browse/SPARK-37650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37650: Assignee: (was: Apache Spark) > Tell spark-env.sh the python interpreter > > > Key: SPARK-37650 > URL: https://issues.apache.org/jira/browse/SPARK-37650 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Andrew Malone Melo >Priority: Major > > When loading config defaults via spark-env.sh, it can be useful to know > the current pyspark python interpreter to allow the configuration to set > values properly. Pass this value in the environment as > _PYSPARK_DRIVER_SYS_EXECUTABLE to the environment script. > h3. What changes were proposed in this pull request? > It's currently possible to set sensible site-wide spark configuration > defaults by using {{{}$SPARK_CONF_DIR/spark-env.sh{}}}. In the case where a > user is using pyspark, however, there are a number of things that aren't > discoverable by that script, due to the way that it's called. There is a > chain of calls (java_gateway.py -> shell script -> java -> shell script) that > ends up obliterating any bit of the python context. > This change proposes to add en environment variable > {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} which points to the filename of the > top-level python executable within pyspark's {{java_gateway.py}} > bootstrapping process. With that, spark-env.sh will be able to infer enough > information about the python environment to set the appropriate configuration > variables. > h3. Why are the changes needed? > Right now, there a number of config options useful to pyspark that can't be > reliably set by {{spark-env.sh}} because it is unaware of the python context > that spawning the executor. To give the most trivial example, it is currently > possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} > based on information readily available from the environment (e.g. 
the k8s > downward API). However, {{spark.pyspark.python}} and family cannot be set > because when {{spark-env.sh}} executes it's lost all of the python context. > We can instruct users to add the appropriate config variables, but this form > of cargo-culting is error-prone and not scalable. It would be much better to > expose important python variables so that pyspark can not be a second-class > citizen. > h3. Does this PR introduce _any_ user-facing change? > Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will > receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing > to the python executor. > h3. How was this patch tested? > To be perfectly honest, I don't know where this fits into the testing > infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to > java_gateway.py and that works, but in terms of adding this to the CI ... I'm > at a loss. I'm more than willing to add the additional info, if needed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
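Assuming the proposed {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} variable were exported before spark-env.sh runs, a site-wide configuration fragment might look like this (a sketch of the PR's intent, not shipped Spark behavior):

```shell
# Hypothetical $SPARK_CONF_DIR/spark-env.sh fragment. Assumes pyspark's
# java_gateway.py exported _PYSPARK_DRIVER_SYS_EXECUTABLE (the proposed,
# not-yet-merged variable) before this script is sourced.
if [ -n "$_PYSPARK_DRIVER_SYS_EXECUTABLE" ]; then
  # Reuse the driver's interpreter for the workers as well, so
  # spark.pyspark.python-style settings stay consistent site-wide.
  export PYSPARK_PYTHON="$_PYSPARK_DRIVER_SYS_EXECUTABLE"
fi
```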
[jira] [Commented] (SPARK-37650) Tell spark-env.sh the python interpreter
[ https://issues.apache.org/jira/browse/SPARK-37650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459584#comment-17459584 ] Apache Spark commented on SPARK-37650: -- User 'PerilousApricot' has created a pull request for this issue: https://github.com/apache/spark/pull/34903 > Tell spark-env.sh the python interpreter > > > Key: SPARK-37650 > URL: https://issues.apache.org/jira/browse/SPARK-37650 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Andrew Malone Melo >Priority: Major > > When loading config defaults via spark-env.sh, it can be useful to know > the current pyspark python interpreter to allow the configuration to set > values properly. Pass this value in the environment as > _PYSPARK_DRIVER_SYS_EXECUTABLE to the environment script. > h3. What changes were proposed in this pull request? > It's currently possible to set sensible site-wide spark configuration > defaults by using {{{}$SPARK_CONF_DIR/spark-env.sh{}}}. In the case where a > user is using pyspark, however, there are a number of things that aren't > discoverable by that script, due to the way that it's called. There is a > chain of calls (java_gateway.py -> shell script -> java -> shell script) that > ends up obliterating any bit of the python context. > This change proposes to add en environment variable > {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} which points to the filename of the > top-level python executable within pyspark's {{java_gateway.py}} > bootstrapping process. With that, spark-env.sh will be able to infer enough > information about the python environment to set the appropriate configuration > variables. > h3. Why are the changes needed? > Right now, there a number of config options useful to pyspark that can't be > reliably set by {{spark-env.sh}} because it is unaware of the python context > that spawning the executor. 
To give the most trivial example, it is currently > possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} > based on information readily available from the environment (e.g. the k8s > downward API). However, {{spark.pyspark.python}} and family cannot be set > because when {{spark-env.sh}} executes it's lost all of the python context. > We can instruct users to add the appropriate config variables, but this form > of cargo-culting is error-prone and not scalable. It would be much better to > expose important python variables so that pyspark can not be a second-class > citizen. > h3. Does this PR introduce _any_ user-facing change? > Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will > receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing > to the python executor. > h3. How was this patch tested? > To be perfectly honest, I don't know where this fits into the testing > infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to > java_gateway.py and that works, but in terms of adding this to the CI ... I'm > at a loss. I'm more than willing to add the additional info, if needed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37650) Tell spark-env.sh the python interpreter
[ https://issues.apache.org/jira/browse/SPARK-37650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37650: Assignee: Apache Spark > Tell spark-env.sh the python interpreter > > > Key: SPARK-37650 > URL: https://issues.apache.org/jira/browse/SPARK-37650 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Andrew Malone Melo >Assignee: Apache Spark >Priority: Major > > When loading config defaults via spark-env.sh, it can be useful to know > the current pyspark python interpreter to allow the configuration to set > values properly. Pass this value in the environment as > _PYSPARK_DRIVER_SYS_EXECUTABLE to the environment script. > h3. What changes were proposed in this pull request? > It's currently possible to set sensible site-wide spark configuration > defaults by using {{{}$SPARK_CONF_DIR/spark-env.sh{}}}. In the case where a > user is using pyspark, however, there are a number of things that aren't > discoverable by that script, due to the way that it's called. There is a > chain of calls (java_gateway.py -> shell script -> java -> shell script) that > ends up obliterating any bit of the python context. > This change proposes to add en environment variable > {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} which points to the filename of the > top-level python executable within pyspark's {{java_gateway.py}} > bootstrapping process. With that, spark-env.sh will be able to infer enough > information about the python environment to set the appropriate configuration > variables. > h3. Why are the changes needed? > Right now, there a number of config options useful to pyspark that can't be > reliably set by {{spark-env.sh}} because it is unaware of the python context > that spawning the executor. 
To give the most trivial example, it is currently > possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} > based on information readily available from the environment (e.g. the k8s > downward API). However, {{spark.pyspark.python}} and family cannot be set > because when {{spark-env.sh}} executes it's lost all of the python context. > We can instruct users to add the appropriate config variables, but this form > of cargo-culting is error-prone and not scalable. It would be much better to > expose important python variables so that pyspark can not be a second-class > citizen. > h3. Does this PR introduce _any_ user-facing change? > Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will > receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing > to the python executor. > h3. How was this patch tested? > To be perfectly honest, I don't know where this fits into the testing > infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to > java_gateway.py and that works, but in terms of adding this to the CI ... I'm > at a loss. I'm more than willing to add the additional info, if needed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect
[ https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YuanGuanhu updated SPARK-37643: --- Description: This ticket aims at fixing a bug where right-padding is not applied to char-type partition columns when charVarcharAsString is true and the partition data length is less than the defined length. For example, the query below returns nothing in master, but the correct result is `abc`. {code:java} scala> sql("set spark.sql.legacy.charVarcharAsString=true") scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned by (c)") scala> sql("INSERT INTO tb01 values(1, 'abc')") scala> sql("select c from tb01 where c = 'abc'").show +---+ | c| +---+ +---+{code} This is because `ApplyCharTypePadding` right-pads the expression to charLength. We should handle this taking the conf spark.sql.legacy.charVarcharAsString value into account. was: This ticket aim at fixing the bug that does not apply right-padding for char types partition column when charVarcharAsString is true and partition data length is lower than defined length. For example, a query below returns nothing in master, but a correct result is `abc`. {code:java} scala> sql("set spark.sql.legacy.charVarcharAsString=true") scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned by (c)") scala> sql("INSERT INTO tb01 values(1, 'abc')") scala> sql("select c from tb01 where c = 'abc'").show +---+ | c| +---+ +---+{code} This is because `ApplyCharTypePadding` rpad the expr to charLength. We should handle this consider conf spark.sql.legacy.charVarcharAsString value. 
> when charVarcharAsString is true, char datatype partition table query > incorrect > --- > > Key: SPARK-37643 > URL: https://issues.apache.org/jira/browse/SPARK-37643 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 > Environment: spark 3.2.0 >Reporter: YuanGuanhu >Priority: Major > > This ticket aims at fixing a bug where right-padding is not applied to > char-type partition columns when charVarcharAsString is true and the partition > data length is less than the defined length. > For example, the query below returns nothing in master, but the correct result > is `abc`. > {code:java} > scala> sql("set spark.sql.legacy.charVarcharAsString=true") > scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned > by (c)") > scala> sql("INSERT INTO tb01 values(1, 'abc')") > scala> sql("select c from tb01 where c = 'abc'").show > +---+ > | c| > +---+ > +---+{code} > This is because `ApplyCharTypePadding` right-pads the expression to charLength. > We should handle this taking the conf spark.sql.legacy.charVarcharAsString > value into account. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
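Why the padded filter matches nothing can be illustrated with a small stand-alone sketch (a toy model of the comparison, not Spark internals): under the legacy flag the partition value is stored unpadded, while the padding rule still pads the predicate side to the declared char length.

```python
def rpad(s, length, pad=" "):
    # Mirrors SQL rpad() for this example: right-pad to `length`
    # (truncation edge cases deliberately omitted).
    return s + pad * (length - len(s))

# Toy model of the reported mismatch for a char(5) partition column:
stored_partition_value = "abc"           # written as-is under charVarcharAsString=true
padded_predicate_value = rpad("abc", 5)  # what the padded filter compares against

assert padded_predicate_value == "abc  "
assert stored_partition_value != padded_predicate_value  # so the query returns no rows
```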
[jira] [Created] (SPARK-37650) Tell spark-env.sh the python interpreter
Andrew Malone Melo created SPARK-37650: -- Summary: Tell spark-env.sh the python interpreter Key: SPARK-37650 URL: https://issues.apache.org/jira/browse/SPARK-37650 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Andrew Malone Melo When loading config defaults via spark-env.sh, it can be useful to know the current pyspark python interpreter to allow the configuration to set values properly. Pass this value in the environment as _PYSPARK_DRIVER_SYS_EXECUTABLE to the environment script. h3. What changes were proposed in this pull request? It's currently possible to set sensible site-wide spark configuration defaults by using {{{}$SPARK_CONF_DIR/spark-env.sh{}}}. In the case where a user is using pyspark, however, there are a number of things that aren't discoverable by that script, due to the way that it's called. There is a chain of calls (java_gateway.py -> shell script -> java -> shell script) that ends up obliterating any bit of the python context. This change proposes to add an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} which points to the filename of the top-level python executable within pyspark's {{java_gateway.py}} bootstrapping process. With that, spark-env.sh will be able to infer enough information about the python environment to set the appropriate configuration variables. h3. Why are the changes needed? Right now, there are a number of config options useful to pyspark that can't be reliably set by {{spark-env.sh}} because it is unaware of the python context that is spawning the executor. To give the most trivial example, it is currently possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} based on information readily available from the environment (e.g. the k8s downward API). However, {{spark.pyspark.python}} and family cannot be set because by the time {{spark-env.sh}} executes, it has lost all of the python context. 
We can instruct users to add the appropriate config variables, but this form of cargo-culting is error-prone and not scalable. It would be much better to expose important python variables so that pyspark is not a second-class citizen. h3. Does this PR introduce _any_ user-facing change? Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing to the python executable. h3. How was this patch tested? To be perfectly honest, I don't know where this fits into the testing infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to java_gateway.py and that works, but in terms of adding this to the CI ... I'm at a loss. I'm more than willing to add the additional info, if needed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459579#comment-17459579 ] Hyukjin Kwon commented on SPARK-24853: -- [~nchammas], actually I don't think this can truly handle the duplicate column names ... allowing column will expose a possibility such as col("count") or other expressions such as from_json, etc. > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459578#comment-17459578 ] Apache Spark commented on SPARK-37649: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34902 > Switch default index to distributed-sequence by default in pandas API on Spark > -- > > Key: SPARK-37649 > URL: https://issues.apache.org/jira/browse/SPARK-37649 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > pandas API on Spark currently sets {{compute.default_index_type}} to > {{sequence}} which relies on sending all data to one executor that easily > causes OOM. > We should better switch to {{distributed-sequence}} type that truly > distributes the data. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37649: Assignee: Apache Spark > Switch default index to distributed-sequence by default in pandas API on Spark > -- > > Key: SPARK-37649 > URL: https://issues.apache.org/jira/browse/SPARK-37649 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > pandas API on Spark currently sets {{compute.default_index_type}} to > {{sequence}} which relies on sending all data to one executor that easily > causes OOM. > We should better switch to {{distributed-sequence}} type that truly > distributes the data. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37649: Assignee: (was: Apache Spark) > Switch default index to distributed-sequence by default in pandas API on Spark > -- > > Key: SPARK-37649 > URL: https://issues.apache.org/jira/browse/SPARK-37649 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > pandas API on Spark currently sets {{compute.default_index_type}} to > {{sequence}} which relies on sending all data to one executor that easily > causes OOM. > We should better switch to {{distributed-sequence}} type that truly > distributes the data. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37274) When the value of this parameter is greater than the maximum value of int type, the value will be thrown out of bounds. The document description of this parameter should remind the user of this risk point
[ https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-37274. -- Resolution: Won't Fix > When the value of this parameter is greater than the maximum value of int > type, the value will be thrown out of bounds. The document description of > this parameter should remind the user of this risk point > > > Key: SPARK-37274 > URL: https://issues.apache.org/jira/browse/SPARK-37274 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: hao >Priority: Major > > When the value of this parameter is greater than the maximum value of int > type, the value will be thrown out of bounds. The document description of > this parameter should remind the user of this risk point -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37626) Upgrade libthrift to 0.15.0
[ https://issues.apache.org/jira/browse/SPARK-37626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-37626. -- Fix Version/s: (was: 3.3.0) Resolution: Duplicate > Upgrade libthrift to 0.15.0 > --- > > Key: SPARK-37626 > URL: https://issues.apache.org/jira/browse/SPARK-37626 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Priority: Major > > Upgrade libthrift to 0.15.0 in order to avoid > https://nvd.nist.gov/vuln/detail/CVE-2020-13949. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark
Hyukjin Kwon created SPARK-37649: Summary: Switch default index to distributed-sequence by default in pandas API on Spark Key: SPARK-37649 URL: https://issues.apache.org/jira/browse/SPARK-37649 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Hyukjin Kwon pandas API on Spark currently sets {{compute.default_index_type}} to {{sequence}}, which relies on sending all data to one executor and easily causes OOM. We should switch to the {{distributed-sequence}} type, which truly distributes the data. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36168) Support Scala 2.13 in `dev/test-dependencies.sh`
[ https://issues.apache.org/jira/browse/SPARK-36168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-36168. -- Resolution: Not A Problem > Support Scala 2.13 in `dev/test-dependencies.sh` > > > Key: SPARK-36168 > URL: https://issues.apache.org/jira/browse/SPARK-36168 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36584) ExecutorMonitor#onBlockUpdated will receive event from driver
[ https://issues.apache.org/jira/browse/SPARK-36584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-36584. -- Resolution: Invalid > ExecutorMonitor#onBlockUpdated will receive event from driver > - > > Key: SPARK-36584 > URL: https://issues.apache.org/jira/browse/SPARK-36584 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.1.2 > Environment: Spark 3.1.2 >Reporter: 胡振宇 >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > When the driver broadcasts an object, it sends the > [SparkListenerBlockUpdated|https://github.com/apache/spark/blob/df0ec56723f0b47c3629055fa7a8c63bb4285147/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L228] > event. > [ExecutorMonitor#onBlockUpdated|https://github.com/apache/spark/blob/df0ec56723f0b47c3629055fa7a8c63bb4285147/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L380] > receives and handles the event; in this method, > it calls > [ensureExecutorIsTracked|https://github.com/apache/spark/blob/df0ec56723f0b47c3629055fa7a8c63bb4285147/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L489] > to put the driver in the `executors` variable with `UNKNOWN_RESOURCE_PROFILE_ID`. In > my understanding, `ExecutorMonitor` should only monitor executors, not the driver. > Although this will not cause any problems at the moment because > UNKNOWN_RESOURCE_PROFILE_ID will be filtered out, I think this is a > potential risk -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37620) Use more precise types for SparkContext Optional fields (i.e. _gateway, _jvm)
[ https://issues.apache.org/jira/browse/SPARK-37620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-37620: --- Parent: SPARK-37094 Issue Type: Sub-task (was: Improvement) > Use more precise types for SparkContext Optional fields (i.e. _gateway, _jvm) > -- > > Key: SPARK-37620 > URL: https://issues.apache.org/jira/browse/SPARK-37620 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > As a part of SPARK-37152 [we > agreed|https://github.com/apache/spark/pull/34466/files#r762609181] to keep > these typed as non-{{Optional}} for simplicity, but this is just a temporary > solution to move things forward, until we decide on the best approach to handle > such cases. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37630) Security issue from Log4j 1.X exploit
[ https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-37630. -- Resolution: Duplicate > Security issue from Log4j 1.X exploit > - > > Key: SPARK-37630 > URL: https://issues.apache.org/jira/browse/SPARK-37630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.2.0 >Reporter: Ismail H >Priority: Major > Labels: security > > log4j is being used in version [1.2.17|#L122] > > This version has been deprecated and since [then has had a known issue that > hasn't been addressed in 1.X > versions|https://www.cvedetails.com/cve/CVE-2019-17571/]. > > *Solution:* > * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last > known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit
[ https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459567#comment-17459567 ] Sean R. Owen commented on SPARK-37630: -- Spark depends on a whole lot of other libraries, and they use log4j 1.x, like Hadoop. That's most of the issue. Spark doesn't really care about the logging framework, though it touches log4j -- mostly to configure logs from other libraries. Anyone can try to fix it, but, it's harder than it sounds. You'd have to figure out how to plumb log4j 1.x calls to something else and exclude all log4j 1.x dependencies. This is a duplicate of other JIRAs -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37625) update log4j to 2.15
[ https://issues.apache.org/jira/browse/SPARK-37625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-37625. -- Resolution: Duplicate > update log4j to 2.15 > - > > Key: SPARK-37625 > URL: https://issues.apache.org/jira/browse/SPARK-37625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: weifeng zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26404) set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-26404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459552#comment-17459552 ] Tim Sanders commented on SPARK-26404: - I'm running into this as well. I'm on Spark 3.2.0, not using Kubernetes. Setting {{spark.pyspark.python}} via {{SparkSession.builder.config}} has no effect, but setting {{os.environ['PYSPARK_PYTHON']}} works as expected. Setting {{PYSPARK_PYTHON}} inside of {{spark-env.sh}} does _not_ seem to work with {{SparkSession}}, but it _does_ work with {{spark-submit}}. Setting {{spark.pyspark.python}} inside of {{spark-defaults.conf}} does seem to work for both {{spark-submit}} and {{SparkSession}}, but it doesn't help my use case as I want to change the selected version at runtime based on the version that the client is running (without maintaining multiple configs). I agree with [~vpadulan], not sure why this was marked as "Not a Problem". Any chance we can get this re-opened? > set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster > mode. > --- > > Key: SPARK-26404 > URL: https://issues.apache.org/jira/browse/SPARK-26404 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.0 >Reporter: Dongqing Liu >Priority: Major > > Neither > conf.set("spark.executorEnv.PYSPARK_PYTHON", "/opt/pythonenvs/bin/python") > nor > conf.set("spark.pyspark.python", "/opt/pythonenvs/bin/python") > works. > Looks like the executor always picks python from PATH. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
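Summarizing the workarounds reported in the comment above as a configuration fragment (the interpreter path is illustrative, not a real deployment path):

```
# conf/spark-defaults.conf -- reported to work for both spark-submit and SparkSession
spark.pyspark.python  /opt/pythonenvs/bin/python

# conf/spark-env.sh -- reported to work with spark-submit but not SparkSession
export PYSPARK_PYTHON=/opt/pythonenvs/bin/python
```

Neither route allows choosing the interpreter at runtime, which is the use case the commenter is asking to have re-opened.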
[jira] [Resolved] (SPARK-37146) Inline type hints for python/pyspark/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-37146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-37146. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34433 [https://github.com/apache/spark/pull/34433] > Inline type hints for python/pyspark/__init__.py > > > Key: SPARK-37146 > URL: https://issues.apache.org/jira/browse/SPARK-37146 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37146) Inline type hints for python/pyspark/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-37146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-37146: -- Assignee: dch nguyen -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37648) Spark catalog and Delta tables
Hanna Liashchuk created SPARK-37648: --- Summary: Spark catalog and Delta tables Key: SPARK-37648 URL: https://issues.apache.org/jira/browse/SPARK-37648 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Environment: Spark version 3.1.2 Scala version 2.12.10 Hive version 2.3.7 Delta version 1.0.0 Reporter: Hanna Liashchuk I'm using Spark with Delta tables; while the tables are created successfully, no columns are listed for them. Steps to reproduce: 1. Start spark-shell {code:java} spark-shell --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" --conf "spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY"{code} 2. Create a delta table {code:java} spark.range(10).write.format("delta").option("path", "tmp/delta").saveAsTable("delta"){code} 3. Make sure the table exists {code:java} spark.catalog.listTables.show{code} 4. Find out that no columns are listed {code:java} spark.catalog.listColumns("delta").show{code} This is critical for Delta integration with different BI tools such as Power BI or Tableau, as they query the Spark catalog for the metadata and we get errors that no columns are found. Discussion can be found in the Delta repository - https://github.com/delta-io/delta/issues/695 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-25150. -- Fix Version/s: 3.2.0 Resolution: Fixed It looks like Spark 3.1.2 exhibits a different sort of broken behavior: {code:java} pyspark.sql.utils.AnalysisException: Column State#38 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check. {code} I don't think the join in {{zombie-analysis.py}} is ambiguous, and since this now works fine in Spark 3.2.0, that's what I'm going to mark as the "Fix Version" for this issue. The fix must have made it in somewhere between Spark 3.1.2 and 3.2.0. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 >Reporter: Nicholas Chammas >Priority: Major > Labels: correctness > Fix For: 3.2.0 > > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. 
> Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459494#comment-17459494 ] Nicholas Chammas commented on SPARK-25150: -- I re-ran my test (described in the issue description + summarized in my comment just above) on Spark 3.2.0, and this issue appears to be resolved! Whether with cross joins enabled or disabled, I now get the correct results. Obviously, I have no clue what change since Spark 2.4.3 (the last time I reran this test) was responsible for the fix. But to be clear, in case anyone wants to reproduce my test: # Download all 6 files attached to this issue into a folder. # Then, from within that folder, run {{spark-submit zombie-analysis.py}} and inspect the output. # Then, enable cross joins (commented out on line 9), rerun the script, and reinspect the output. # Compare the final bit of output from both runs against {{{}expected-output.txt{}}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459467#comment-17459467 ] Nicholas Chammas commented on SPARK-24853: -- [~hyukjin.kwon] - Are you still opposed to this proposed improvement? If not, I'd like to work on it. > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded versions of withColumn and withColumnRenamed that accept > a Column type instead of a String? That way I can specify a fully-qualified name > in cases where there are duplicate column names. E.g., if I have 2 columns with > the same name as a result of a join and I want to rename one of the fields, I can > do it with this new API. > > This would be similar to the drop API, which supports both String and Column types. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-26589. -- Resolution: Won't Fix Marking this as "Won't Fix", but I suppose if someone really wanted to, they could reopen this issue and propose adding a median function that is simply an alias for {{{}percentile(col, 0.5){}}}. Don't know how the committers would feel about that. > proper `median` method for spark dataframe > -- > > Key: SPARK-26589 > URL: https://issues.apache.org/jira/browse/SPARK-26589 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jan Gorecki >Priority: Minor > > I found multiple tickets asking for a median function to be implemented in > Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a > duplicate of it. The thing is that approximate quantile is a workaround for the > lack of a median function. Thus I am filing this feature request for a proper, > exact (not approximate) median function. I am aware of the difficulties > that are caused by a distributed environment when trying to compute a median; > nevertheless, I don't think those difficulties are a good enough reason to drop > the `median` function from the scope of Spark. I am not asking for an efficient > median but an exact median. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37645) Word spell error - "labeled" spells as "labled"
[ https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37645: -- Component/s: Tests > Word spell error - "labeled" spells as "labled" > --- > > Key: SPARK-37645 > URL: https://issues.apache.org/jira/browse/SPARK-37645 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: qian >Assignee: qian >Priority: Minor > Fix For: 3.3.0 > > > Word spell error - "labeled" spells as "labled" -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37645) Word spell error - "labeled" spells as "labled"
[ https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37645: -- Affects Version/s: 3.3.0 (was: 3.1.0) (was: 3.2.0) (was: 3.1.1) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459455#comment-17459455 ] Nicholas Chammas commented on SPARK-26589: -- It looks like making a distributed, memory-efficient implementation of median is not possible using the design of Catalyst as it stands today. For more details, please see [this thread on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3cCAOhmDzev8d4H20XT1hUP9us=cpjeysgcf+xev7lg7dka1gj...@mail.gmail.com%3e]. It's possible to get an exact median today by using {{{}percentile(col, 0.5){}}}, which is [available via the SQL API|https://spark.apache.org/docs/3.2.0/sql-ref-functions-builtin.html#aggregate-functions]. It's not memory-efficient, so it may not work well on large datasets. The Python and Scala DataFrame APIs do not offer this exact percentile function, so I've filed SPARK-37647 to track exposing this function in those APIs.
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
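As a plain-Python illustration of the exact-median semantics discussed in the comment above, i.e. what {{percentile(col, 0.5)}} computes exactly, as opposed to an approximate quantile, here is a local sketch (it runs with the standard library only, not on Spark):

```python
import statistics

def exact_median(values):
    """Exact median: the middle element of the sorted data, or the mean of
    the two middle elements when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Matches the stdlib's exact median; an approximate quantile need not.
assert exact_median([7, 1, 3, 9, 5, 11]) == statistics.median([7, 1, 3, 9, 5, 11]) == 6
```

An approximate quantile (e.g. DataFrame.approxQuantile) may return any nearby value within its error bound, which is precisely what the original reporter wanted to avoid.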
[jira] [Created] (SPARK-37647) Expose percentile function in Scala/Python APIs
Nicholas Chammas created SPARK-37647: Summary: Expose percentile function in Scala/Python APIs Key: SPARK-37647 URL: https://issues.apache.org/jira/browse/SPARK-37647 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.2.0 Reporter: Nicholas Chammas SQL offers a percentile function (exact, not approximate) that is not available directly in the Scala or Python DataFrame APIs. While it is possible to invoke SQL functions from Scala or Python via {{{}expr(){}}}, I think most users expect function parity across Scala, Python, and SQL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37645) Word spell error - "labeled" spells as "labled"
[ https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37645: - Assignee: qian -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37645) Word spell error - "labeled" spells as "labled"
[ https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37645. --- Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/34899 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37217: - Fix Version/s: 3.2.1 > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.2.1, 3.3.0 > > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduces a > mechanism by which writes to external tables use dynamic partitioning, and the > data in the target partitions is deleted first. > Assuming 1001 partitions are written, the data of those 1001 partitions is > deleted first; but because hive.exec.max.dynamic.partitions is 1000 by > default, loadDynamicPartitions then fails, even though the data of the 1001 > partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function
[ https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37646: Assignee: Shixiong Zhu (was: Apache Spark) > Avoid touching Scala reflection APIs in the lit function > > > Key: SPARK-37646 > URL: https://issues.apache.org/jira/browse/SPARK-37646 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > Currently lit is slow when the concurrency is high as it needs to hit the > Scala reflection code which hits global locks. For example, running the > following test locally using Spark 3.2 shows the difference:
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
>
> val parallelism = 50
>
> def testLiteral(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           new Column(Literal(0L))
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
>
> def testLit(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           lit(0L)
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
>
> println("warmup")
> testLiteral()
> testLit()
>
> println("lit: false")
> spark.time {
>   testLiteral()
> }
> println("lit: true")
> spark.time {
>   testLit()
> }
>
> // Exiting paste mode, now interpreting.
>
> warmup
> lit: false
> Time taken: 8 ms
> lit: true
> Time taken: 682 ms
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> parallelism: Int = 50
> testLiteral: ()Unit
> testLit: ()Unit
> {code}
> -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function
[ https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37646: Assignee: Apache Spark (was: Shixiong Zhu) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function
[ https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459394#comment-17459394 ]

Apache Spark commented on SPARK-37646:

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/34901

> Avoid touching Scala reflection APIs in the lit function
>
> Key: SPARK-37646
> URL: https://issues.apache.org/jira/browse/SPARK-37646
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Priority: Major
>
> Currently, lit is slow when concurrency is high, because it goes through the
> Scala reflection code, which takes global locks. For example, running the
> following test locally using Spark 3.2 shows the difference:
> {code:scala}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
>
> val parallelism = 50
>
> def testLiteral(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           new Column(Literal(0L))
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
>
> def testLit(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           lit(0L)
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
>
> println("warmup")
> testLiteral()
> testLit()
>
> println("lit: false")
> spark.time {
>   testLiteral()
> }
> println("lit: true")
> spark.time {
>   testLit()
> }
>
> // Exiting paste mode, now interpreting.
>
> warmup
> lit: false
> Time taken: 8 ms
> lit: true
> Time taken: 682 ms
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> parallelism: Int = 50
> testLiteral: ()Unit
> testLit: ()Unit
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function
[ https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu reassigned SPARK-37646:

    Assignee: Shixiong Zhu

> Avoid touching Scala reflection APIs in the lit function
>
> Key: SPARK-37646
> URL: https://issues.apache.org/jira/browse/SPARK-37646
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Priority: Major
>
> Currently, lit is slow when concurrency is high, because it goes through the
> Scala reflection code, which takes global locks. For example, running the
> following test locally using Spark 3.2 shows the difference:
> {code:scala}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
>
> val parallelism = 50
>
> def testLiteral(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           new Column(Literal(0L))
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
>
> def testLit(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           lit(0L)
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
>
> println("warmup")
> testLiteral()
> testLit()
>
> println("lit: false")
> spark.time {
>   testLiteral()
> }
> println("lit: true")
> spark.time {
>   testLit()
> }
>
> // Exiting paste mode, now interpreting.
>
> warmup
> lit: false
> Time taken: 8 ms
> lit: true
> Time taken: 682 ms
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> parallelism: Int = 50
> testLiteral: ()Unit
> testLit: ()Unit
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function
Shixiong Zhu created SPARK-37646:

Summary: Avoid touching Scala reflection APIs in the lit function
Key: SPARK-37646
URL: https://issues.apache.org/jira/browse/SPARK-37646
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.2.0
Reporter: Shixiong Zhu

Currently, lit is slow when concurrency is high, because it goes through the Scala reflection code, which takes global locks. For example, running the following test locally using Spark 3.2 shows the difference:

{code:scala}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal

val parallelism = 50

def testLiteral(): Unit = {
  val ts = for (_ <- 0 until parallelism) yield {
    new Thread() {
      override def run() {
        for (_ <- 0 until 50) {
          new Column(Literal(0L))
        }
      }
    }
  }
  ts.foreach(_.start())
  ts.foreach(_.join())
}

def testLit(): Unit = {
  val ts = for (_ <- 0 until parallelism) yield {
    new Thread() {
      override def run() {
        for (_ <- 0 until 50) {
          lit(0L)
        }
      }
    }
  }
  ts.foreach(_.start())
  ts.foreach(_.join())
}

println("warmup")
testLiteral()
testLit()

println("lit: false")
spark.time {
  testLiteral()
}
println("lit: true")
spark.time {
  testLit()
}

// Exiting paste mode, now interpreting.

warmup
lit: false
Time taken: 8 ms
lit: true
Time taken: 682 ms
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal
parallelism: Int = 50
testLiteral: ()Unit
testLit: ()Unit
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
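[Editor's sketch] The benchmark in the report also points at the interim workaround it measures: building the catalyst Literal directly instead of going through lit's reflection-based type inference. A minimal sketch, assuming Spark 3.2 on the classpath; note that Literal is an internal Catalyst API, so this is a stopgap rather than a supported pattern:

{code:scala}
// Sketch only: constructs the Column without lit's Scala-reflection path,
// so no global reflection lock is taken. Internal API; may change between
// Spark versions.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal

val fastZero: Column = new Column(Literal(0L))
{code}

This is exactly what the testLiteral side of the benchmark exercises, in contrast to testLit.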
[jira] [Commented] (SPARK-28884) [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode deployment
[ https://issues.apache.org/jira/browse/SPARK-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459286#comment-17459286 ] Aleksey Tsalolikhin commented on SPARK-28884: - I've run into this as well, on Spark 3.1.2. I launched in "cluster" deploy mode with driver cores set to 32, but the UI shows 0: !Screen Shot 2021-12-14 at 8.31.23 AM.png! > [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode > deployment > - > > Key: SPARK-28884 > URL: https://issues.apache.org/jira/browse/SPARK-28884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Core0.png, Core1.png, Screen Shot 2021-12-14 at 8.31.23 > AM.png, Screen+Shot+2021-12-09+at+9.39.58+AM.png > > > Launch the Spark shell in local and YARN modes: > bin/spark-shell --master local > bin/spark-shell --master yarn (both client and cluster) > vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn > --deploy-mode cluster --class org.apache.spark.examples.SparkPi > /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10 > vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn > --deploy-mode client --class org.apache.spark.examples.SparkPi > /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10 > Open the UI and check the driver cores: it displays 0, but in local mode it displays 1. > Expectation: it should display 1 by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28884) [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode deployment
[ https://issues.apache.org/jira/browse/SPARK-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Tsalolikhin updated SPARK-28884: Attachment: Screen Shot 2021-12-14 at 8.31.23 AM.png > [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode > deployment > - > > Key: SPARK-28884 > URL: https://issues.apache.org/jira/browse/SPARK-28884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Core0.png, Core1.png, Screen Shot 2021-12-14 at 8.31.23 > AM.png, Screen+Shot+2021-12-09+at+9.39.58+AM.png > > > Launch the Spark shell in local and YARN modes: > bin/spark-shell --master local > bin/spark-shell --master yarn (both client and cluster) > vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn > --deploy-mode cluster --class org.apache.spark.examples.SparkPi > /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10 > vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn > --deploy-mode client --class org.apache.spark.examples.SparkPi > /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10 > Open the UI and check the driver cores: it displays 0, but in local mode it displays 1. > Expectation: it should display 1 by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28884) [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode deployment
[ https://issues.apache.org/jira/browse/SPARK-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Tsalolikhin updated SPARK-28884: Attachment: Screen+Shot+2021-12-09+at+9.39.58+AM.png > [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode > deployment > - > > Key: SPARK-28884 > URL: https://issues.apache.org/jira/browse/SPARK-28884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Core0.png, Core1.png, Screen Shot 2021-12-14 at 8.31.23 > AM.png, Screen+Shot+2021-12-09+at+9.39.58+AM.png > > > Launch the Spark shell in local and YARN modes: > bin/spark-shell --master local > bin/spark-shell --master yarn (both client and cluster) > vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn > --deploy-mode cluster --class org.apache.spark.examples.SparkPi > /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10 > vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn > --deploy-mode client --class org.apache.spark.examples.SparkPi > /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10 > Open the UI and check the driver cores: it displays 0, but in local mode it displays 1. > Expectation: it should display 1 by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect
[ https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37643: Assignee: Apache Spark > when charVarcharAsString is true, char datatype partition table query > incorrect > --- > > Key: SPARK-37643 > URL: https://issues.apache.org/jira/browse/SPARK-37643 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 > Environment: spark 3.2.0 >Reporter: YuanGuanhu >Assignee: Apache Spark >Priority: Major > > This ticket aims to fix the bug that right-padding is not applied to char-type > partition columns when charVarcharAsString is true and the partition data is > shorter than the defined length. > For example, the query below returns nothing on master, but the correct result is > `abc`. > {code:java} > scala> sql("set spark.sql.legacy.charVarcharAsString=true") > scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned > by (c)") > scala> sql("INSERT INTO tb01 values(1, 'abc')") > scala> sql("select c from tb01 where c = 'abc'").show > +---+ > | c| > +---+ > +---+{code} > This is because `ApplyCharTypePadding` right-pads the expression to charLength. We should > handle this taking the value of the conf spark.sql.legacy.charVarcharAsString into account. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect
[ https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459273#comment-17459273 ] Apache Spark commented on SPARK-37643: -- User 'fhygh' has created a pull request for this issue: https://github.com/apache/spark/pull/34900 > when charVarcharAsString is true, char datatype partition table query > incorrect > --- > > Key: SPARK-37643 > URL: https://issues.apache.org/jira/browse/SPARK-37643 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 > Environment: spark 3.2.0 >Reporter: YuanGuanhu >Priority: Major > > This ticket aims to fix the bug that right-padding is not applied to char-type > partition columns when charVarcharAsString is true and the partition data is > shorter than the defined length. > For example, the query below returns nothing on master, but the correct result is > `abc`. > {code:java} > scala> sql("set spark.sql.legacy.charVarcharAsString=true") > scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned > by (c)") > scala> sql("INSERT INTO tb01 values(1, 'abc')") > scala> sql("select c from tb01 where c = 'abc'").show > +---+ > | c| > +---+ > +---+{code} > This is because `ApplyCharTypePadding` right-pads the expression to charLength. We should > handle this taking the value of the conf spark.sql.legacy.charVarcharAsString into account. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect
[ https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37643: Assignee: (was: Apache Spark) > when charVarcharAsString is true, char datatype partition table query > incorrect > --- > > Key: SPARK-37643 > URL: https://issues.apache.org/jira/browse/SPARK-37643 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 > Environment: spark 3.2.0 >Reporter: YuanGuanhu >Priority: Major > > This ticket aims to fix the bug that right-padding is not applied to char-type > partition columns when charVarcharAsString is true and the partition data is > shorter than the defined length. > For example, the query below returns nothing on master, but the correct result is > `abc`. > {code:java} > scala> sql("set spark.sql.legacy.charVarcharAsString=true") > scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned > by (c)") > scala> sql("INSERT INTO tb01 values(1, 'abc')") > scala> sql("select c from tb01 where c = 'abc'").show > +---+ > | c| > +---+ > +---+{code} > This is because `ApplyCharTypePadding` right-pads the expression to charLength. We should > handle this taking the value of the conf spark.sql.legacy.charVarcharAsString into account. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
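[Editor's sketch] The mismatch SPARK-37643 describes can be seen without Spark at all: ApplyCharTypePadding right-pads the comparison literal to the declared char length, while with charVarcharAsString=true the partition value is written unpadded. A hedged illustration in plain Scala (not Spark's actual rule code):

{code:scala}
// Hypothetical illustration of the padding mismatch, not Spark source.
val charLength = 5                         // the declared char(5) length
val stored = "abc"                         // written unpadded when charVarcharAsString=true
val padded = "abc".padTo(charLength, ' ')  // what the planner compares against: "abc  "
assert(stored != padded)                   // hence `c = 'abc'` matches no rows
{code}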
[jira] [Resolved] (SPARK-37592) Improve performance of JoinSelection
[ https://issues.apache.org/jira/browse/SPARK-37592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37592. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34844 [https://github.com/apache/spark/pull/34844] > Improve performance of JoinSelection > > > Key: SPARK-37592 > URL: https://issues.apache.org/jira/browse/SPARK-37592 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > > While reading the AQE implementation, I found that the process of selecting a > join with hints contains a lot of cumbersome code. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37592) Improve performance of JoinSelection
[ https://issues.apache.org/jira/browse/SPARK-37592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37592: --- Assignee: jiaan.geng > Improve performance of JoinSelection > > > Key: SPARK-37592 > URL: https://issues.apache.org/jira/browse/SPARK-37592 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > > While reading the AQE implementation, I found that the process of selecting a > join with hints contains a lot of cumbersome code. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37645) Word spell error - "labeled" spells as "labled"
[ https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459180#comment-17459180 ] Apache Spark commented on SPARK-37645: -- User 'dcoliversun' has created a pull request for this issue: https://github.com/apache/spark/pull/34899 > Word spell error - "labeled" spells as "labled" > --- > > Key: SPARK-37645 > URL: https://issues.apache.org/jira/browse/SPARK-37645 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0, 3.1.1, 3.2.0 >Reporter: qian >Priority: Minor > Fix For: 3.3.0 > > > Word spell error - "labeled" spells as "labled" -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37645) Word spell error - "labeled" spells as "labled"
[ https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37645: Assignee: Apache Spark > Word spell error - "labeled" spells as "labled" > --- > > Key: SPARK-37645 > URL: https://issues.apache.org/jira/browse/SPARK-37645 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0, 3.1.1, 3.2.0 >Reporter: qian >Assignee: Apache Spark >Priority: Minor > Fix For: 3.3.0 > > > Word spell error - "labeled" spells as "labled" -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37645) Word spell error - "labeled" spells as "labled"
[ https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37645: Assignee: (was: Apache Spark) > Word spell error - "labeled" spells as "labled" > --- > > Key: SPARK-37645 > URL: https://issues.apache.org/jira/browse/SPARK-37645 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0, 3.1.1, 3.2.0 >Reporter: qian >Priority: Minor > Fix For: 3.3.0 > > > Word spell error - "labeled" spells as "labled" -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37645) Word spell error - "labeled" spells as "labled"
qian created SPARK-37645: Summary: Word spell error - "labeled" spells as "labled" Key: SPARK-37645 URL: https://issues.apache.org/jira/browse/SPARK-37645 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.2.0, 3.1.1, 3.1.0 Reporter: qian Fix For: 3.3.0 Word spell error - "labeled" spells as "labled" -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37567) reuse Exchange failed
[ https://issues.apache.org/jira/browse/SPARK-37567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-37567. -- Resolution: Not A Problem > reuse Exchange failed > -- > > Key: SPARK-37567 > URL: https://issues.apache.org/jira/browse/SPARK-37567 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: junbiao chen >Priority: Major > Labels: bugfix, performance, pull-request-available > Attachments: execution stage(1)-query2.png, execution > stage-query2.png, physical plan-query2.png > > > PR available: https://github.com/apache/spark/pull/34858 > Use case: query2 in TPC-DS. There are three exchange subqueries that scan the > same table "store_sales" in the logical plan, and these subqueries satisfy the exchange reuse > rule. I confirmed that the exchange reuse rule is applied in the physical plan. But when > Spark executes the physical plan, I found that > exchange reuse failed: the supposedly reused exchange was executed twice. > physical plan: > !physical plan-query2.png! > > execution stages: > !execution stage-query2.png! > > !execution stage(1)-query2.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37568) Support 2-arguments by the convert_timezone() function
[ https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459135#comment-17459135 ] Apache Spark commented on SPARK-37568: -- User 'yoda-mon' has created a pull request for this issue: https://github.com/apache/spark/pull/34896 > Support 2-arguments by the convert_timezone() function > -- > > Key: SPARK-37568 > URL: https://issues.apache.org/jira/browse/SPARK-37568 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > # If sourceTs is a timestamp_ntz, take the sourceTz from the session time > zone, see the SQL config spark.sql.session.timeZone > # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the > targetTz -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37568) Support 2-arguments by the convert_timezone() function
[ https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37568: Assignee: Apache Spark > Support 2-arguments by the convert_timezone() function > -- > > Key: SPARK-37568 > URL: https://issues.apache.org/jira/browse/SPARK-37568 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > # If sourceTs is a timestamp_ntz, take the sourceTz from the session time > zone, see the SQL config spark.sql.session.timeZone > # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the > targetTz -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37568) Support 2-arguments by the convert_timezone() function
[ https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37568: Assignee: (was: Apache Spark) > Support 2-arguments by the convert_timezone() function > -- > > Key: SPARK-37568 > URL: https://issues.apache.org/jira/browse/SPARK-37568 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > # If sourceTs is a timestamp_ntz, take the sourceTz from the session time > zone, see the SQL config spark.sql.session.timeZone > # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the > targetTz -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
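[Editor's sketch] The two rules in SPARK-37568 can be illustrated with hypothetical usage. Since the two-argument form is still a proposal targeting 3.3.0, the exact signature and behavior shown here are assumptions, not a released API:

{code:scala}
// Hedged sketch, assuming an active SparkSession `spark` and the proposed
// 2-argument convert_timezone(targetTz, sourceTs):
spark.conf.set("spark.sql.session.timeZone", "UTC")

// Rule 1 - timestamp_ntz input: the source zone is taken from
// spark.sql.session.timeZone (UTC here).
spark.sql("SELECT convert_timezone('Asia/Tokyo', timestamp_ntz'2021-12-15 00:00:00')").show()

// Rule 2 - timestamp_ltz input: the instant is rendered as a
// timestamp_ntz in the target zone.
spark.sql("SELECT convert_timezone('Asia/Tokyo', timestamp_ltz'2021-12-15 00:00:00Z')").show()
{code}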
[jira] [Commented] (SPARK-37644) Support datasource v2 complete aggregate pushdown
[ https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459117#comment-17459117 ] jiaan.geng commented on SPARK-37644: I'm working on it. > Support datasource v2 complete aggregate pushdown > -- > > Key: SPARK-37644 > URL: https://issues.apache.org/jira/browse/SPARK-37644 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports aggregate pushdown with partial-agg and final-agg. > For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg > steps by running the aggregate completely on the database. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37644) Support datasource v2 complete aggregate pushdown
jiaan.geng created SPARK-37644: -- Summary: Support datasource v2 complete aggregate pushdown Key: SPARK-37644 URL: https://issues.apache.org/jira/browse/SPARK-37644 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: jiaan.geng Currently, Spark supports aggregate pushdown with partial-agg and final-agg. For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg steps by running the aggregate completely on the database. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
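[Editor's sketch] To make the partial-agg/final-agg point in SPARK-37644 concrete, a hedged sketch with hypothetical connection details and table/column names:

{code:scala}
// Today, for an aggregate over a JDBC table, Spark plans
//   partial aggregate -> shuffle -> final aggregate
// and at most the partial step is pushed to the source. Complete pushdown
// would instead ship the whole query, roughly
//   SELECT dept, MAX(salary) FROM employees GROUP BY dept
// to the database and read back the finished result.
import org.apache.spark.sql.functions.max

val employees = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/hr") // hypothetical URL
  .option("dbtable", "employees")                       // hypothetical table
  .load()

employees.groupBy("dept").agg(max("salary")).show()
{code}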
[jira] [Resolved] (SPARK-37635) SHOW TBLPROPERTIES should print the fully qualified table name
[ https://issues.apache.org/jira/browse/SPARK-37635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-37635. Fix Version/s: 3.3.0 Assignee: Wenchen Fan Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34890 > SHOW TBLPROPERTIES should print the fully qualified table name > -- > > Key: SPARK-37635 > URL: https://issues.apache.org/jira/browse/SPARK-37635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37310) Migrate ALTER NAMESPACE ... SET PROPERTIES to use v2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-37310. Fix Version/s: 3.3.0 Assignee: Terry Kim (was: Apache Spark) Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34891 > Migrate ALTER NAMESPACE ... SET PROPERTIES to use v2 command by default > --- > > Key: SPARK-37310 > URL: https://issues.apache.org/jira/browse/SPARK-37310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.3.0 > > > Migrate ALTER NAMESPACE ... SET PROPERTIES to use v2 command by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-37641) Support ANSI Aggregate Function: regr_r2
[ https://issues.apache.org/jira/browse/SPARK-37641 ] jiaan.geng deleted comment on SPARK-37641: was (Author: beliefer): I'm working on. > Support ANSI Aggregate Function: regr_r2 > > > Key: SPARK-37641 > URL: https://issues.apache.org/jira/browse/SPARK-37641 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > REGR_R2 is an ANSI aggregate function. many database support it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit
[ https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459018#comment-17459018 ] Ismail H commented on SPARK-37630: -- Sean, thanks for the comment. Since this is my first contribution to the community, I might have gotten some of the description wrong. My bad. * "The problem is Spark dependencies that need log4j 1.x." Are you saying that those dependencies are different from Spark Core itself? Is there a Spark "internal component" that uses log4j? * Regarding "it'd be nice to update to log4j 2(.15)": can anyone contribute and try resolving it, or is it part of a specific roadmap? > Security issue from Log4j 1.X exploit > - > > Key: SPARK-37630 > URL: https://issues.apache.org/jira/browse/SPARK-37630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.2.0 >Reporter: Ismail H >Priority: Major > Labels: security > > log4j is being used in version [1.2.17|#L122]] > > This version has been deprecated and since then has [a known issue that > hasn't been addressed in 1.X > versions|https://www.cvedetails.com/cve/CVE-2019-17571/]. > > *Solution:* > * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last > known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37630) Security issue from Log4j 1.X exploit
[ https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismail H updated SPARK-37630: - Summary: Security issue from Log4j 1.X exploit (was: Security issue from Log4j 0day exploit) > Security issue from Log4j 1.X exploit > - > > Key: SPARK-37630 > URL: https://issues.apache.org/jira/browse/SPARK-37630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.2.0 >Reporter: Ismail H >Priority: Major > Labels: security > > log4j is being used in version [1.2.17|#L122]] > > This version has been deprecated and since then has [a known issue that > hasn't been addressed in 1.X > versions|https://www.cvedetails.com/cve/CVE-2019-17571/]. > > *Solution:* > * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last > known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect
[ https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YuanGuanhu updated SPARK-37643: --- Description: This ticket aims to fix a bug where right-padding is not applied to char type partition columns when charVarcharAsString is true and the partition data is shorter than the defined length. For example, the query below returns nothing on master, but the correct result is `abc`. {code:java} scala> sql("set spark.sql.legacy.charVarcharAsString=true") scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned by (c)") scala> sql("INSERT INTO tb01 values(1, 'abc')") scala> sql("select c from tb01 where c = 'abc'").show +---+ | c| +---+ +---+{code} This is because `ApplyCharTypePadding` right-pads the expression to charLength. We should handle this with the spark.sql.legacy.charVarcharAsString conf value taken into account. was: This ticket aim at fixing the bug that does not apply right-padding for char types partition column when charVarcharAsString is true and partition data length is lower than defined length. For example, a query below returns nothing in master, but a correct result is `abc`. {code:java} scala> sql("set spark.sql.legacy.charVarcharAsString=true") scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned by (c)") scala> sql("INSERT INTO tb01 values(1, 'abc')") scala> sql("select * from tb01 where c = 'abc'").show +---+ | c| +---+ +---+{code} This is because `ApplyCharTypePadding` rpad the expr to charLength. We should handle this consider conf spark.sql.legacy.charVarcharAsString value. 
> when charVarcharAsString is true, char datatype partition table query > incorrect > --- > > Key: SPARK-37643 > URL: https://issues.apache.org/jira/browse/SPARK-37643 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 > Environment: spark 3.2.0 >Reporter: YuanGuanhu >Priority: Major > > This ticket aims to fix a bug where right-padding is not applied to char > type partition columns when charVarcharAsString is true and the partition data > is shorter than the defined length. > For example, the query below returns nothing on master, but the correct result is > `abc`. > {code:java} > scala> sql("set spark.sql.legacy.charVarcharAsString=true") > scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned > by (c)") > scala> sql("INSERT INTO tb01 values(1, 'abc')") > scala> sql("select c from tb01 where c = 'abc'").show > +---+ > | c| > +---+ > +---+{code} > This is because `ApplyCharTypePadding` right-pads the expression to charLength. We should > handle this with the spark.sql.legacy.charVarcharAsString conf value taken into account. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect
YuanGuanhu created SPARK-37643: -- Summary: when charVarcharAsString is true, char datatype partition table query incorrect Key: SPARK-37643 URL: https://issues.apache.org/jira/browse/SPARK-37643 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.2 Environment: spark 3.2.0 Reporter: YuanGuanhu This ticket aims to fix a bug where right-padding is not applied to char type partition columns when charVarcharAsString is true and the partition data is shorter than the defined length. For example, the query below returns nothing on master, but the correct result is `abc`. {code:java} scala> sql("set spark.sql.legacy.charVarcharAsString=true") scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned by (c)") scala> sql("INSERT INTO tb01 values(1, 'abc')") scala> sql("select * from tb01 where c = 'abc'").show +---+ | c| +---+ +---+{code} This is because `ApplyCharTypePadding` right-pads the expression to charLength. We should handle this with the spark.sql.legacy.charVarcharAsString conf value taken into account. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
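The padding mismatch described above can be sketched in plain Scala, with no Spark required. Note this is an illustration only: `rpad` here is a hypothetical stand-in for the padding that the `ApplyCharTypePadding` rule injects around a char(5) column, not Spark code.

```scala
// Plain-Scala sketch of the mismatch. `rpad` mimics the right-padding that
// ApplyCharTypePadding wraps around a char(5) column in a comparison.
def rpad(s: String, len: Int, pad: Char = ' '): String =
  if (s.length >= len) s.take(len) else s + pad.toString * (len - s.length)

// With charVarcharAsString=true the partition value is stored unpadded,
// but the predicate is rewritten into rpad(c, 5) = 'abc':
val stored     = "abc"             // value written for the char(5) partition column
val columnSide = rpad(stored, 5)   // "abc  " after the injected padding
val matches    = columnSide == "abc"  // false, so the row is filtered out
```

The fix direction stated in the ticket follows from this: the padding rewrite should be skipped (or applied to both sides) when spark.sql.legacy.charVarcharAsString is true, so the unpadded stored value can match the literal.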
[jira] [Assigned] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6305: --- Assignee: (was: Apache Spark) > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6305: --- Assignee: Apache Spark > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Assignee: Apache Spark >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459007#comment-17459007 ] Apache Spark commented on SPARK-6305: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/34895 > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
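The swap Tal describes (replace the slf4j binding, put the log4j 2 jars on the classpath) could look roughly like this in sbt. This is a sketch only: the version and the exact exclusions are illustrative, and Spark's real build wires this through its Maven modules rather than a single build.sbt.

```scala
// build.sbt sketch: move from log4j 1.x to log4j 2.x (versions illustrative)
libraryDependencies ++= Seq(
  "org.apache.logging.log4j" % "log4j-api"        % "2.15.0",
  "org.apache.logging.log4j" % "log4j-core"       % "2.15.0",
  "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.15.0"  // slf4j -> log4j 2 binding
)
excludeDependencies ++= Seq(
  ExclusionRule("log4j", "log4j"),            // drop log4j 1.x itself
  ExclusionRule("org.slf4j", "slf4j-log4j12") // drop the old slf4j -> log4j 1.x binding
)
```

The exclusions matter because, as the ticket notes, transitive dependencies can drag log4j 1.x back in; for the shaded jars the equivalent change has to happen inside Spark's own build.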
[jira] [Assigned] (SPARK-37641) Support ANSI Aggregate Function: regr_r2
[ https://issues.apache.org/jira/browse/SPARK-37641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37641: Assignee: (was: Apache Spark) > Support ANSI Aggregate Function: regr_r2 > > > Key: SPARK-37641 > URL: https://issues.apache.org/jira/browse/SPARK-37641 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > REGR_R2 is an ANSI aggregate function. Many databases support it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37641) Support ANSI Aggregate Function: regr_r2
[ https://issues.apache.org/jira/browse/SPARK-37641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37641: Assignee: Apache Spark > Support ANSI Aggregate Function: regr_r2 > > > Key: SPARK-37641 > URL: https://issues.apache.org/jira/browse/SPARK-37641 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > REGR_R2 is an ANSI aggregate function. Many databases support it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37641) Support ANSI Aggregate Function: regr_r2
[ https://issues.apache.org/jira/browse/SPARK-37641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458989#comment-17458989 ] Apache Spark commented on SPARK-37641: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/34894 > Support ANSI Aggregate Function: regr_r2 > > > Key: SPARK-37641 > URL: https://issues.apache.org/jira/browse/SPARK-37641 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > REGR_R2 is an ANSI aggregate function. Many databases support it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
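For background on what the REGR_R2 ticket asks for: the semantics commonly implemented for REGR_R2(y, x) are NULL when x has zero variance, 1 when only y is constant, and otherwise the square of corr(y, x). The sketch below is a plain-Scala illustration of that common definition, not Spark's implementation (the function and its behavior here are assumptions drawn from how other databases define it).

```scala
// Sketch of REGR_R2(y, x) over (y, x) pairs, using population variance.
// Returns None where SQL would return NULL.
def regrR2(pairs: Seq[(Double, Double)]): Option[Double] = {
  val n = pairs.length.toDouble
  val (ys, xs) = pairs.unzip
  def varPop(v: Seq[Double]): Double = {
    val m = v.sum / n
    v.map(d => (d - m) * (d - m)).sum / n
  }
  val (vx, vy) = (varPop(xs), varPop(ys))
  if (vx == 0) None            // x constant: regression undefined -> NULL
  else if (vy == 0) Some(1.0)  // y constant: perfect (trivial) fit
  else {
    val (mx, my) = (xs.sum / n, ys.sum / n)
    val cov = pairs.map { case (y, x) => (x - mx) * (y - my) }.sum / n
    Some(math.pow(cov / math.sqrt(vx * vy), 2)) // corr(y, x) squared
  }
}
```

For perfectly linear data such as y = 2x the result is 1 (up to floating-point rounding), which matches the intuition that R-squared measures how well a linear fit explains y.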