[jira] [Commented] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459690#comment-17459690
 ] 

Apache Spark commented on SPARK-37651:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34906

> Use existing active Spark session in all places of pandas API on Spark
> --
>
> Key: SPARK-37651
> URL: https://issues.apache.org/jira/browse/SPARK-37651
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Use existing active Spark session instead of SparkSession.getOrCreate in all 
> places in pandas API on Spark



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37651:


Assignee: (was: Apache Spark)

> Use existing active Spark session in all places of pandas API on Spark
> --
>
> Key: SPARK-37651
> URL: https://issues.apache.org/jira/browse/SPARK-37651
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Use existing active Spark session instead of SparkSession.getOrCreate in all 
> places in pandas API on Spark



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459688#comment-17459688
 ] 

Apache Spark commented on SPARK-37651:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34906

> Use existing active Spark session in all places of pandas API on Spark
> --
>
> Key: SPARK-37651
> URL: https://issues.apache.org/jira/browse/SPARK-37651
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Use existing active Spark session instead of SparkSession.getOrCreate in all 
> places in pandas API on Spark



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37651:


Assignee: Apache Spark

> Use existing active Spark session in all places of pandas API on Spark
> --
>
> Key: SPARK-37651
> URL: https://issues.apache.org/jira/browse/SPARK-37651
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Use existing active Spark session instead of SparkSession.getOrCreate in all 
> places in pandas API on Spark



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32294) GroupedData Pandas UDF 2Gb limit

2021-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32294.
--
Fix Version/s: 3.3.0
   Resolution: Cannot Reproduce

This is fixed in the master branch. I can't reproduce it anymore:

{code}
>>> df = spark.range(1024 * 1024 * 1024 * 1).selectExpr("1 as a", "1 as b", "1 as c").coalesce(1)  # More than 2GB to hit ARROW-4890.
>>> df.groupby("a").applyInPandas(lambda pdf: pdf, schema=df.schema).count()
1073741824
{code}

> GroupedData Pandas UDF 2Gb limit
> 
>
> Key: SPARK-32294
> URL: https://issues.apache.org/jira/browse/SPARK-32294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
> Fix For: 3.3.0
>
>
> `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
> GroupedData: the whole group is passed to the Pandas UDF at once, which can hit 
> various 2GB limitations on the Arrow side (and, in current versions of Arrow, also 
> a 2GB limitation on the Netty allocator side) - 
> https://issues.apache.org/jira/browse/ARROW-4890 
> It would be great to consider feeding GroupedData into a pandas UDF in batches 
> to solve this issue. 
> cc [~hyukjin.kwon] 
>  
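For illustration, a minimal PySpark sketch of the pattern described above (placeholder 
data; per the report, each group reaches the function as a single pandas DataFrame 
regardless of the batch-size setting):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Caps Arrow batch sizes for ordinary Arrow-based conversions.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 10000)

df = spark.range(1000 * 1000).withColumn("key", col("id") % 4)

def identity(pdf):
    # Per this report, pdf holds the ENTIRE group rather than a capped batch,
    # so a very large group can exceed Arrow's 2GB per-batch limits.
    return pdf

df.groupby("key").applyInPandas(identity, schema=df.schema).count()
{code}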



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark

2021-12-14 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-37651:


 Summary: Use existing active Spark session in all places of pandas 
API on Spark
 Key: SPARK-37651
 URL: https://issues.apache.org/jira/browse/SPARK-37651
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Use existing active Spark session instead of SparkSession.getOrCreate in all 
places in pandas API on Spark
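For illustration, a minimal sketch of the intended lookup order (the helper name 
`default_session` is hypothetical, not Spark's actual internal API):

{code:python}
from pyspark.sql import SparkSession

def default_session() -> SparkSession:
    # Prefer the session the user already created and configured; only fall
    # back to building a new one when no session is active yet.
    active = SparkSession.getActiveSession()
    if active is not None:
        return active
    return SparkSession.builder.getOrCreate()
{code}

Preferring the active session avoids the "Using an existing SparkSession; the static sql 
configurations will not take effect" warnings that repeated getOrCreate() calls emit (see 
SPARK-37638).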



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37639) spark history server clean event log directory with out check status file

2021-12-14 Thread muhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459673#comment-17459673
 ] 

muhong edited comment on SPARK-37639 at 12/15/21, 6:31 AM:
---

To solve the problem, I modified the logic in EventLogFileWriters: when the 
rollEventLogFile method is invoked, it first checks whether the status file 
exists and, if not, recreates it; it works.

 
{code:java}
/** exposed for testing only */
private[history] def rollEventLogFile(): Unit = {
  // check whether the status file exists; if not, recreate it
  val isExist = checkAppStatusFileExist(true)
  if (!isExist) {
    createAppStatusFile(inProgress = true)
  }
  closeWriter()
  index += 1
  currentEventLogFilePath = getEventLogFilePath(logDirForAppPath, appId, appAttemptId, index,
    compressionCodecName)
  initLogFile(currentEventLogFilePath) { os =>
    countingOutputStream = Some(new CountingOutputStream(os))
    new PrintWriter(
      new OutputStreamWriter(countingOutputStream.get, StandardCharsets.UTF_8))
  }
}

// new check method
private def checkAppStatusFileExist(inProgress: Boolean): Boolean = {
  val appStatusPath = getAppStatusFilePath(logDirForAppPath, appId, appAttemptId, inProgress)
  fileSystem.exists(appStatusPath)
}
{code}
 

Why did I choose to modify it like this rather than change the logic in the history 
server? Because a long-running Spark Thrift Server might exit abnormally without 
updating the status file, yet its directory still needs to be deleted.

During testing I found another problem: if the Thrift Server exits normally 
without writing any event log, the history server cannot delete that kind of 
directory, because it throws an IllegalArgumentException("directory must 
have at least one eventlog file").


was (Author: m-sir):
To solve the problem, I modified the logic in EventLogFileWriters: when the 
rollEventLogFile method is invoked, it first checks whether the status file 
exists and, if not, recreates it; it works.

 
{code:java}
/** exposed for testing only */
private[history] def rollEventLogFile(): Unit = {
  // check whether the status file exists; if not, recreate it

  closeWriter()
  index += 1
  currentEventLogFilePath = getEventLogFilePath(logDirForAppPath, appId, appAttemptId, index,
    compressionCodecName)
  initLogFile(currentEventLogFilePath) { os =>
    countingOutputStream = Some(new CountingOutputStream(os))
    new PrintWriter(
      new OutputStreamWriter(countingOutputStream.get, StandardCharsets.UTF_8))
  }
} {code}
 

Why did I choose to modify it like this rather than change the logic in the history 
server? Because a long-running Spark Thrift Server might exit abnormally without 
updating the status file, yet its directory still needs to be deleted.

During testing I found another problem: if the Thrift Server exits normally 
without writing any event log, the history server cannot delete that kind of 
directory, because it throws an IllegalArgumentException("directory must 
have at least one eventlog file").

> spark history server clean event log directory with out check status file
> -
>
> Key: SPARK-37639
> URL: https://issues.apache.org/jira/browse/SPARK-37639
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: muhong
>Priority: Major
>
> I found a problem: the Thrift Server creates the event log file (the .inprogress 
> status file is created at init), and the history server cleans application event 
> log files according to size and modification time. So there is a potential problem 
> in this situation:
> *If the Thrift Server receives no requests for a long time (longer than the time 
> configured by spark.history.fs.cleaner.maxAge), the history server will clean the 
> application log directory together with the .inprogress file. After the cleanup, 
> the Thrift Server may accept a lot of requests and generate a new event log 
> directory without the .inprogress status file, and that directory will never be 
> cleaned by the history server because it does not contain a status file. This 
> leads to a space leak.*
> I think whenever a new log file is created, we need to check whether the status 
> file exists and, if not, create it.
> Lastly, I think an extra function is needed: as with log4j, compacted files still 
> need to be cleaned after a period (configured by the user), so that the event log 
> space of a long-running Spark service such as the Thrift Server can be limited to 
> a configured size.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37639) spark history server clean event log directory with out check status file

2021-12-14 Thread muhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459673#comment-17459673
 ] 

muhong commented on SPARK-37639:


To solve the problem, I modified the logic in EventLogFileWriters: when the 
rollEventLogFile method is invoked, it first checks whether the status file 
exists and, if not, recreates it; it works.

 
{code:java}
/** exposed for testing only */
private[history] def rollEventLogFile(): Unit = {
  // check whether the status file exists; if not, recreate it

  closeWriter()
  index += 1
  currentEventLogFilePath = getEventLogFilePath(logDirForAppPath, appId, appAttemptId, index,
    compressionCodecName)
  initLogFile(currentEventLogFilePath) { os =>
    countingOutputStream = Some(new CountingOutputStream(os))
    new PrintWriter(
      new OutputStreamWriter(countingOutputStream.get, StandardCharsets.UTF_8))
  }
} {code}
 

Why did I choose to modify it like this rather than change the logic in the history 
server? Because a long-running Spark Thrift Server might exit abnormally without 
updating the status file, yet its directory still needs to be deleted.

During testing I found another problem: if the Thrift Server exits normally 
without writing any event log, the history server cannot delete that kind of 
directory, because it throws an IllegalArgumentException("directory must 
have at least one eventlog file").

> spark history server clean event log directory with out check status file
> -
>
> Key: SPARK-37639
> URL: https://issues.apache.org/jira/browse/SPARK-37639
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: muhong
>Priority: Major
>
> I found a problem: the Thrift Server creates the event log file (the .inprogress 
> status file is created at init), and the history server cleans application event 
> log files according to size and modification time. So there is a potential problem 
> in this situation:
> *If the Thrift Server receives no requests for a long time (longer than the time 
> configured by spark.history.fs.cleaner.maxAge), the history server will clean the 
> application log directory together with the .inprogress file. After the cleanup, 
> the Thrift Server may accept a lot of requests and generate a new event log 
> directory without the .inprogress status file, and that directory will never be 
> cleaned by the history server because it does not contain a status file. This 
> leads to a space leak.*
> I think whenever a new log file is created, we need to check whether the status 
> file exists and, if not, create it.
> Lastly, I think an extra function is needed: as with log4j, compacted files still 
> need to be cleaned after a period (configured by the user), so that the event log 
> space of a long-running Spark service such as the Thrift Server can be limited to 
> a configured size.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-36168) Support Scala 2.13 in `dev/test-dependencies.sh`

2021-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-36168.
-

> Support Scala 2.13 in `dev/test-dependencies.sh`
> 
>
> Key: SPARK-36168
> URL: https://issues.apache.org/jira/browse/SPARK-36168
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37628) Upgrade Netty from 4.1.68 to 4.1.72

2021-12-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-37628:
-
Description: 
After SPARK-36729, Netty has released four new versions:
 * [https://netty.io/news/2021/10/11/4-1-69-Final.html]
 * [https://netty.io/news/2021/10/11/4-1-70-Final.html]
 * [https://netty.io/news/2021/12/09/4-1-71-Final.html]
 * [https://netty.io/news/2021/12/13/4-1-72-Final.html]

We can avoid some potential bugs by upgrading to 4.1.72, which also fixes 
security issues related to Log4j2 in Netty.

 

  was:
After SPARK-36729, Netty has released four new versions:
 * [https://netty.io/news/2021/10/11/4-1-69-Final.html]
 * [https://netty.io/news/2021/10/11/4-1-70-Final.html]
 * [https://netty.io/news/2021/12/09/4-1-71-Final.html]
 * [https://netty.io/news/2021/12/13/4-1-72-Final.html]

We can avoid some potential bugs by upgrading to 4.1.72

 


> Upgrade Netty from 4.1.68 to 4.1.72
> ---
>
> Key: SPARK-37628
> URL: https://issues.apache.org/jira/browse/SPARK-37628
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> After SPARK-36729, Netty has released four new versions:
>  * [https://netty.io/news/2021/10/11/4-1-69-Final.html]
>  * [https://netty.io/news/2021/10/11/4-1-70-Final.html]
>  * [https://netty.io/news/2021/12/09/4-1-71-Final.html]
>  * [https://netty.io/news/2021/12/13/4-1-72-Final.html]
> We can avoid some potential bugs by upgrading to 4.1.72, which also fixes 
> security issues related to Log4j2 in Netty.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37628) Upgrade Netty from 4.1.68 to 4.1.72

2021-12-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-37628:
-
Description: 
After SPARK-36729, Netty has released four new versions:
 * [https://netty.io/news/2021/10/11/4-1-69-Final.html]
 * [https://netty.io/news/2021/10/11/4-1-70-Final.html]
 * [https://netty.io/news/2021/12/09/4-1-71-Final.html]
 * [https://netty.io/news/2021/12/13/4-1-72-Final.html]

We can avoid some potential bugs by upgrading to 4.1.72

 

  was:
After SPARK-36729, netty released 3 new versions:
 * [https://netty.io/news/2021/10/11/4-1-69-Final.html]
 * [https://netty.io/news/2021/10/11/4-1-70-Final.html]
 * [https://netty.io/news/2021/12/09/4-1-71-Final.html]

We can avoid some potential bugs by upgrading to 4.1.71

 


> Upgrade Netty from 4.1.68 to 4.1.72
> ---
>
> Key: SPARK-37628
> URL: https://issues.apache.org/jira/browse/SPARK-37628
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> After SPARK-36729, Netty has released four new versions:
>  * [https://netty.io/news/2021/10/11/4-1-69-Final.html]
>  * [https://netty.io/news/2021/10/11/4-1-70-Final.html]
>  * [https://netty.io/news/2021/12/09/4-1-71-Final.html]
>  * [https://netty.io/news/2021/12/13/4-1-72-Final.html]
> We can avoid some potential bugs by upgrading to 4.1.72
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37628) Upgrade Netty from 4.1.68 to 4.1.72

2021-12-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-37628:
-
Summary: Upgrade Netty from 4.1.68 to 4.1.72  (was: Upgrade Netty from 
4.1.68 to 4.1.71)

> Upgrade Netty from 4.1.68 to 4.1.72
> ---
>
> Key: SPARK-37628
> URL: https://issues.apache.org/jira/browse/SPARK-37628
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> After SPARK-36729, netty released 3 new versions:
>  * [https://netty.io/news/2021/10/11/4-1-69-Final.html]
>  * [https://netty.io/news/2021/10/11/4-1-70-Final.html]
>  * [https://netty.io/news/2021/12/09/4-1-71-Final.html]
> We can avoid some potential bugs by upgrading to 4.1.71
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37575) null values should be saved as nothing rather than quoted empty Strings "" with default settings

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459647#comment-17459647
 ] 

Apache Spark commented on SPARK-37575:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/34905

> null values should be saved as nothing rather than quoted empty Strings "" 
> with default settings
> 
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
> Fix For: 3.3.0
>
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}
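For context, a hedged workaround sketch based on the emptyValue option quoted from the 
migration guide above (the output path is a placeholder):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([("spark",), (None,), ("",)], ["name"])

(data.coalesce(1)
     .write
     .option("emptyValue", "")   # write empty strings unquoted, as in Spark 2.3
     .mode("overwrite")
     .csv("/tmp/spark_csv_test"))
{code}

This only controls how empty strings are written; the point of this ticket is that null 
values are also written as quoted empty strings under the default settings.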



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37644) Support datasource v2 complete aggregate pushdown

2021-12-14 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37644 ]


jiaan.geng deleted comment on SPARK-37644:


was (Author: beliefer):
I'm working on.

> Support datasource v2 complete aggregate pushdown 
> --
>
> Key: SPARK-37644
> URL: https://issues.apache.org/jira/browse/SPARK-37644
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports aggregate push down with partial-agg and final-agg. 
> For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg 
> steps by running the aggregate completely in the database.
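For illustration, a minimal PySpark sketch of the kind of query whose aggregate could run 
entirely in the database (connection details are placeholders, and the sketch assumes the 
pushDownAggregate JDBC option available in recent Spark releases):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import max as max_

spark = SparkSession.builder.getOrCreate()

employees = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db-host:5432/hr")  # placeholder
             .option("dbtable", "employees")                      # placeholder
             .option("pushDownAggregate", "true")
             .load())

# With complete pushdown, the partial-agg/final-agg steps on the Spark side
# could be skipped for a query like this.
employees.groupBy("dept").agg(max_("salary")).explain()
{code}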



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37644) Support datasource v2 complete aggregate pushdown

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459639#comment-17459639
 ] 

Apache Spark commented on SPARK-37644:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34904

> Support datasource v2 complete aggregate pushdown 
> --
>
> Key: SPARK-37644
> URL: https://issues.apache.org/jira/browse/SPARK-37644
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports aggregate push down with partial-agg and final-agg. 
> For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg 
> steps by running the aggregate completely in the database.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37644) Support datasource v2 complete aggregate pushdown

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37644:


Assignee: (was: Apache Spark)

> Support datasource v2 complete aggregate pushdown 
> --
>
> Key: SPARK-37644
> URL: https://issues.apache.org/jira/browse/SPARK-37644
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports aggregate push down with partial-agg and final-agg. 
> For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg 
> steps by running the aggregate completely in the database.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37644) Support datasource v2 complete aggregate pushdown

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37644:


Assignee: Apache Spark

> Support datasource v2 complete aggregate pushdown 
> --
>
> Key: SPARK-37644
> URL: https://issues.apache.org/jira/browse/SPARK-37644
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark supports aggregate push down with partial-agg and final-agg. 
> For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg 
> steps by running the aggregate completely in the database.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37644) Support datasource v2 complete aggregate pushdown

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459638#comment-17459638
 ] 

Apache Spark commented on SPARK-37644:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34904

> Support datasource v2 complete aggregate pushdown 
> --
>
> Key: SPARK-37644
> URL: https://issues.apache.org/jira/browse/SPARK-37644
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports aggregate push down with partial-agg and final-agg. 
> For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg 
> steps by running the aggregate completely in the database.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37575) null values should be saved as nothing rather than quoted empty Strings "" with default settings

2021-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37575:
-
Fix Version/s: (was: 3.2.1)

> null values should be saved as nothing rather than quoted empty Strings "" 
> with default settings
> 
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
> Fix For: 3.3.0
>
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function

2021-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37646.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34901
[https://github.com/apache/spark/pull/34901]

> Avoid touching Scala reflection APIs in the lit function
> 
>
> Key: SPARK-37646
> URL: https://issues.apache.org/jira/browse/SPARK-37646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently lit is slow when the concurrency is high as it needs to hit the 
> Scala reflection code which hits global locks. For example, running the 
> following test locally using Spark 3.2 shows the difference:
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> 
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> 
> val parallelism = 50
> 
> def testLiteral(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           new Column(Literal(0L))
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
> 
> def testLit(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>         for (_ <- 0 until 50) {
>           lit(0L)
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }
> 
> println("warmup")
> testLiteral()
> testLit()
> 
> println("lit: false")
> spark.time {
>   testLiteral()
> }
> println("lit: true")
> spark.time {
>   testLit()
> }
> 
> // Exiting paste mode, now interpreting.
> 
> warmup
> lit: false
> Time taken: 8 ms
> lit: true
> Time taken: 682 ms
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> parallelism: Int = 50
> testLiteral: ()Unit
> testLit: ()Unit {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37154) Inline type hints for python/pyspark/rdd.py

2021-12-14 Thread Byron Hsu (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37154 ]


Byron Hsu deleted comment on SPARK-37154:
---

was (Author: byronhsu):
i am working on this

> Inline type hints for python/pyspark/rdd.py
> ---
>
> Key: SPARK-37154
> URL: https://issues.apache.org/jira/browse/SPARK-37154
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37563) Implement days, seconds, microseconds properties of TimedeltaIndex

2021-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37563.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34825
[https://github.com/apache/spark/pull/34825]

> Implement days, seconds, microseconds properties of TimedeltaIndex
> --
>
> Key: SPARK-37563
> URL: https://issues.apache.org/jira/browse/SPARK-37563
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Implement days, seconds, microseconds properties of TimedeltaIndex
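For reference, a hedged, pandas-only sketch of the parity target (pandas-on-Spark's 
TimedeltaIndex is expected to expose the same properties once this lands):

{code:python}
import pandas as pd

# TimedeltaIndex exposes the day/second/microsecond components of each timedelta.
idx = pd.TimedeltaIndex([pd.Timedelta(days=1, seconds=30, microseconds=500)])
print(idx.days)
print(idx.seconds)
print(idx.microseconds)
{code}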



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37563) Implement days, seconds, microseconds properties of TimedeltaIndex

2021-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37563:


Assignee: Xinrong Meng

> Implement days, seconds, microseconds properties of TimedeltaIndex
> --
>
> Key: SPARK-37563
> URL: https://issues.apache.org/jira/browse/SPARK-37563
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Implement days, seconds, microseconds properties of TimedeltaIndex



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37638) Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

2021-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37638.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34893
[https://github.com/apache/spark/pull/34893]

> Use existing active Spark session instead of SparkSession.getOrCreate in 
> pandas API on Spark
> 
>
> Key: SPARK-37638
> URL: https://issues.apache.org/jira/browse/SPARK-37638
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.3.0
>
>
> {code}
> >>> ps.range(10)
> 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the 
> static sql configurations will not take effect.
> 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some 
> spark core configurations may not take effect.
> 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the 
> static sql configurations will not take effect.
> 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some 
> spark core configurations may not take effect.
> 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the 
> static sql configurations will not take effect.
> 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some 
> spark core configurations may not take effect.
> 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the 
> static sql configurations will not take effect.
> 21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some 
> spark core configurations may not take effect.
> ...
>id
> 0   0
> 1   1
> 2   2
> 3   3
> 4   4
> 5   5
> 6   6
> 7   7
> 8   8
> 9   9
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark

2021-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37649:


Assignee: Hyukjin Kwon

> Switch default index to distributed-sequence by default in pandas API on Spark
> --
>
> Key: SPARK-37649
> URL: https://issues.apache.org/jira/browse/SPARK-37649
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> pandas API on Spark currently sets {{compute.default_index_type}} to 
> {{sequence}}, which relies on sending all data to one executor and easily 
> causes OOM.
> We should switch to the {{distributed-sequence}} type, which truly 
> distributes the data.
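For reference, a small sketch of how the option can be set explicitly today (the proposed 
default would make this call unnecessary):

{code:python}
import pyspark.pandas as ps

# "sequence" builds a global sequential index by sending all data to one
# executor, per the description above; "distributed-sequence" stays distributed.
ps.set_option("compute.default_index_type", "distributed-sequence")

psdf = ps.range(10)  # picks up the configured default index type
{code}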



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark

2021-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37649.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34902
[https://github.com/apache/spark/pull/34902]

> Switch default index to distributed-sequence by default in pandas API on Spark
> --
>
> Key: SPARK-37649
> URL: https://issues.apache.org/jira/browse/SPARK-37649
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> pandas API on Spark currently sets {{compute.default_index_type}} to 
> {{sequence}}, which relies on sending all data to one executor and easily 
> causes OOM.
> We should switch to the {{distributed-sequence}} type, which truly 
> distributes the data.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-12-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459607#comment-17459607
 ] 

Hyukjin Kwon commented on SPARK-24853:
--

Oh I meant that it will allow something like:

{{withColumnRenamed(from_json(col("a"), col("another_col")))}} where 
{{from_json(col("a"))}} is not actually an existing column.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that accepts a 
> Column type instead of a String? That way I can specify the FQN in case there 
> are duplicate column names, e.g. if I have 2 columns with the same name as a 
> result of a join and I want to rename one of the fields, I can do it with this 
> new API.
>  
> This would be similar to the drop API, which supports both String and Column types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459601#comment-17459601
 ] 

Nicholas Chammas commented on SPARK-24853:
--

Assuming we are talking about the example I provided: Yes, {{col("count")}} 
would still be ambiguous.

I don't know if Spark would know to catch that problem. But note that the 
current behavior of {{.withColumnRenamed('count', ...)}} renames all columns 
named "count", which is just incorrect.

So allowing {{col("count")}} will either be just as incorrect as the current 
behavior, or it will be an improvement in that Spark may complain that the 
column reference is ambiguous. I'd have to try it to confirm the behavior.

Of course, the main improvement offered by {{Column}} references is that users 
can do something like {{.withColumnRenamed(left_counts['count'], ...)}} and get 
the correct behavior.

I didn't follow what you are getting at regarding {{{}from_json{}}}, but does 
that address your concern?
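To make the ambiguity concrete, a small PySpark sketch (the Column-accepting overload at 
the end is the proposed, hypothetical API from this ticket, not something Spark supports 
today):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, 10)], ["id", "count"])
right = spark.createDataFrame([(1, 20)], ["id", "count"])

joined = left.join(right, "id")
print(joined.columns)  # ['id', 'count', 'count'] -- two columns named "count"

# Today, a string-based rename cannot say which "count" column is meant.
renamed = joined.withColumnRenamed("count", "left_count")

# Proposed (hypothetical) overload: disambiguate with a Column reference.
# renamed = joined.withColumnRenamed(left["count"], "left_count")
{code}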

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that accepts a 
> Column type instead of a String? That way I can specify the FQN in case there 
> are duplicate column names, e.g. if I have 2 columns with the same name as a 
> result of a join and I want to rename one of the fields, I can do it with this 
> new API.
>  
> This would be similar to the drop API, which supports both String and Column types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37629) speed up Expression.canonicalized

2021-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37629:


Assignee: Wenchen Fan

> speed up Expression.canonicalized
> -
>
> Key: SPARK-37629
> URL: https://issues.apache.org/jira/browse/SPARK-37629
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37629) speed up Expression.canonicalized

2021-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37629.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34883
[https://github.com/apache/spark/pull/34883]

> speed up Expression.canonicalized
> -
>
> Key: SPARK-37629
> URL: https://issues.apache.org/jira/browse/SPARK-37629
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37650) Tell spark-env.sh the python interpreter

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37650:


Assignee: (was: Apache Spark)

> Tell spark-env.sh the python interpreter
> 
>
> Key: SPARK-37650
> URL: https://issues.apache.org/jira/browse/SPARK-37650
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Andrew Malone Melo
>Priority: Major
>
> When loading config defaults via spark-env.sh, it can be useful to know
> the current pyspark python interpreter to allow the configuration to set
> values properly. Pass this value in the environment as
> _PYSPARK_DRIVER_SYS_EXECUTABLE to the environment script.
> h3. What changes were proposed in this pull request?
> It's currently possible to set sensible site-wide spark configuration 
> defaults by using {{{}$SPARK_CONF_DIR/spark-env.sh{}}}. In the case where a 
> user is using pyspark, however, there are a number of things that aren't 
> discoverable by that script, due to the way that it's called. There is a 
> chain of calls (java_gateway.py -> shell script -> java -> shell script) that 
> ends up obliterating any bit of the python context.
> This change proposes to add an environment variable 
> {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} which points to the filename of the 
> top-level python executable within pyspark's {{java_gateway.py}} 
> bootstrapping process. With that, spark-env.sh will be able to infer enough 
> information about the python environment to set the appropriate configuration 
> variables.
> h3. Why are the changes needed?
> Right now, there are a number of config options useful to pyspark that can't be 
> reliably set by {{spark-env.sh}} because it is unaware of the python context 
> that is spawning the executor. To give the most trivial example, it is currently 
> possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} 
> based on information readily available from the environment (e.g. the k8s 
> downward API). However, {{spark.pyspark.python}} and family cannot be set 
> because when {{spark-env.sh}} executes it has lost all of the python context. 
> We can instruct users to add the appropriate config variables, but this form 
> of cargo-culting is error-prone and not scalable. It would be much better to 
> expose important python variables so that pyspark is not a second-class 
> citizen.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will 
> receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing 
> to the python executable.
> h3. How was this patch tested?
> To be perfectly honest, I don't know where this fits into the testing 
> infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to 
> java_gateway.py and that works, but in terms of adding this to the CI ... I'm 
> at a loss. I'm more than willing to add the additional info, if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37650) Tell spark-env.sh the python interpreter

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459584#comment-17459584
 ] 

Apache Spark commented on SPARK-37650:
--

User 'PerilousApricot' has created a pull request for this issue:
https://github.com/apache/spark/pull/34903

> Tell spark-env.sh the python interpreter
> 
>
> Key: SPARK-37650
> URL: https://issues.apache.org/jira/browse/SPARK-37650
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Andrew Malone Melo
>Priority: Major
>
> When loading config defaults via spark-env.sh, it can be useful to know
> the current pyspark python interpreter to allow the configuration to set
> values properly. Pass this value in the environment as
> _PYSPARK_DRIVER_SYS_EXECUTABLE to the environment script.
> h3. What changes were proposed in this pull request?
> It's currently possible to set sensible site-wide spark configuration 
> defaults by using {{{}$SPARK_CONF_DIR/spark-env.sh{}}}. In the case where a 
> user is using pyspark, however, there are a number of things that aren't 
> discoverable by that script, due to the way that it's called. There is a 
> chain of calls (java_gateway.py -> shell script -> java -> shell script) that 
> ends up obliterating any bit of the python context.
> This change proposes to add an environment variable 
> {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} which points to the filename of the 
> top-level python executable within pyspark's {{java_gateway.py}} 
> bootstrapping process. With that, spark-env.sh will be able to infer enough 
> information about the python environment to set the appropriate configuration 
> variables.
> h3. Why are the changes needed?
> Right now, there are a number of config options useful to pyspark that can't be 
> reliably set by {{spark-env.sh}} because it is unaware of the python context 
> that is spawning the executor. To give the most trivial example, it is currently 
> possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} 
> based on information readily available from the environment (e.g. the k8s 
> downward API). However, {{spark.pyspark.python}} and family cannot be set 
> because when {{spark-env.sh}} executes it has lost all of the python context. 
> We can instruct users to add the appropriate config variables, but this form 
> of cargo-culting is error-prone and not scalable. It would be much better to 
> expose important python variables so that pyspark is not a second-class 
> citizen.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will 
> receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing 
> to the python executable.
> h3. How was this patch tested?
> To be perfectly honest, I don't know where this fits into the testing 
> infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to 
> java_gateway.py and that works, but in terms of adding this to the CI ... I'm 
> at a loss. I'm more than willing to add the additional info, if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37650) Tell spark-env.sh the python interpreter

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37650:


Assignee: Apache Spark

> Tell spark-env.sh the python interpreter
> 
>
> Key: SPARK-37650
> URL: https://issues.apache.org/jira/browse/SPARK-37650
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Andrew Malone Melo
>Assignee: Apache Spark
>Priority: Major
>
> When loading config defaults via spark-env.sh, it can be useful to know
> the current pyspark python interpreter to allow the configuration to set
> values properly. Pass this value in the environment as
> _PYSPARK_DRIVER_SYS_EXECUTABLE to the environment script.
> h3. What changes were proposed in this pull request?
> It's currently possible to set sensible site-wide spark configuration 
> defaults by using {{{}$SPARK_CONF_DIR/spark-env.sh{}}}. In the case where a 
> user is using pyspark, however, there are a number of things that aren't 
> discoverable by that script, due to the way that it's called. There is a 
> chain of calls (java_gateway.py -> shell script -> java -> shell script) that 
> ends up obliterating any bit of the python context.
> This change proposes to add an environment variable 
> {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} which points to the filename of the 
> top-level python executable within pyspark's {{java_gateway.py}} 
> bootstrapping process. With that, spark-env.sh will be able to infer enough 
> information about the python environment to set the appropriate configuration 
> variables.
> h3. Why are the changes needed?
> Right now, there are a number of config options useful to pyspark that can't be 
> reliably set by {{spark-env.sh}} because it is unaware of the python context 
> that is spawning the executor. To give the most trivial example, it is currently 
> possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} 
> based on information readily available from the environment (e.g. the k8s 
> downward API). However, {{spark.pyspark.python}} and family cannot be set 
> because when {{spark-env.sh}} executes it has lost all of the python context. 
> We can instruct users to add the appropriate config variables, but this form 
> of cargo-culting is error-prone and not scalable. It would be much better to 
> expose important python variables so that pyspark is not a second-class 
> citizen.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will 
> receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing 
> to the python executable.
> h3. How was this patch tested?
> To be perfectly honest, I don't know where this fits into the testing 
> infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to 
> java_gateway.py and that works, but in terms of adding this to the CI ... I'm 
> at a loss. I'm more than willing to add the additional info, if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect

2021-12-14 Thread YuanGuanhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YuanGuanhu updated SPARK-37643:
---
Description: 
This ticket aims at fixing a bug where right-padding is not applied to char-type 
partition columns when charVarcharAsString is true and the partition data 
length is less than the defined length.
For example, the query below returns nothing on master, but the correct result is 
`abc`.
{code:java}
scala> sql("set spark.sql.legacy.charVarcharAsString=true")
scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned by 
(c)")
scala> sql("INSERT INTO tb01 values(1, 'abc')")
scala> sql("select c from tb01 where c = 'abc'").show
+---+
|  c|
+---+
+---+{code}
This is because `ApplyCharTypePadding` right-pads the expression to charLength. We should 
handle this taking the spark.sql.legacy.charVarcharAsString conf value into account.

  was:
This ticket aim at fixing the bug that does not apply right-padding for char 
types partition column when charVarcharAsString is true and partition data 
length is lower than defined length.
For example, a query below returns nothing in master, but a correct result is 
`abc`.
{code:java}
scala> sql("set spark.sql.legacy.charVarcharAsString=true")
scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned by 
(c)")
scala> sql("INSERT INTO tb01 values(1, 'abc')")
scala> sql("select c from tb01 where c = 'abc'").show
+---+
|  c|
+---+
+---+{code}
This is because `ApplyCharTypePadding` rpad the expr to charLength. We should 
handle this consider conf spark.sql.legacy.charVarcharAsString value.


> when charVarcharAsString is true, char datatype partition table query 
> incorrect
> ---
>
> Key: SPARK-37643
> URL: https://issues.apache.org/jira/browse/SPARK-37643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
> Environment: spark 3.2.0
>Reporter: YuanGuanhu
>Priority: Major
>
> This ticket aims at fixing a bug where right-padding is not applied to char-type 
> partition columns when charVarcharAsString is true and the partition data 
> length is less than the defined length.
> For example, the query below returns nothing on master, but the correct result is 
> `abc`.
> {code:java}
> scala> sql("set spark.sql.legacy.charVarcharAsString=true")
> scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned 
> by (c)")
> scala> sql("INSERT INTO tb01 values(1, 'abc')")
> scala> sql("select c from tb01 where c = 'abc'").show
> +---+
> |  c|
> +---+
> +---+{code}
> This is because `ApplyCharTypePadding` right-pads the expression to charLength. We should 
> handle this taking the spark.sql.legacy.charVarcharAsString conf value into account.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37650) Tell spark-env.sh the python interpreter

2021-12-14 Thread Andrew Malone Melo (Jira)
Andrew Malone Melo created SPARK-37650:
--

 Summary: Tell spark-env.sh the python interpreter
 Key: SPARK-37650
 URL: https://issues.apache.org/jira/browse/SPARK-37650
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Andrew Malone Melo


When loading config defaults via spark-env.sh, it can be useful to know
the current pyspark python interpreter to allow the configuration to set
values properly. Pass this value in the environment as
_PYSPARK_DRIVER_SYS_EXECUTABLE to the environment script.
h3. What changes were proposed in this pull request?

It's currently possible to set sensible site-wide spark configuration defaults 
by using {{{}$SPARK_CONF_DIR/spark-env.sh{}}}. In the case where a user is 
using pyspark, however, there are a number of things that aren't discoverable 
by that script, due to the way that it's called. There is a chain of calls 
(java_gateway.py -> shell script -> java -> shell script) that ends up 
obliterating any bit of the python context.

This change proposes to add an environment variable 
{{_PYSPARK_DRIVER_SYS_EXECUTABLE}} which points to the filename of the 
top-level python executable within pyspark's {{java_gateway.py}} bootstrapping 
process. With that, spark-env.sh will be able to infer enough information about 
the python environment to set the appropriate configuration variables.
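
For illustration, a rough sketch (not the actual patch) of what the 
{{java_gateway.py}} side could look like; the launch command below is 
simplified, and only the environment-variable name comes from this proposal:
{code:python}
import os
import sys
from subprocess import Popen

# Rough sketch of pyspark's JVM launch: propagate the driver's python
# executable so that spark-env.sh (sourced by the launch scripts) can see it.
env = dict(os.environ)
env["_PYSPARK_DRIVER_SYS_EXECUTABLE"] = sys.executable

spark_home = os.environ.get("SPARK_HOME", ".")
command = [os.path.join(spark_home, "bin", "spark-submit"), "pyspark-shell"]
proc = Popen(command, env=env)
# spark-env.sh could then do something like:
#   export PYSPARK_PYTHON="$_PYSPARK_DRIVER_SYS_EXECUTABLE"
{code}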
h3. Why are the changes needed?

Right now, there are a number of config options useful to pyspark that can't be 
reliably set by {{spark-env.sh}} because it is unaware of the python context 
that spawned it. To give the most trivial example, it is currently possible to 
set {{spark.kubernetes.container.image}} or {{spark.driver.host}} based on 
information readily available from the environment (e.g. the k8s downward API). 
However, {{spark.pyspark.python}} and family cannot be set because, by the time 
{{spark-env.sh}} executes, all of the python context has been lost. We can 
instruct users to add the appropriate config variables themselves, but this 
form of cargo-culting is error-prone and not scalable. It would be much better 
to expose the important python variables so that pyspark is not a second-class 
citizen.
h3. Does this PR introduce _any_ user-facing change?

Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will receive 
an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing to the 
python executable.
h3. How was this patch tested?

To be perfectly honest, I don't know where this fits into the testing 
infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to 
java_gateway.py and that works, but in terms of adding this to the CI ... I'm 
at a loss. I'm more than willing to add the additional info, if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-12-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459579#comment-17459579
 ] 

Hyukjin Kwon commented on SPARK-24853:
--

[~nchammas], actually I don't think this can truly handle duplicate column 
names ... allowing a Column will open up possibilities such as col("count") or 
other expressions such as from_json, etc.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column type instead of a String? That way I can specify the FQN in 
> cases where there are duplicate column names, e.g. if I have 2 columns with 
> the same name as a result of a join and I want to rename one of the fields, 
> I can do it with this new API.
>  
> This would be similar to Drop api which supports both String and Column type.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459578#comment-17459578
 ] 

Apache Spark commented on SPARK-37649:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34902

> Switch default index to distributed-sequence by default in pandas API on Spark
> --
>
> Key: SPARK-37649
> URL: https://issues.apache.org/jira/browse/SPARK-37649
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> pandas API on Spark currently sets {{compute.default_index_type}} to 
> {{sequence}}, which relies on sending all data to one executor and easily 
> causes OOM.
> We should switch to the {{distributed-sequence}} type, which truly 
> distributes the data.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37649:


Assignee: Apache Spark

> Switch default index to distributed-sequence by default in pandas API on Spark
> --
>
> Key: SPARK-37649
> URL: https://issues.apache.org/jira/browse/SPARK-37649
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> pandas API on Spark currently sets {{compute.default_index_type}} to 
> {{sequence}}, which relies on sending all data to one executor and easily 
> causes OOM.
> We should switch to the {{distributed-sequence}} type, which truly 
> distributes the data.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37649:


Assignee: (was: Apache Spark)

> Switch default index to distributed-sequence by default in pandas API on Spark
> --
>
> Key: SPARK-37649
> URL: https://issues.apache.org/jira/browse/SPARK-37649
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> pandas API on Spark currently sets {{compute.default_index_type}} to 
> {{sequence}}, which relies on sending all data to one executor and easily 
> causes OOM.
> We should switch to the {{distributed-sequence}} type, which truly 
> distributes the data.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37274) When the value of this parameter is greater than the maximum value of int type, the value will be thrown out of bounds. The document description of this parameter shoul

2021-12-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37274.
--
Resolution: Won't Fix

> When the value of this parameter is greater than the maximum value of int 
> type, the value will be thrown out of bounds. The document description of 
> this parameter should remind the user of this risk point
> 
>
> Key: SPARK-37274
> URL: https://issues.apache.org/jira/browse/SPARK-37274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hao
>Priority: Major
>
> When the value of this parameter is greater than the maximum value of the int 
> type, an out-of-bounds error is thrown. The documentation for this parameter 
> should remind the user of this risk.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37626) Upgrade libthrift to 0.15.0

2021-12-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37626.
--
Fix Version/s: (was: 3.3.0)
   Resolution: Duplicate

> Upgrade libthrift to 0.15.0
> ---
>
> Key: SPARK-37626
> URL: https://issues.apache.org/jira/browse/SPARK-37626
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Priority: Major
>
> Upgrade libthrift to 1.15.0 in order to avoid 
> https://nvd.nist.gov/vuln/detail/CVE-2020-13949.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark

2021-12-14 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37649:


 Summary: Switch default index to distributed-sequence by default 
in pandas API on Spark
 Key: SPARK-37649
 URL: https://issues.apache.org/jira/browse/SPARK-37649
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


pandas API on Spark currently sets {{compute.default_index_type}} to 
{{sequence}}, which relies on sending all data to one executor and easily 
causes OOM.

We should switch to the {{distributed-sequence}} type, which truly distributes 
the data.
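
For reference, a minimal sketch of what the distributed-sequence index looks 
like from the user side today, using the documented pandas-on-Spark option API 
(the example data is illustrative; this issue only changes the default value):
{code:python}
import pyspark.pandas as ps

# Users can already opt in explicitly; this issue proposes making it the default.
ps.set_option("compute.default_index_type", "distributed-sequence")

psdf = ps.range(10)   # the index is attached without collecting all rows to one executor
print(psdf.head(3))
{code}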



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36168) Support Scala 2.13 in `dev/test-dependencies.sh`

2021-12-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36168.
--
Resolution: Not A Problem

> Support Scala 2.13 in `dev/test-dependencies.sh`
> 
>
> Key: SPARK-36168
> URL: https://issues.apache.org/jira/browse/SPARK-36168
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36584) ExecutorMonitor#onBlockUpdated will receive event from driver

2021-12-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36584.
--
Resolution: Invalid

> ExecutorMonitor#onBlockUpdated will receive event from driver
> -
>
> Key: SPARK-36584
> URL: https://issues.apache.org/jira/browse/SPARK-36584
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.1.2
> Environment: Spark 3.1.2
>Reporter: 胡振宇
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When the driver broadcasts an object, it sends the 
> [SparkListenerBlockUpdated|https://github.com/apache/spark/blob/df0ec56723f0b47c3629055fa7a8c63bb4285147/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L228]
>  event. 
> [ExecutorMonitor#onBlockUpdated|https://github.com/apache/spark/blob/df0ec56723f0b47c3629055fa7a8c63bb4285147/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L380]
>  receives and handles the event; in this method it calls 
> [ensureExecutorIsTracked|https://github.com/apache/spark/blob/df0ec56723f0b47c3629055fa7a8c63bb4285147/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L489]
>  to put the driver in the `executors` variable with `UNKNOWN_RESOURCE_PROFILE_ID`. In 
> my understanding, `ExecutorMonitor` should only monitor executors, not the driver. 
> Although this will not cause any problems at the moment because 
> UNKNOWN_RESOURCE_PROFILE_ID will be filtered out, I think this is a 
> potential risk.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37620) Use more precise types for SparkContext Optional fields (i.e. _gateway, _jvm)

2021-12-14 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-37620:
---
Parent: SPARK-37094
Issue Type: Sub-task  (was: Improvement)

> Use more precise types for SparkContext Optional fields  (i.e. _gateway, _jvm)
> --
>
> Key: SPARK-37620
> URL: https://issues.apache.org/jira/browse/SPARK-37620
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> As a part of SPARK-37152 [we 
> agreed|https://github.com/apache/spark/pull/34466/files#r762609181] to keep 
> these typed as not {{Optional}} for simplicity, but this just a temporary 
> solution to move things forward, until we decide on best approach to handle 
> such cases.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37630) Security issue from Log4j 1.X exploit

2021-12-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37630.
--
Resolution: Duplicate

> Security issue from Log4j 1.X exploit
> -
>
> Key: SPARK-37630
> URL: https://issues.apache.org/jira/browse/SPARK-37630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Ismail H
>Priority: Major
>  Labels: security
>
> log4j is being used in version [1.2.17|#L122]
>  
> This version has been deprecated and has [a known issue that hasn't been 
> addressed in 1.X versions|https://www.cvedetails.com/cve/CVE-2019-17571/].
>  
> *Solution:*
>  * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last 
> known issues|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit

2021-12-14 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459567#comment-17459567
 ] 

Sean R. Owen commented on SPARK-37630:
--

Spark depends on a whole lot of other libraries, and they use log4j 1.x, like 
Hadoop. That's most of the issue. Spark doesn't really care about the logging 
framework, though it touches log4j -- mostly to configure logs from other 
libraries.

Anyone can try to fix it, but it's harder than it sounds. You'd have to figure 
out how to plumb log4j 1.x calls to something else and exclude all log4j 1.x 
dependencies. This is a duplicate of other JIRAs.

> Security issue from Log4j 1.X exploit
> -
>
> Key: SPARK-37630
> URL: https://issues.apache.org/jira/browse/SPARK-37630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Ismail H
>Priority: Major
>  Labels: security
>
> log4j is being used in version [1.2.17|#L122]
>  
> This version has been deprecated and has [a known issue that hasn't been 
> addressed in 1.X versions|https://www.cvedetails.com/cve/CVE-2019-17571/].
>  
> *Solution:*
>  * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last 
> known issues|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37625) update log4j to 2.15

2021-12-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37625.
--
Resolution: Duplicate

> update log4j to 2.15 
> -
>
> Key: SPARK-37625
> URL: https://issues.apache.org/jira/browse/SPARK-37625
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: weifeng zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26404) set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster mode.

2021-12-14 Thread Tim Sanders (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459552#comment-17459552
 ] 

Tim Sanders commented on SPARK-26404:
-

I'm running into this as well.  I'm on Spark 3.2.0, not using Kubernetes.

 

Setting {{spark.pyspark.python}} via {{SparkSession.builder.config}} has no 
effect, but setting {{os.environ['PYSPARK_PYTHON']}} works as expected.

Setting {{PYSPARK_PYTHON}} inside of {{spark-env.sh}} does _not_ seem to work 
with {{SparkSession}}, but it _does_ work with {{spark-submit}}.

Setting {{spark.pyspark.python}} inside of {{spark-defaults.conf}} does seem to 
work for both {{spark-submit}} and {{SparkSession}}, but it doesn't help my use 
case as I want to change the selected version at runtime based on the version 
that the client is running (without maintaining multiple configs).
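
For what it's worth, a minimal sketch of the only combination that worked for me 
at runtime (the interpreter path is taken from this issue and is illustrative):
{code:python}
import os
from pyspark.sql import SparkSession

# Setting the environment variable before the session is created works ...
os.environ["PYSPARK_PYTHON"] = "/opt/pythonenvs/bin/python"

spark = (
    SparkSession.builder
    # ... while this config alone had no effect in my tests:
    # .config("spark.pyspark.python", "/opt/pythonenvs/bin/python")
    .getOrCreate()
)
{code}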

 

I agree with [~vpadulan]; I'm not sure why this was marked as "Not a Problem". 
Any chance we can get this re-opened?

> set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster 
> mode.
> ---
>
> Key: SPARK-26404
> URL: https://issues.apache.org/jira/browse/SPARK-26404
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Dongqing  Liu
>Priority: Major
>
> Neither
>    conf.set("spark.executorEnv.PYSPARK_PYTHON", "/opt/pythonenvs/bin/python")
> nor 
>   conf.set("spark.pyspark.python", "/opt/pythonenvs/bin/python") 
> works. 
> Looks like the executor always picks python from PATH.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37146) Inline type hints for python/pyspark/__init__.py

2021-12-14 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37146.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34433
[https://github.com/apache/spark/pull/34433]

> Inline type hints for python/pyspark/__init__.py
> 
>
> Key: SPARK-37146
> URL: https://issues.apache.org/jira/browse/SPARK-37146
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37146) Inline type hints for python/pyspark/__init__.py

2021-12-14 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37146:
--

Assignee: dch nguyen

> Inline type hints for python/pyspark/__init__.py
> 
>
> Key: SPARK-37146
> URL: https://issues.apache.org/jira/browse/SPARK-37146
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37648) Spark catalog and Delta tables

2021-12-14 Thread Hanna Liashchuk (Jira)
Hanna Liashchuk created SPARK-37648:
---

 Summary: Spark catalog and Delta tables
 Key: SPARK-37648
 URL: https://issues.apache.org/jira/browse/SPARK-37648
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2
 Environment: Spark version 3.1.2
Scala version 2.12.10
Hive version 2.3.7
Delta version 1.0.0
Reporter: Hanna Liashchuk


I'm using Spark with Delta tables; while the tables are created, no columns 
are listed for them.

Steps to reproduce:
1. Start spark-shell 
{code:java}
spark-shell --conf 
"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
 --conf "spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY"{code}
2. Create delta table
{code:java}
spark.range(10).write.format("delta").option("path", 
"tmp/delta").saveAsTable("delta"){code}
3. Make sure table exists 
{code:java}
spark.catalog.listTables.show{code}
4. Find out that the columns are not listed
{code:java}
spark.catalog.listColumns("delta").show{code}

This is critical for Delta integration with BI tools such as Power BI or 
Tableau, as they query the spark catalog for metadata and we get errors that no 
columns are found. 
Discussion can be found in Delta repository - 
https://github.com/delta-io/delta/issues/695



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2021-12-14 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-25150.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

It looks like Spark 3.1.2 exhibits a different sort of broken behavior:
{code:java}
pyspark.sql.utils.AnalysisException: Column State#38 are ambiguous. It's 
probably because you joined several Datasets together, and some of these 
Datasets are the same. This column points to one of the Datasets but Spark is 
unable to figure out which one. Please alias the Datasets with different names 
via `Dataset.as` before joining them, and specify the column using qualified 
name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set 
spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check. {code}
I don't think the join in {{zombie-analysis.py}} is ambiguous, and since this 
now works fine in Spark 3.2.0, that's what I'm going to mark as the "Fix 
Version" for this issue.

The fix must have made it in somewhere between Spark 3.1.2 and 3.2.0.
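
For reference, the aliasing pattern that the 3.1.2 error message recommends 
looks like this from PySpark (the DataFrame and column names below are 
illustrative, not taken from the attached reproduction):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "NY"), (2, "CA")], ["id", "State"])

# Alias both sides of the self-join and qualify the columns, as the
# AnalysisException suggests.
joined = df.alias("a").join(df.alias("b"), F.col("a.id") > F.col("b.id"))
joined.select(F.col("a.State"), F.col("b.State")).show()
{code}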

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0
>
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459494#comment-17459494
 ] 

Nicholas Chammas commented on SPARK-25150:
--

I re-ran my test (described in the issue description + summarized in my comment 
just above) on Spark 3.2.0, and this issue appears to be resolved! Whether with 
cross joins enabled or disabled, I now get the correct results.

Obviously, I have no clue what change since Spark 2.4.3 (the last time I reran 
this test) was responsible for the fix.

But to be clear, in case anyone wants to reproduce my test:
 # Download all 6 files attached to this issue into a folder.
 # Then, from within that folder, run {{spark-submit zombie-analysis.py}} and 
inspect the output.
 # Then, enable cross joins (commented out on line 9), rerun the script, and 
reinspect the output.
 # Compare the final bit of output from both runs against 
{{{}expected-output.txt{}}}.

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459467#comment-17459467
 ] 

Nicholas Chammas commented on SPARK-24853:
--

[~hyukjin.kwon] - Are you still opposed to this proposed improvement? If not, 
I'd like to work on it.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column type instead of a String? That way I can specify the FQN in 
> cases where there are duplicate column names, e.g. if I have 2 columns with 
> the same name as a result of a join and I want to rename one of the fields, 
> I can do it with this new API.
>  
> This would be similar to Drop api which supports both String and Column type.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26589) proper `median` method for spark dataframe

2021-12-14 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-26589.
--
Resolution: Won't Fix

Marking this as "Won't Fix", but I suppose if someone really wanted to, they 
could reopen this issue and propose adding a median function that is simply an 
alias for {{{}percentile(col, 0.5){}}}. Don't know how the committers would 
feel about that.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a 
> duplicate. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this Feature Request for a 
> proper, exact median function, not an approximation. I am aware of the 
> difficulties caused by a distributed environment when computing a median; 
> nevertheless, I don't think those difficulties are a good enough reason to 
> drop the `median` function from the scope of Spark. I am not asking for an 
> efficient median, just an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37645) Word spell error - "labeled" spells as "labled"

2021-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37645:
--
Component/s: Tests

> Word spell error - "labeled" spells as "labled"
> ---
>
> Key: SPARK-37645
> URL: https://issues.apache.org/jira/browse/SPARK-37645
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Assignee: qian
>Priority: Minor
> Fix For: 3.3.0
>
>
> Word spell error - "labeled" spells as "labled"



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37645) Word spell error - "labeled" spells as "labled"

2021-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37645:
--
Affects Version/s: 3.3.0
   (was: 3.1.0)
   (was: 3.2.0)
   (was: 3.1.1)

> Word spell error - "labeled" spells as "labled"
> ---
>
> Key: SPARK-37645
> URL: https://issues.apache.org/jira/browse/SPARK-37645
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: qian
>Assignee: qian
>Priority: Minor
> Fix For: 3.3.0
>
>
> Word spell error - "labeled" spells as "labled"



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459455#comment-17459455
 ] 

Nicholas Chammas commented on SPARK-26589:
--

It looks like making a distributed, memory-efficient implementation of median 
is not possible using the design of Catalyst as it stands today. For more 
details, please see [this thread on the dev 
list|http://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3cCAOhmDzev8d4H20XT1hUP9us=cpjeysgcf+xev7lg7dka1gj...@mail.gmail.com%3e].

It's possible to get an exact median today by using {{{}percentile(col, 
0.5){}}}, which is [available via the SQL 
API|https://spark.apache.org/docs/3.2.0/sql-ref-functions-builtin.html#aggregate-functions].
 It's not memory-efficient, so it may not work well on large datasets.
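
For example, a small sketch of that workaround from PySpark via {{expr()}} (the 
column name is illustrative):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 6).withColumnRenamed("id", "x")   # values 1..5

# Exact median via the SQL percentile function; not memory-efficient on large data.
df.select(F.expr("percentile(x, 0.5)").alias("median")).show()   # median of 1..5 is 3.0
{code}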

The Python and Scala DataFrame APIs do not offer this exact percentile 
function, so I've filed SPARK-37647 to track exposing this function in those 
APIs.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a 
> duplicate. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this Feature Request for a 
> proper, exact median function, not an approximation. I am aware of the 
> difficulties caused by a distributed environment when computing a median; 
> nevertheless, I don't think those difficulties are a good enough reason to 
> drop the `median` function from the scope of Spark. I am not asking for an 
> efficient median, just an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37647) Expose percentile function in Scala/Python APIs

2021-12-14 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37647:


 Summary: Expose percentile function in Scala/Python APIs
 Key: SPARK-37647
 URL: https://issues.apache.org/jira/browse/SPARK-37647
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


SQL offers a percentile function (exact, not approximate) that is not available 
directly in the Scala or Python DataFrame APIs.

While it is possible to invoke SQL functions from Scala or Python via 
{{{}expr(){}}}, I think most users expect function parity across Scala, Python, 
and SQL. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37645) Word spell error - "labeled" spells as "labled"

2021-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37645:
-

Assignee: qian

> Word spell error - "labeled" spells as "labled"
> ---
>
> Key: SPARK-37645
> URL: https://issues.apache.org/jira/browse/SPARK-37645
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: qian
>Assignee: qian
>Priority: Minor
> Fix For: 3.3.0
>
>
> Word spell error - "labeled" spells as "labled"



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37645) Word spell error - "labeled" spells as "labled"

2021-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37645.
---
Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/34899

> Word spell error - "labeled" spells as "labled"
> ---
>
> Key: SPARK-37645
> URL: https://issues.apache.org/jira/browse/SPARK-37645
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: qian
>Priority: Minor
> Fix For: 3.3.0
>
>
> Word spell error - "labeled" spells as "labled"



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables

2021-12-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37217:
-
Fix Version/s: 3.2.1

> The number of dynamic partitions should early check when writing to external 
> tables
> ---
>
> Key: SPARK-37217
> URL: https://issues.apache.org/jira/browse/SPARK-37217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.2.1, 3.3.0
>
>
> [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduces a 
> mechanism in which writes to external tables use dynamic partitioning, and 
> the data in the target partitions is deleted first.
> Assuming that 1001 partitions are written, the data of those 1001 partitions 
> is deleted first, but because hive.exec.max.dynamic.partitions is 1000 by 
> default, loadDynamicPartitions then fails, even though the data of the 1001 
> partitions has already been deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37646:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Avoid touching Scala reflection APIs in the lit function
> 
>
> Key: SPARK-37646
> URL: https://issues.apache.org/jira/browse/SPARK-37646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Currently lit is slow when the concurrency is high as it needs to hit the 
> Scala reflection code which hits global locks. For example, running the 
> following test locally using Spark 3.2 shows the difference:
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)import 
> org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literalval parallelism = 
> 50def testLiteral(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>          for (_ <- 0 until 50) {
>           new Column(Literal(0L))
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }def testLit(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>          for (_ <- 0 until 50) {
>           lit(0L)
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }println("warmup")
> testLiteral()
> testLit()println("lit: false")
> spark.time {
>   testLiteral()
> }
> println("lit: true")
> spark.time {
>   testLit()
> }// Exiting paste mode, now interpreting.warmup
> lit: false
> Time taken: 8 ms
> lit: true
> Time taken: 682 ms
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> parallelism: Int = 50
> testLiteral: ()Unit
> testLit: ()Unit {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37646:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Avoid touching Scala reflection APIs in the lit function
> 
>
> Key: SPARK-37646
> URL: https://issues.apache.org/jira/browse/SPARK-37646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> Currently lit is slow when the concurrency is high as it needs to hit the 
> Scala reflection code which hits global locks. For example, running the 
> following test locally using Spark 3.2 shows the difference:
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)import 
> org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literalval parallelism = 
> 50def testLiteral(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>          for (_ <- 0 until 50) {
>           new Column(Literal(0L))
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }def testLit(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>          for (_ <- 0 until 50) {
>           lit(0L)
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }println("warmup")
> testLiteral()
> testLit()println("lit: false")
> spark.time {
>   testLiteral()
> }
> println("lit: true")
> spark.time {
>   testLit()
> }// Exiting paste mode, now interpreting.warmup
> lit: false
> Time taken: 8 ms
> lit: true
> Time taken: 682 ms
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> parallelism: Int = 50
> testLiteral: ()Unit
> testLit: ()Unit {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459394#comment-17459394
 ] 

Apache Spark commented on SPARK-37646:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/34901

> Avoid touching Scala reflection APIs in the lit function
> 
>
> Key: SPARK-37646
> URL: https://issues.apache.org/jira/browse/SPARK-37646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Currently lit is slow when the concurrency is high as it needs to hit the 
> Scala reflection code which hits global locks. For example, running the 
> following test locally using Spark 3.2 shows the difference:
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)import 
> org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literalval parallelism = 
> 50def testLiteral(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>          for (_ <- 0 until 50) {
>           new Column(Literal(0L))
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }def testLit(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>          for (_ <- 0 until 50) {
>           lit(0L)
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }println("warmup")
> testLiteral()
> testLit()println("lit: false")
> spark.time {
>   testLiteral()
> }
> println("lit: true")
> spark.time {
>   testLit()
> }// Exiting paste mode, now interpreting.warmup
> lit: false
> Time taken: 8 ms
> lit: true
> Time taken: 682 ms
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> parallelism: Int = 50
> testLiteral: ()Unit
> testLit: ()Unit {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function

2021-12-14 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-37646:


Assignee: Shixiong Zhu

> Avoid touching Scala reflection APIs in the lit function
> 
>
> Key: SPARK-37646
> URL: https://issues.apache.org/jira/browse/SPARK-37646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Currently lit is slow when the concurrency is high as it needs to hit the 
> Scala reflection code which hits global locks. For example, running the 
> following test locally using Spark 3.2 shows the difference:
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)import 
> org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literalval parallelism = 
> 50def testLiteral(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>          for (_ <- 0 until 50) {
>           new Column(Literal(0L))
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }def testLit(): Unit = {
>   val ts = for (_ <- 0 until parallelism) yield {
>     new Thread() {
>       override def run() {
>          for (_ <- 0 until 50) {
>           lit(0L)
>         }
>       }
>     }
>   }
>   ts.foreach(_.start())
>   ts.foreach(_.join())
> }println("warmup")
> testLiteral()
> testLit()println("lit: false")
> spark.time {
>   testLiteral()
> }
> println("lit: true")
> spark.time {
>   testLit()
> }// Exiting paste mode, now interpreting.warmup
> lit: false
> Time taken: 8 ms
> lit: true
> Time taken: 682 ms
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.Literal
> parallelism: Int = 50
> testLiteral: ()Unit
> testLit: ()Unit {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37646) Avoid touching Scala reflection APIs in the lit function

2021-12-14 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-37646:


 Summary: Avoid touching Scala reflection APIs in the lit function
 Key: SPARK-37646
 URL: https://issues.apache.org/jira/browse/SPARK-37646
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Shixiong Zhu


Currently, lit is slow when concurrency is high, as it needs to go through the 
Scala reflection code, which takes global locks. For example, running the 
following test locally using Spark 3.2 shows the difference:
{code:java}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal

val parallelism = 50

def testLiteral(): Unit = {
  val ts = for (_ <- 0 until parallelism) yield {
    new Thread() {
      override def run() {
        for (_ <- 0 until 50) {
          new Column(Literal(0L))
        }
      }
    }
  }
  ts.foreach(_.start())
  ts.foreach(_.join())
}

def testLit(): Unit = {
  val ts = for (_ <- 0 until parallelism) yield {
    new Thread() {
      override def run() {
        for (_ <- 0 until 50) {
          lit(0L)
        }
      }
    }
  }
  ts.foreach(_.start())
  ts.foreach(_.join())
}

println("warmup")
testLiteral()
testLit()

println("lit: false")
spark.time {
  testLiteral()
}
println("lit: true")
spark.time {
  testLit()
}

// Exiting paste mode, now interpreting.

warmup
lit: false
Time taken: 8 ms
lit: true
Time taken: 682 ms
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal
parallelism: Int = 50
testLiteral: ()Unit
testLit: ()Unit
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28884) [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode deployment

2021-12-14 Thread Aleksey Tsalolikhin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459286#comment-17459286
 ] 

Aleksey Tsalolikhin commented on SPARK-28884:
-

I've run into this as well, on Spark 3.1.2.  I launched in "cluster" deploy 
mode, with driver cores set to 32 but the UI shows 0:

 

!Screen Shot 2021-12-14 at 8.31.23 AM.png!

> [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode 
> deployment
> -
>
> Key: SPARK-28884
> URL: https://issues.apache.org/jira/browse/SPARK-28884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Core0.png, Core1.png, Screen Shot 2021-12-14 at 8.31.23 
> AM.png, Screen+Shot+2021-12-09+at+9.39.58+AM.png
>
>
> Launch spark sql in local and yarn mode
> bin/spark-shell --master local
> bin/spark-shell --master yarn( client and cluster both)
> vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn 
> --deploy-mode cluster --class org.apache.spark.examples.SparkPi 
> /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10
> vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn 
> --deploy-mode client --class org.apache.spark.examples.SparkPi 
> /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10
> Open the UI and check the driver cores: it displays 0, but in local mode it 
> displays 1.
> Expectation: It should display 1 by default.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28884) [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode deployment

2021-12-14 Thread Aleksey Tsalolikhin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Tsalolikhin updated SPARK-28884:

Attachment: Screen Shot 2021-12-14 at 8.31.23 AM.png

> [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode 
> deployment
> -
>
> Key: SPARK-28884
> URL: https://issues.apache.org/jira/browse/SPARK-28884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Core0.png, Core1.png, Screen Shot 2021-12-14 at 8.31.23 
> AM.png, Screen+Shot+2021-12-09+at+9.39.58+AM.png
>
>
> Launch spark sql in local and yarn mode
> bin/spark-shell --master local
> bin/spark-shell --master yarn( client and cluster both)
> vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn 
> --deploy-mode cluster --class org.apache.spark.examples.SparkPi 
> /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10
> vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn 
> --deploy-mode client --class org.apache.spark.examples.SparkPi 
> /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10
> Open the UI and check the driver cores: it displays 0, but in local mode it 
> displays 1.
> Expectation: It should display 1 by default.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28884) [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode deployment

2021-12-14 Thread Aleksey Tsalolikhin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Tsalolikhin updated SPARK-28884:

Attachment: Screen+Shot+2021-12-09+at+9.39.58+AM.png

> [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode 
> deployment
> -
>
> Key: SPARK-28884
> URL: https://issues.apache.org/jira/browse/SPARK-28884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Core0.png, Core1.png, Screen Shot 2021-12-14 at 8.31.23 
> AM.png, Screen+Shot+2021-12-09+at+9.39.58+AM.png
>
>
> Launch spark-shell/spark-submit in local and YARN modes:
> bin/spark-shell --master local
> bin/spark-shell --master yarn (client and cluster modes)
> vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn 
> --deploy-mode cluster --class org.apache.spark.examples.SparkPi 
> /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10
> vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn 
> --deploy-mode client --class org.apache.spark.examples.SparkPi 
> /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10
> Open the UI and check the driver cores: in cluster mode it displays 0, but in local mode it displays 1.
> Expectation: it should display 1 by default.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37643:


Assignee: Apache Spark

> when charVarcharAsString is true, char datatype partition table query 
> incorrect
> ---
>
> Key: SPARK-37643
> URL: https://issues.apache.org/jira/browse/SPARK-37643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
> Environment: spark 3.2.0
>Reporter: YuanGuanhu
>Assignee: Apache Spark
>Priority: Major
>
> This ticket aims to fix a bug where right-padding is not applied to char-type
> partition columns when charVarcharAsString is true and the partition data is
> shorter than the defined length.
> For example, the query below returns nothing on master, but the correct result is
> `abc`.
> {code:java}
> scala> sql("set spark.sql.legacy.charVarcharAsString=true")
> scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned 
> by (c)")
> scala> sql("INSERT INTO tb01 values(1, 'abc')")
> scala> sql("select c from tb01 where c = 'abc'").show
> +---+
> |  c|
> +---+
> +---+{code}
> This is because `ApplyCharTypePadding` right-pads the expression to charLength. We
> should handle this by taking the spark.sql.legacy.charVarcharAsString value into account.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459273#comment-17459273
 ] 

Apache Spark commented on SPARK-37643:
--

User 'fhygh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34900

> when charVarcharAsString is true, char datatype partition table query 
> incorrect
> ---
>
> Key: SPARK-37643
> URL: https://issues.apache.org/jira/browse/SPARK-37643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
> Environment: spark 3.2.0
>Reporter: YuanGuanhu
>Priority: Major
>
> This ticket aims to fix a bug where right-padding is not applied to char-type
> partition columns when charVarcharAsString is true and the partition data is
> shorter than the defined length.
> For example, the query below returns nothing on master, but the correct result is
> `abc`.
> {code:java}
> scala> sql("set spark.sql.legacy.charVarcharAsString=true")
> scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned 
> by (c)")
> scala> sql("INSERT INTO tb01 values(1, 'abc')")
> scala> sql("select c from tb01 where c = 'abc'").show
> +---+
> |  c|
> +---+
> +---+{code}
> This is because `ApplyCharTypePadding` right-pads the expression to charLength. We
> should handle this by taking the spark.sql.legacy.charVarcharAsString value into account.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37643:


Assignee: (was: Apache Spark)

> when charVarcharAsString is true, char datatype partition table query 
> incorrect
> ---
>
> Key: SPARK-37643
> URL: https://issues.apache.org/jira/browse/SPARK-37643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
> Environment: spark 3.2.0
>Reporter: YuanGuanhu
>Priority: Major
>
> This ticket aims to fix a bug where right-padding is not applied to char-type
> partition columns when charVarcharAsString is true and the partition data is
> shorter than the defined length.
> For example, the query below returns nothing on master, but the correct result is
> `abc`.
> {code:java}
> scala> sql("set spark.sql.legacy.charVarcharAsString=true")
> scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned 
> by (c)")
> scala> sql("INSERT INTO tb01 values(1, 'abc')")
> scala> sql("select c from tb01 where c = 'abc'").show
> +---+
> |  c|
> +---+
> +---+{code}
> This is because `ApplyCharTypePadding` right-pads the expression to charLength. We
> should handle this by taking the spark.sql.legacy.charVarcharAsString value into account.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37592) Improve performance of JoinSelection

2021-12-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37592.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34844
[https://github.com/apache/spark/pull/34844]

> Improve performance of JoinSelection
> 
>
> Key: SPARK-37592
> URL: https://issues.apache.org/jira/browse/SPARK-37592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> While reading the AQE implementation, I found that the logic for selecting a join
> with hints contains a lot of cumbersome code.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37592) Improve performance of JoinSelection

2021-12-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37592:
---

Assignee: jiaan.geng

> Improve performance of JoinSelection
> 
>
> Key: SPARK-37592
> URL: https://issues.apache.org/jira/browse/SPARK-37592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> While reading the AQE implementation, I found that the logic for selecting a join
> with hints contains a lot of cumbersome code.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37645) Word spell error - "labeled" spells as "labled"

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459180#comment-17459180
 ] 

Apache Spark commented on SPARK-37645:
--

User 'dcoliversun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34899

> Word spell error - "labeled" spells as "labled"
> ---
>
> Key: SPARK-37645
> URL: https://issues.apache.org/jira/browse/SPARK-37645
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: qian
>Priority: Minor
> Fix For: 3.3.0
>
>
> Word spell error - "labeled" spells as "labled"



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37645) Word spell error - "labeled" spells as "labled"

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37645:


Assignee: Apache Spark

> Word spell error - "labeled" spells as "labled"
> ---
>
> Key: SPARK-37645
> URL: https://issues.apache.org/jira/browse/SPARK-37645
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: qian
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> Word spell error - "labeled" spells as "labled"



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37645) Word spell error - "labeled" spells as "labled"

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37645:


Assignee: (was: Apache Spark)

> Word spell error - "labeled" spells as "labled"
> ---
>
> Key: SPARK-37645
> URL: https://issues.apache.org/jira/browse/SPARK-37645
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: qian
>Priority: Minor
> Fix For: 3.3.0
>
>
> Word spell error - "labeled" spells as "labled"



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37645) Word spell error - "labeled" spells as "labled"

2021-12-14 Thread qian (Jira)
qian created SPARK-37645:


 Summary: Word spell error - "labeled" spells as "labled"
 Key: SPARK-37645
 URL: https://issues.apache.org/jira/browse/SPARK-37645
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.2.0, 3.1.1, 3.1.0
Reporter: qian
 Fix For: 3.3.0


Word spell error - "labeled" spells as "labled"



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37567) reuse Exchange failed

2021-12-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37567.
--
Resolution: Not A Problem

> reuse Exchange failed 
> --
>
> Key: SPARK-37567
> URL: https://issues.apache.org/jira/browse/SPARK-37567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: junbiao chen
>Priority: Major
>  Labels: bugfix, performance, pull-request-available
> Attachments: execution stage(1)-query2.png, execution 
> stage-query2.png, physical plan-query2.png
>
>
> PR available: https://github.com/apache/spark/pull/34858
> Use case: query2 in TPC-DS. In the logical plan there are three exchange subqueries
> that scan the same table "store_sales", and these subqueries satisfy the exchange
> reuse rule. I confirmed that the exchange reuse rule works in the physical plan,
> but when Spark executes the physical plan, exchange reuse fails and the supposedly
> reused exchange is executed twice.
> physical plan:
> !physical plan-query2.png!
>  
> execution stages:
> !execution stage-query2.png!
>  
> !execution stage(1)-query2.png!
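
Not query2 itself, but a minimal spark-shell sketch (the data and column name below are made up for illustration) for checking whether exchange reuse shows up in a physical plan:
{code:java}
scala> // Two identical aggregations unioned together: both sides produce the same shuffle
scala> // exchange, so the plan should normally contain a ReusedExchange node when reuse works.
scala> val agg = spark.range(1000).selectExpr("id % 10 AS k").groupBy("k").count()
scala> agg.union(agg).explain()   // look for ReusedExchange (or its absence) in the output
{code}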



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37568) Support 2-arguments by the convert_timezone() function

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459135#comment-17459135
 ] 

Apache Spark commented on SPARK-37568:
--

User 'yoda-mon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34896

> Support 2-arguments by the convert_timezone() function
> --
>
> Key: SPARK-37568
> URL: https://issues.apache.org/jira/browse/SPARK-37568
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> # If sourceTs is a timestamp_ntz, take the sourceTz from the session time 
> zone, see the SQL config spark.sql.session.timeZone
> # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the 
> targetTz
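
A sketch of how the proposed two-argument form described above might be invoked; the argument order (target time zone first) and the TIMESTAMP_NTZ literal syntax are assumptions based on this description, not confirmed API:
{code:java}
scala> // Proposed semantics, roughly: for a TIMESTAMP_NTZ input the source zone is taken
scala> // from spark.sql.session.timeZone, so only the target zone needs to be passed.
scala> sql("SET spark.sql.session.timeZone = UTC")
scala> sql("SELECT convert_timezone('America/Los_Angeles', TIMESTAMP_NTZ'2021-12-14 10:00:00')").show(false)
{code}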



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37568) Support 2-arguments by the convert_timezone() function

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37568:


Assignee: Apache Spark

> Support 2-arguments by the convert_timezone() function
> --
>
> Key: SPARK-37568
> URL: https://issues.apache.org/jira/browse/SPARK-37568
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> # If sourceTs is a timestamp_ntz, take the sourceTz from the session time 
> zone, see the SQL config spark.sql.session.timeZone
> # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the 
> targetTz



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37568) Support 2-arguments by the convert_timezone() function

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37568:


Assignee: (was: Apache Spark)

> Support 2-arguments by the convert_timezone() function
> --
>
> Key: SPARK-37568
> URL: https://issues.apache.org/jira/browse/SPARK-37568
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> # If sourceTs is a timestamp_ntz, take the sourceTz from the session time 
> zone, see the SQL config spark.sql.session.timeZone
> # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the 
> targetTz



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37644) Support datasource v2 complete aggregate pushdown

2021-12-14 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459117#comment-17459117
 ] 

jiaan.geng commented on SPARK-37644:


I'm working on it.

> Support datasource v2 complete aggregate pushdown 
> --
>
> Key: SPARK-37644
> URL: https://issues.apache.org/jira/browse/SPARK-37644
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports aggregate pushdown with a partial aggregate and a final
> aggregate. For some data sources (e.g. JDBC), we can avoid the partial and final
> aggregates by running the aggregation completely on the database.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37644) Support datasource v2 complete aggregate pushdown

2021-12-14 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-37644:
--

 Summary: Support datasource v2 complete aggregate pushdown 
 Key: SPARK-37644
 URL: https://issues.apache.org/jira/browse/SPARK-37644
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng


Currently, Spark supports aggregate pushdown with a partial aggregate and a final
aggregate. For some data sources (e.g. JDBC), we can avoid the partial and final
aggregates by running the aggregation completely on the database.
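
A rough way to see whether an aggregate was pushed into a DS v2 scan today, and where the partial/final split appears; the h2 catalog name and the test.employee table below are purely illustrative assumptions, not anything that ships with Spark:
{code:java}
scala> // Assumes a JDBC v2 catalog was registered beforehand, e.g.:
scala> //   spark.sql.catalog.h2        = org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog
scala> //   spark.sql.catalog.h2.url / .driver pointing at an H2 database with a test.employee table
scala> sql("SELECT dept, max(salary) FROM h2.test.employee GROUP BY dept").explain()
scala> // With pushdown, the scan node lists the pushed-down aggregate expressions; the pair of
scala> // HashAggregate nodes above it is the partial/final split this ticket proposes to avoid.
{code}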



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37635) SHOW TBLPROPERTIES should print the fully qualified table name

2021-12-14 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-37635.

Fix Version/s: 3.3.0
 Assignee: Wenchen Fan
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/34890

> SHOW TBLPROPERTIES should print the fully qualified table name
> --
>
> Key: SPARK-37635
> URL: https://issues.apache.org/jira/browse/SPARK-37635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37310) Migrate ALTER NAMESPACE ... SET PROPERTIES to use v2 command by default

2021-12-14 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-37310.

Fix Version/s: 3.3.0
 Assignee: Terry Kim  (was: Apache Spark)
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/34891

> Migrate ALTER NAMESPACE ... SET PROPERTIES to use v2 command by default
> ---
>
> Key: SPARK-37310
> URL: https://issues.apache.org/jira/browse/SPARK-37310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Migrate ALTER NAMESPACE ... SET PROPERTIES to use v2 command by default



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37641) Support ANSI Aggregate Function: regr_r2

2021-12-14 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37641 ]


jiaan.geng deleted comment on SPARK-37641:


was (Author: beliefer):
I'm working on.

> Support ANSI Aggregate Function: regr_r2
> 
>
> Key: SPARK-37641
> URL: https://issues.apache.org/jira/browse/SPARK-37641
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> REGR_R2 is an ANSI aggregate function. Many databases support it.
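
Until a native regr_r2 lands, a rough equivalent for the simple-regression case can be sketched with the existing corr aggregate (assuming non-degenerate data; this approximates the semantics, it is not the eventual implementation):
{code:java}
scala> // For a simple linear regression of y on x, R^2 equals corr(y, x) squared.
scala> val df = spark.range(100).selectExpr("id AS x", "2 * id + 3 AS y")
scala> df.selectExpr("power(corr(y, x), 2) AS r2").show()
{code}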



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit

2021-12-14 Thread Ismail H (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459018#comment-17459018
 ] 

Ismail H commented on SPARK-37630:
--

Sean, thanks for the comment. Since this is my first contribution in the 
community, I might have gotten some of the description wrong. My bad.

* "The problem is Spark dependencies that need log4j 1.x." Are you saying that 
those dependencies are different from Spark Core itself? Is there a Spark 
"internal component" that uses log4j?
* Regarding "it'd be nice to update to log4j 2(.15)": can anyone contribute 
and try resolving it, or is it part of a specific roadmap?

> Security issue from Log4j 1.X exploit
> -
>
> Key: SPARK-37630
> URL: https://issues.apache.org/jira/browse/SPARK-37630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Ismail H
>Priority: Major
>  Labels: security
>
> log4j is being used in version [1.2.17|#L122]
>  
> This version has been deprecated and has since had a [known issue that 
> hasn't been addressed in 1.x 
> versions|https://www.cvedetails.com/cve/CVE-2019-17571/].
>  
> *Solution:*
>  * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last 
> known issues|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]
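
On Ismail's question about which component pulls in log4j: one way to see which log4j 1.x jar actually ends up on the driver classpath of a given deployment (a diagnostic sketch, not a fix):
{code:java}
scala> // Resolves the jar that provides the log4j 1.x Logger class on this JVM's classpath, if any.
scala> Option(Class.forName("org.apache.log4j.Logger").getProtectionDomain.getCodeSource).map(_.getLocation)
{code}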



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37630) Security issue from Log4j 1.X exploit

2021-12-14 Thread Ismail H (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismail H updated SPARK-37630:
-
Summary: Security issue from Log4j 1.X exploit  (was: Security issue from 
Log4j 0day exploit)

> Security issue from Log4j 1.X exploit
> -
>
> Key: SPARK-37630
> URL: https://issues.apache.org/jira/browse/SPARK-37630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Ismail H
>Priority: Major
>  Labels: security
>
> log4j is being used in version [1.2.17|#L122]
>  
> This version has been deprecated and has since had a [known issue that 
> hasn't been addressed in 1.x 
> versions|https://www.cvedetails.com/cve/CVE-2019-17571/].
>  
> *Solution:*
>  * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last 
> known issues|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect

2021-12-14 Thread YuanGuanhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YuanGuanhu updated SPARK-37643:
---
Description: 
This ticket aim at fixing the bug that does not apply right-padding for char 
types partition column when charVarcharAsString is true and partition data 
length is lower than defined length.
For example, a query below returns nothing in master, but a correct result is 
`abc`.
{code:java}
scala> sql("set spark.sql.legacy.charVarcharAsString=true")
scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned by 
(c)")
scala> sql("INSERT INTO tb01 values(1, 'abc')")
scala> sql("select c from tb01 where c = 'abc'").show
+---+
|  c|
+---+
+---+{code}
This is because `ApplyCharTypePadding` right-pads the expression to charLength. We
should handle this by taking the spark.sql.legacy.charVarcharAsString value into account.

  was:
This ticket aim at fixing the bug that does not apply right-padding for char 
types partition column when charVarcharAsString is true and partition data 
length is lower than defined length.
For example, a query below returns nothing in master, but a correct result is 
`abc`.
{code:java}
scala> sql("set spark.sql.legacy.charVarcharAsString=true")
scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned by 
(c)")
scala> sql("INSERT INTO tb01 values(1, 'abc')")
scala> sql("select * from tb01 where c = 'abc'").show
+---+
|  c|
+---+
+---+{code}
This is because `ApplyCharTypePadding` rpad the expr to charLength. We should 
handle this consider conf spark.sql.legacy.charVarcharAsString value.


> when charVarcharAsString is true, char datatype partition table query 
> incorrect
> ---
>
> Key: SPARK-37643
> URL: https://issues.apache.org/jira/browse/SPARK-37643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
> Environment: spark 3.2.0
>Reporter: YuanGuanhu
>Priority: Major
>
> This ticket aims to fix a bug where right-padding is not applied to char-type
> partition columns when charVarcharAsString is true and the partition data is
> shorter than the defined length.
> For example, the query below returns nothing on master, but the correct result is
> `abc`.
> {code:java}
> scala> sql("set spark.sql.legacy.charVarcharAsString=true")
> scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned 
> by (c)")
> scala> sql("INSERT INTO tb01 values(1, 'abc')")
> scala> sql("select c from tb01 where c = 'abc'").show
> +---+
> |  c|
> +---+
> +---+{code}
> This is because `ApplyCharTypePadding` right-pads the expression to charLength. We
> should handle this by taking the spark.sql.legacy.charVarcharAsString value into account.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect

2021-12-14 Thread YuanGuanhu (Jira)
YuanGuanhu created SPARK-37643:
--

 Summary: when charVarcharAsString is true, char datatype partition 
table query incorrect
 Key: SPARK-37643
 URL: https://issues.apache.org/jira/browse/SPARK-37643
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.1.2
 Environment: spark 3.2.0
Reporter: YuanGuanhu


This ticket aims to fix a bug where right-padding is not applied to char-type
partition columns when charVarcharAsString is true and the partition data is
shorter than the defined length.
For example, the query below returns nothing on master, but the correct result is
`abc`.
{code:java}
scala> sql("set spark.sql.legacy.charVarcharAsString=true")
scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned by 
(c)")
scala> sql("INSERT INTO tb01 values(1, 'abc')")
scala> sql("select * from tb01 where c = 'abc'").show
+---+
|  c|
+---+
+---+{code}
This is because `ApplyCharTypePadding` right-pads the expression to charLength. We
should handle this by taking the spark.sql.legacy.charVarcharAsString value into account.
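
A tiny illustration of the mismatch described above: the padding rule effectively compares a right-padded literal against the unpadded value stored for the partition (a sketch only; the rule's internals are more involved):
{code:java}
scala> // 'abc' padded to char(5) becomes 'abc  ', which no longer equals the stored 'abc'.
scala> sql("SELECT rpad('abc', 5, ' ') = 'abc' AS padded_equals_unpadded").show()
{code}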



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6305) Add support for log4j 2.x to Spark

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6305:
---

Assignee: (was: Apache Spark)

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.
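
A hedged sketch of what the swap looks like in an sbt build (the coordinates are the standard log4j 2 artifacts; this is illustrative for a downstream project, not Spark's actual Maven build):
{code:java}
// build.sbt sketch: bring in log4j 2 plus its slf4j binding, and exclude the 1.x lineage.
libraryDependencies ++= Seq(
  "org.apache.logging.log4j" % "log4j-api"        % "2.15.0",
  "org.apache.logging.log4j" % "log4j-core"       % "2.15.0",
  "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.15.0"  // routes slf4j to log4j 2
)
excludeDependencies ++= Seq(
  ExclusionRule("log4j", "log4j"),            // drop log4j 1.x
  ExclusionRule("org.slf4j", "slf4j-log4j12") // drop the slf4j -> log4j 1.x binding
)
{code}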



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6305) Add support for log4j 2.x to Spark

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6305:
---

Assignee: Apache Spark

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Assignee: Apache Spark
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459007#comment-17459007
 ] 

Apache Spark commented on SPARK-6305:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/34895

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37641) Support ANSI Aggregate Function: regr_r2

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37641:


Assignee: (was: Apache Spark)

> Support ANSI Aggregate Function: regr_r2
> 
>
> Key: SPARK-37641
> URL: https://issues.apache.org/jira/browse/SPARK-37641
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> REGR_R2 is an ANSI aggregate function. Many databases support it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37641) Support ANSI Aggregate Function: regr_r2

2021-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37641:


Assignee: Apache Spark

> Support ANSI Aggregate Function: regr_r2
> 
>
> Key: SPARK-37641
> URL: https://issues.apache.org/jira/browse/SPARK-37641
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> REGR_R2 is an ANSI aggregate function. Many databases support it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37641) Support ANSI Aggregate Function: regr_r2

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458989#comment-17458989
 ] 

Apache Spark commented on SPARK-37641:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34894

> Support ANSI Aggregate Function: regr_r2
> 
>
> Key: SPARK-37641
> URL: https://issues.apache.org/jira/browse/SPARK-37641
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> REGR_R2 is an ANSI aggregate function. Many databases support it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37641) Support ANSI Aggregate Function: regr_r2

2021-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458988#comment-17458988
 ] 

Apache Spark commented on SPARK-37641:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34894

> Support ANSI Aggregate Function: regr_r2
> 
>
> Key: SPARK-37641
> URL: https://issues.apache.org/jira/browse/SPARK-37641
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> REGR_R2 is an ANSI aggregate function. Many databases support it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


