[jira] [Commented] (SPARK-38911) 'test 1 resource profile' throws exception when running it in IDEA separately

2022-04-14 Thread Bobby Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522646#comment-17522646
 ] 

Bobby Wang commented on SPARK-38911:


[~tgraves] Could you help check this?

> 'test 1 resource profile' throws exception when running it in IDEA separately
> -
>
> Key: SPARK-38911
> URL: https://issues.apache.org/jira/browse/SPARK-38911
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.2.1
>Reporter: Bobby Wang
>Priority: Minor
>
> The test `test 1 resource profile` of DAGSchedulerSuite will fail if I run it 
> in IDEA separately.
>     
> The root cause is that the ResourceProfile is initialized before the SparkContext, 
> so it takes `DEFAULT_RESOURCE_PROFILE_ID` as the resource profile id, while 
> the test asserts that the id is not equal to DEFAULT_RESOURCE_PROFILE_ID.
>     
> {code:java}
> assert(expectedid.get != ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID){code}
>     
> The exception is as follows:
>     
> {code:java}
>     0 equaled 0
>     ScalaTestFailureLocation: org.apache.spark.scheduler.DAGSchedulerSuite at 
> (DAGSchedulerSuite.scala:3269)
>     org.scalatest.exceptions.TestFailedException: 0 equaled 0
>             at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>             at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>             at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>             at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>             at 
> org.apache.spark.scheduler.DAGSchedulerSuite.$anonfun$new$191(DAGSchedulerSuite.scala:3269)
>             at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85){code}
> This issue does not occur when running the whole DAGSchedulerSuite, since 
> the SparkContext is initialized at the very beginning.
>  
> I will submit a patch to fix it.






[jira] [Commented] (SPARK-38911) 'test 1 resource profile' throws exception when running it in IDEA separately

2022-04-14 Thread Bobby Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522645#comment-17522645
 ] 

Bobby Wang commented on SPARK-38911:


Just submitted a PR for this issue: https://github.com/apache/spark/pull/36208

> 'test 1 resource profile' throws exception when running it in IDEA separately
> -
>
> Key: SPARK-38911
> URL: https://issues.apache.org/jira/browse/SPARK-38911
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.2.1
>Reporter: Bobby Wang
>Priority: Minor
>
> The test `test 1 resource profile` of DAGSchedulerSuite will fail if I run it 
> in IDEA separately.
>     
> The root cause is that the ResourceProfile is initialized before the SparkContext, 
> so it takes `DEFAULT_RESOURCE_PROFILE_ID` as the resource profile id, while 
> the test asserts that the id is not equal to DEFAULT_RESOURCE_PROFILE_ID.
>     
> {code:java}
> assert(expectedid.get != ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID){code}
>     
> The exception is as follows:
>     
> {code:java}
>     0 equaled 0
>     ScalaTestFailureLocation: org.apache.spark.scheduler.DAGSchedulerSuite at 
> (DAGSchedulerSuite.scala:3269)
>     org.scalatest.exceptions.TestFailedException: 0 equaled 0
>             at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>             at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>             at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>             at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>             at 
> org.apache.spark.scheduler.DAGSchedulerSuite.$anonfun$new$191(DAGSchedulerSuite.scala:3269)
>             at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85){code}
> This issue does not occur when running the whole DAGSchedulerSuite, since 
> the SparkContext is initialized at the very beginning.
>  
> I will submit a patch to fix it.






[jira] [Commented] (SPARK-38911) 'test 1 resource profile' throws exception when running it in IDEA separately

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522644#comment-17522644
 ] 

Apache Spark commented on SPARK-38911:
--

User 'wbo4958' has created a pull request for this issue:
https://github.com/apache/spark/pull/36208

> 'test 1 resource profile' throws exception when running it in IDEA separately
> -
>
> Key: SPARK-38911
> URL: https://issues.apache.org/jira/browse/SPARK-38911
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.2.1
>Reporter: Bobby Wang
>Priority: Minor
>
> The test `test 1 resource profile` of DAGSchedulerSuite will fail if I run it 
> in IDEA separately.
>     
> The root cause is that the ResourceProfile is initialized before the SparkContext, 
> so it takes `DEFAULT_RESOURCE_PROFILE_ID` as the resource profile id, while 
> the test asserts that the id is not equal to DEFAULT_RESOURCE_PROFILE_ID.
>     
> {code:java}
> assert(expectedid.get != ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID){code}
>     
> The exception is as follows:
>     
> {code:java}
>     0 equaled 0
>     ScalaTestFailureLocation: org.apache.spark.scheduler.DAGSchedulerSuite at 
> (DAGSchedulerSuite.scala:3269)
>     org.scalatest.exceptions.TestFailedException: 0 equaled 0
>             at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>             at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>             at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>             at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>             at 
> org.apache.spark.scheduler.DAGSchedulerSuite.$anonfun$new$191(DAGSchedulerSuite.scala:3269)
>             at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85){code}
> This issue does not occur when running the whole DAGSchedulerSuite, since 
> the SparkContext is initialized at the very beginning.
>  
> I will submit a patch to fix it.






[jira] [Commented] (SPARK-38911) 'test 1 resource profile' throws exception when running it in IDEA separately

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522643#comment-17522643
 ] 

Apache Spark commented on SPARK-38911:
--

User 'wbo4958' has created a pull request for this issue:
https://github.com/apache/spark/pull/36208

> 'test 1 resource profile' throws exception when running it in IDEA separately
> -
>
> Key: SPARK-38911
> URL: https://issues.apache.org/jira/browse/SPARK-38911
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.2.1
>Reporter: Bobby Wang
>Priority: Minor
>
> The test `test 1 resource profile` of DAGSchedulerSuite will fail if I run it 
> in IDEA separately.
>     
> The root cause is that the ResourceProfile is initialized before the SparkContext, 
> so it takes `DEFAULT_RESOURCE_PROFILE_ID` as the resource profile id, while 
> the test asserts that the id is not equal to DEFAULT_RESOURCE_PROFILE_ID.
>     
> {code:java}
> assert(expectedid.get != ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID){code}
>     
> The exception is as follows:
>     
> {code:java}
>     0 equaled 0
>     ScalaTestFailureLocation: org.apache.spark.scheduler.DAGSchedulerSuite at 
> (DAGSchedulerSuite.scala:3269)
>     org.scalatest.exceptions.TestFailedException: 0 equaled 0
>             at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>             at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>             at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>             at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>             at 
> org.apache.spark.scheduler.DAGSchedulerSuite.$anonfun$new$191(DAGSchedulerSuite.scala:3269)
>             at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85){code}
> This issue does not occur when running the whole DAGSchedulerSuite, since 
> the SparkContext is initialized at the very beginning.
>  
> I will submit a patch to fix it.






[jira] [Assigned] (SPARK-38911) 'test 1 resource profile' throws exception when running it in IDEA separately

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38911:


Assignee: Apache Spark

> 'test 1 resource profile' throws exception when running it in IDEA separately
> -
>
> Key: SPARK-38911
> URL: https://issues.apache.org/jira/browse/SPARK-38911
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.2.1
>Reporter: Bobby Wang
>Assignee: Apache Spark
>Priority: Minor
>
> The test `test 1 resource profile` of DAGSchedulerSuite will fail if I run it 
> in IDEA separately.
>     
> The root cause is that the ResourceProfile is initialized before the SparkContext, 
> so it takes `DEFAULT_RESOURCE_PROFILE_ID` as the resource profile id, while 
> the test asserts that the id is not equal to DEFAULT_RESOURCE_PROFILE_ID.
>     
> {code:java}
> assert(expectedid.get != ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID){code}
>     
> The exception is as follows:
>     
> {code:java}
>     0 equaled 0
>     ScalaTestFailureLocation: org.apache.spark.scheduler.DAGSchedulerSuite at 
> (DAGSchedulerSuite.scala:3269)
>     org.scalatest.exceptions.TestFailedException: 0 equaled 0
>             at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>             at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>             at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>             at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>             at 
> org.apache.spark.scheduler.DAGSchedulerSuite.$anonfun$new$191(DAGSchedulerSuite.scala:3269)
>             at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85){code}
> This issue does not occur when running the whole DAGSchedulerSuite, since 
> the SparkContext is initialized at the very beginning.
>  
> I will submit a patch to fix it.






[jira] [Assigned] (SPARK-38911) 'test 1 resource profile' throws exception when running it in IDEA separately

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38911:


Assignee: (was: Apache Spark)

> 'test 1 resource profile' throws exception when running it in IDEA separately
> -
>
> Key: SPARK-38911
> URL: https://issues.apache.org/jira/browse/SPARK-38911
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.2.1
>Reporter: Bobby Wang
>Priority: Minor
>
> The test `test 1 resource profile` of DAGSchedulerSuite will fail if I run it 
> in IDEA separately.
>     
> The root cause is that the ResourceProfile is initialized before the SparkContext, 
> so it takes `DEFAULT_RESOURCE_PROFILE_ID` as the resource profile id, while 
> the test asserts that the id is not equal to DEFAULT_RESOURCE_PROFILE_ID.
>     
> {code:java}
> assert(expectedid.get != ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID){code}
>     
> The exception is as follows:
>     
> {code:java}
>     0 equaled 0
>     ScalaTestFailureLocation: org.apache.spark.scheduler.DAGSchedulerSuite at 
> (DAGSchedulerSuite.scala:3269)
>     org.scalatest.exceptions.TestFailedException: 0 equaled 0
>             at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>             at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>             at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>             at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>             at 
> org.apache.spark.scheduler.DAGSchedulerSuite.$anonfun$new$191(DAGSchedulerSuite.scala:3269)
>             at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85){code}
> This issue does not occur when running the whole DAGSchedulerSuite, since 
> the SparkContext is initialized at the very beginning.
>  
> I will submit a patch to fix it.






[jira] [Created] (SPARK-38911) 'test 1 resource profile' throws exception when running it in IDEA separately

2022-04-14 Thread Bobby Wang (Jira)
Bobby Wang created SPARK-38911:
--

 Summary: 'test 1 resource profile' throws exception when running 
it in IDEA separately
 Key: SPARK-38911
 URL: https://issues.apache.org/jira/browse/SPARK-38911
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.2.1
Reporter: Bobby Wang


The test `test 1 resource profile` of DAGSchedulerSuite will fail if I run it 
in IDEA separately.
    
The root cause is that the ResourceProfile is initialized before the SparkContext, 
so it takes `DEFAULT_RESOURCE_PROFILE_ID` as the resource profile id, while the 
test asserts that the id is not equal to DEFAULT_RESOURCE_PROFILE_ID.
    
{code:java}
assert(expectedid.get != ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID){code}

    
The exception is as follows:
    
{code:java}
    0 equaled 0
    ScalaTestFailureLocation: org.apache.spark.scheduler.DAGSchedulerSuite at 
(DAGSchedulerSuite.scala:3269)
    org.scalatest.exceptions.TestFailedException: 0 equaled 0
            at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
            at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
            at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
            at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
            at 
org.apache.spark.scheduler.DAGSchedulerSuite.$anonfun$new$191(DAGSchedulerSuite.scala:3269)
            at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85){code}
This issue does not occur when running the whole DAGSchedulerSuite, since the 
SparkContext is initialized at the very beginning.

 

I will submit a patch to fix it.
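
For illustration, here is a minimal sketch of the ordering dependency described 
above (an illustration of the report only, not the suite's code or the eventual 
patch): building a ResourceProfile before any SparkContext exists hands it the 
default id, which is exactly what the assertion rejects.

{code:scala}
import org.apache.spark.resource.{ResourceProfile, ResourceProfileBuilder, TaskResourceRequests}

// No SparkContext has been created yet in this JVM.
val rprof = new ResourceProfileBuilder()
  .require(new TaskResourceRequests().cpus(1))
  .build()

// Per the report, the profile ends up with DEFAULT_RESOURCE_PROFILE_ID (0) here,
// so an assertion like the one in the test fails with "0 equaled 0" when the
// test runs in isolation.
assert(rprof.id != ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
{code}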






[jira] [Assigned] (SPARK-38910) Clean sparkStaging dir when WAIT_FOR_APP_COMPLETION is false too

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38910:


Assignee: (was: Apache Spark)

> Clean sparkStaging dir when WAIT_FOR_APP_COMPLETION is false too
> 
>
> Key: SPARK-38910
> URL: https://issues.apache.org/jira/browse/SPARK-38910
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.2.1, 3.3.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
>  def run(): Unit = {
> submitApplication()
> if (!launcherBackend.isConnected() && fireAndForget) {
>   val report = getApplicationReport(appId)
>   val state = report.getYarnApplicationState
>   logInfo(s"Application report for $appId (state: $state)")
>   logInfo(formatReportDetails(report, getDriverLogsLink(report)))
>   if (state == YarnApplicationState.FAILED || state == 
> YarnApplicationState.KILLED) {
> throw new SparkException(s"Application $appId finished with status: 
> $state")
>   }
> } else {
>   val YarnAppReport(appState, finalState, diags) = 
> monitorApplication(appId)
>   if (appState == YarnApplicationState.FAILED || finalState == 
> FinalApplicationStatus.FAILED) {
> var amContainerSucceed = false
> val amContainerExitMsg = s"AM Container for " +
>   
> s"${yarnClient.getApplicationReport(appId).getCurrentApplicationAttemptId} " +
>   s"exited with  exitCode: 0"
> diags.foreach { err =>
>   logError(s"Application diagnostics message: $err")
>   if (err.contains(amContainerExitMsg)) {
> amContainerSucceed = true
>   
> {code}
> The staging dir is not cleaned when the following case matches:
> {code:java}
> !launcherBackend.isConnected() && fireAndForget
> {code}






[jira] [Commented] (SPARK-38910) Clean sparkStaging dir when WAIT_FOR_APP_COMPLETION is false too

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522638#comment-17522638
 ] 

Apache Spark commented on SPARK-38910:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/36207

> Clean sparkStaging dir when WAIT_FOR_APP_COMPLETION is false too
> 
>
> Key: SPARK-38910
> URL: https://issues.apache.org/jira/browse/SPARK-38910
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.2.1, 3.3.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
>  def run(): Unit = {
> submitApplication()
> if (!launcherBackend.isConnected() && fireAndForget) {
>   val report = getApplicationReport(appId)
>   val state = report.getYarnApplicationState
>   logInfo(s"Application report for $appId (state: $state)")
>   logInfo(formatReportDetails(report, getDriverLogsLink(report)))
>   if (state == YarnApplicationState.FAILED || state == 
> YarnApplicationState.KILLED) {
> throw new SparkException(s"Application $appId finished with status: 
> $state")
>   }
> } else {
>   val YarnAppReport(appState, finalState, diags) = 
> monitorApplication(appId)
>   if (appState == YarnApplicationState.FAILED || finalState == 
> FinalApplicationStatus.FAILED) {
> var amContainerSucceed = false
> val amContainerExitMsg = s"AM Container for " +
>   
> s"${yarnClient.getApplicationReport(appId).getCurrentApplicationAttemptId} " +
>   s"exited with  exitCode: 0"
> diags.foreach { err =>
>   logError(s"Application diagnostics message: $err")
>   if (err.contains(amContainerExitMsg)) {
> amContainerSucceed = true
>   
> {code}
> The staging dir is not cleaned when the following case matches:
> {code:java}
> !launcherBackend.isConnected() && fireAndForget
> {code}






[jira] [Assigned] (SPARK-38910) Clean sparkStaging dir when WAIT_FOR_APP_COMPLETION is false too

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38910:


Assignee: Apache Spark

> Clean sparkStaging dir when WAIT_FOR_APP_COMPLETION is false too
> 
>
> Key: SPARK-38910
> URL: https://issues.apache.org/jira/browse/SPARK-38910
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.2.1, 3.3.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
>  def run(): Unit = {
> submitApplication()
> if (!launcherBackend.isConnected() && fireAndForget) {
>   val report = getApplicationReport(appId)
>   val state = report.getYarnApplicationState
>   logInfo(s"Application report for $appId (state: $state)")
>   logInfo(formatReportDetails(report, getDriverLogsLink(report)))
>   if (state == YarnApplicationState.FAILED || state == 
> YarnApplicationState.KILLED) {
> throw new SparkException(s"Application $appId finished with status: 
> $state")
>   }
> } else {
>   val YarnAppReport(appState, finalState, diags) = 
> monitorApplication(appId)
>   if (appState == YarnApplicationState.FAILED || finalState == 
> FinalApplicationStatus.FAILED) {
> var amContainerSucceed = false
> val amContainerExitMsg = s"AM Container for " +
>   
> s"${yarnClient.getApplicationReport(appId).getCurrentApplicationAttemptId} " +
>   s"exited with  exitCode: 0"
> diags.foreach { err =>
>   logError(s"Application diagnostics message: $err")
>   if (err.contains(amContainerExitMsg)) {
> amContainerSucceed = true
>   
> {code}
> The staging dir is not cleaned when the following case matches:
> {code:java}
> !launcherBackend.isConnected() && fireAndForget
> {code}






[jira] [Created] (SPARK-38910) Clean sparkStaging dir when WAIT_FOR_APP_COMPLETION is false too

2022-04-14 Thread angerszhu (Jira)
angerszhu created SPARK-38910:
-

 Summary: Clean sparkStaging dir when WAIT_FOR_APP_COMPLETION is 
false too
 Key: SPARK-38910
 URL: https://issues.apache.org/jira/browse/SPARK-38910
 Project: Spark
  Issue Type: Task
  Components: YARN
Affects Versions: 3.2.1, 3.3.0
Reporter: angerszhu



{code:java}
 def run(): Unit = {
submitApplication()
if (!launcherBackend.isConnected() && fireAndForget) {
  val report = getApplicationReport(appId)
  val state = report.getYarnApplicationState
  logInfo(s"Application report for $appId (state: $state)")
  logInfo(formatReportDetails(report, getDriverLogsLink(report)))
  if (state == YarnApplicationState.FAILED || state == 
YarnApplicationState.KILLED) {
throw new SparkException(s"Application $appId finished with status: 
$state")
  }
} else {
  val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
  if (appState == YarnApplicationState.FAILED || finalState == 
FinalApplicationStatus.FAILED) {
var amContainerSucceed = false
val amContainerExitMsg = s"AM Container for " +
  
s"${yarnClient.getApplicationReport(appId).getCurrentApplicationAttemptId} " +
  s"exited with  exitCode: 0"
diags.foreach { err =>
  logError(s"Application diagnostics message: $err")
  if (err.contains(amContainerExitMsg)) {
amContainerSucceed = true
  
{code}


The staging dir is not cleaned when the following case matches:
{code:java}
!launcherBackend.isConnected() && fireAndForget
{code}
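
A rough sketch of the behaviour change this asks for (purely illustrative, not 
the actual Client code; the cleanup helper is a stand-in for whatever helper the 
real Client uses):

{code:scala}
object FireAndForgetCleanupSketch {
  // Stand-ins for the YARN application states checked in run() above.
  sealed trait State
  case object Failed extends State
  case object Killed extends State
  case object Finished extends State

  def run(fireAndForget: Boolean, state: State, cleanupStagingDir: () => Unit): Unit = {
    if (fireAndForget) {
      // Today this branch surfaces the failure without cleaning up; the proposal
      // is to remove the staging dir here as well.
      if (state == Failed || state == Killed) {
        cleanupStagingDir()
        throw new RuntimeException(s"Application finished with status: $state")
      }
    } else {
      // The monitored (wait-for-completion) path already cleans up.
      cleanupStagingDir()
    }
  }
}
{code}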






[jira] [Comment Edited] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-04-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522630#comment-17522630
 ] 

Dongjoon Hyun edited comment on SPARK-37814 at 4/15/22 3:07 AM:


BTW, one more tip to you, [~brentwritescode]. This JIRA SPARK-37814 is already 
resolved in `branch-3.3` and irrelevant to your future suggestion for Apache 
Spark 3.4. Since you can open a new one for your suggestion, feel free to 
suggest anything on your own JIRA. You're welcome.
{quote}If you think this is a good path forward for the Spark project, I'd be 
happy to make a Jira or GitHub issue for it if no one has yet.
{quote}


was (Author: dongjoon):
BTW, one more tip to you, [~brentwritescode]. This JIRA SPARK-37814 is already 
resolved in `branch-3.3` and irrelevant to your future suggestion for Apache 
Spark 3.4. Since you can open a new one for your suggestion, free free to 
suggest anything on your own JIRA. You're welcome.
{quote}If you think this is a good path forward for the Spark project, I'd be 
happy to make a Jira or GitHub issue for it if no one has yet.
{quote}

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.






[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-04-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522630#comment-17522630
 ] 

Dongjoon Hyun commented on SPARK-37814:
---

BTW, one more tip to you, [~brentwritescode]. This JIRA SPARK-37814 is already 
resolved in `branch-3.3` and irrelevant to your future suggestion for Apache 
Spark 3.4. Since you can open a new one for your suggestion, free free to 
suggest anything on your own JIRA. You're welcome.
{quote}If you think this is a good path forward for the Spark project, I'd be 
happy to make a Jira or GitHub issue for it if no one has yet.
{quote}

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.






[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-04-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522627#comment-17522627
 ] 

Dongjoon Hyun commented on SPARK-37814:
---

Hi, [~brentwritescode]. Thank you for the suggestion, but none of them are 
released yet, are they?

- First, I don't think the future release plan of the Apache Hadoop community 
could be the blocker for Apache Spark releases. As you know, Apache Spark 3.3+ 
has already moved to log4j2.
- For Hadoop 2, the Apache Spark binary distribution uses Hadoop 2.7.4 and we 
have no plan to upgrade to Hadoop 2.10.x. So, that doesn't look like a path for 
us, unfortunately.
- For Hadoop 3, I'm sure the Apache Spark community is going to try Apache 
Hadoop 3.3.4 in the Apache Spark 3.4 timeframe. However, there is no guarantee 
in the open source community. Apache Hadoop 3.3.2 is also still under active 
testing and we might revert it back to the old one during the RC period. Hadoop 
is one of several key dependencies which we are considering seriously.

For the Apache Hadoop releases, let's talk later when the real releases arrive, 
so that we can play around with them.

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.






[jira] [Assigned] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38909:


Assignee: Apache Spark

> Encapsulate LevelDB used by ExternalShuffleBlockResolver and 
> YarnShuffleService as LocalDB
> --
>
> Key: SPARK-38909
> URL: https://issues.apache.org/jira/browse/SPARK-38909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} 
> directly, which is not conducive to extending the use of {{RocksDB}} in 
> this scenario. This PR introduces the encapsulation for extensibility. It will 
> be the pre-work of SPARK-3






[jira] [Updated] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB

2022-04-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-38909:
-
Description: {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use 
{{LevelDB}} directly, which is not conducive to extending the use of {{RocksDB}} 
in this scenario. This PR introduces the encapsulation for extensibility. It 
will be the pre-work of SPARK-3  (was: {{ExternalShuffleBlockResolver}} and 
{{YarnShuffleService}} use {{LevelDB}} directly, which is not conducive to 
extending the use of {{RocksDB}} in this scenario. This PR introduces the 
encapsulation for extensibility. It will be the pre pr of SPARK-3)

> Encapsulate LevelDB used by ExternalShuffleBlockResolver and 
> YarnShuffleService as LocalDB
> --
>
> Key: SPARK-38909
> URL: https://issues.apache.org/jira/browse/SPARK-38909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} 
> directly, which is not conducive to extending the use of {{RocksDB}} in 
> this scenario. This PR introduces the encapsulation for extensibility. It will 
> be the pre-work of SPARK-3






[jira] [Commented] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522622#comment-17522622
 ] 

Apache Spark commented on SPARK-38909:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36200

> Encapsulate LevelDB used by ExternalShuffleBlockResolver and 
> YarnShuffleService as LocalDB
> --
>
> Key: SPARK-38909
> URL: https://issues.apache.org/jira/browse/SPARK-38909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} 
> directly, which is not conducive to extending the use of {{RocksDB}} in 
> this scenario. This PR introduces the encapsulation for extensibility. It will 
> be the pre-work of SPARK-3






[jira] [Assigned] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38909:


Assignee: (was: Apache Spark)

> Encapsulate LevelDB used by ExternalShuffleBlockResolver and 
> YarnShuffleService as LocalDB
> --
>
> Key: SPARK-38909
> URL: https://issues.apache.org/jira/browse/SPARK-38909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} 
> directly, which is not conducive to extending the use of {{RocksDB}} in 
> this scenario. This PR introduces the encapsulation for extensibility. It will 
> be the pre-work of SPARK-3






[jira] [Created] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB

2022-04-14 Thread Yang Jie (Jira)
Yang Jie created SPARK-38909:


 Summary: Encapsulate LevelDB used by ExternalShuffleBlockResolver 
and YarnShuffleService as LocalDB
 Key: SPARK-38909
 URL: https://issues.apache.org/jira/browse/SPARK-38909
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 3.4.0
Reporter: Yang Jie


{{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} 
directly, which is not conducive to extending the use of {{RocksDB}} in this 
scenario. This PR introduces the encapsulation for extensibility. It will be the 
preparatory PR for SPARK-3
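
To make the intent concrete, a minimal sketch of what such an abstraction could 
look like (names and shape are illustrative only, not the interface the actual 
PR introduces):

{code:scala}
// A tiny key-value facade so callers stop depending on LevelDB types directly.
trait LocalDB {
  def get(key: Array[Byte]): Array[Byte]
  def put(key: Array[Byte], value: Array[Byte]): Unit
  def close(): Unit
}

// One possible backend; a RocksDB-backed implementation could sit behind the
// same trait without touching ExternalShuffleBlockResolver or YarnShuffleService.
class LevelDBLocalDB(db: org.iq80.leveldb.DB) extends LocalDB {
  override def get(key: Array[Byte]): Array[Byte] = db.get(key)
  override def put(key: Array[Byte], value: Array[Byte]): Unit = db.put(key, value)
  override def close(): Unit = db.close()
}
{code}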






[jira] [Commented] (SPARK-38908) Provide query context in runtime error of Casting from String to Number/Date/Timestamp/Boolean

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522613#comment-17522613
 ] 

Apache Spark commented on SPARK-38908:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36206

> Provide query context in runtime error of Casting from String to 
> Number/Date/Timestamp/Boolean
> --
>
> Key: SPARK-38908
> URL: https://issues.apache.org/jira/browse/SPARK-38908
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-38908) Provide query context in runtime error of Casting from String to Number/Date/Timestamp/Boolean

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38908:


Assignee: Gengliang Wang  (was: Apache Spark)

> Provide query context in runtime error of Casting from String to 
> Number/Date/Timestamp/Boolean
> --
>
> Key: SPARK-38908
> URL: https://issues.apache.org/jira/browse/SPARK-38908
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-38908) Provide query context in runtime error of Casting from String to Number/Date/Timestamp/Boolean

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38908:


Assignee: Apache Spark  (was: Gengliang Wang)

> Provide query context in runtime error of Casting from String to 
> Number/Date/Timestamp/Boolean
> --
>
> Key: SPARK-38908
> URL: https://issues.apache.org/jira/browse/SPARK-38908
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-38908) Provide query context in runtime error of Casting from String to Number/Date/Timestamp/Boolean

2022-04-14 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38908:
--

 Summary: Provide query context in runtime error of Casting from 
String to Number/Date/Timestamp/Boolean
 Key: SPARK-38908
 URL: https://issues.apache.org/jira/browse/SPARK-38908
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang









[jira] [Commented] (SPARK-38904) Low cost DataFrame schema swap util

2022-04-14 Thread Rafal Wojdyla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522610#comment-17522610
 ] 

Rafal Wojdyla commented on SPARK-38904:
---

[~hyukjin.kwon] Thanks for the comment; sounds good to me. I just want to point 
out that, at least in my case, it's important that the metadata of the columns 
gets "updated".
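
For reference, one metadata-only approach that stays lazy is to re-alias the 
columns through a plain select. This is only a sketch (the helper name is made 
up here) and covers top-level column metadata only; nested fields, nullability 
and type changes are not handled:

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Rebuild each top-level column with the metadata from the target schema.
// select() is lazy, so this avoids the RDD round-trip of df.rdd.toDF(schema).
def withTopLevelMetadata(df: DataFrame, newSchema: StructType): DataFrame =
  df.select(newSchema.fields.map(f => col(f.name).as(f.name, f.metadata)): _*)
{code}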

> Low cost DataFrame schema swap util
> ---
>
> Key: SPARK-38904
> URL: https://issues.apache.org/jira/browse/SPARK-38904
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
> Let's assume I have a pyspark DataFrame with a certain schema, and I would 
> like to overwrite that schema with a new schema that I *know* is compatible. I 
> could do:
> {code:python}
> df: DataFrame
> new_schema = ...
> df.rdd.toDF(schema=new_schema)
> {code}
> Unfortunately this triggers computation as described in the link above. Is 
> there a way to do that at the metadata level (or lazily), without eagerly 
> triggering computation or conversions?
> Edit, note:
>  * the schema can be arbitrarily complicated (nested etc)
>  * new schema includes updates to description, nullability and additional 
> metadata (bonus points for updates to the type)
>  * I would like to avoid writing a custom query expression generator, 
> *unless* there's one already built into Spark that can generate query based 
> on the schema/{{{}StructType{}}}
> Copied from: 
> [https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]
> See POC of workaround/util in 
> [https://github.com/ravwojdyla/spark-schema-utils]
> Also posted in 
> [https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]






[jira] [Assigned] (SPARK-38907) Impl DataFrame.corrwith

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38907:


Assignee: Apache Spark

> Impl DataFrame.corrwith
> ---
>
> Key: SPARK-38907
> URL: https://issues.apache.org/jira/browse/SPARK-38907
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-38907) Impl DataFrame.corrwith

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38907:


Assignee: (was: Apache Spark)

> Impl DataFrame.corrwith
> ---
>
> Key: SPARK-38907
> URL: https://issues.apache.org/jira/browse/SPARK-38907
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>







[jira] [Commented] (SPARK-38907) Impl DataFrame.corrwith

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522608#comment-17522608
 ] 

Apache Spark commented on SPARK-38907:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/36205

> Impl DataFrame.corrwith
> ---
>
> Key: SPARK-38907
> URL: https://issues.apache.org/jira/browse/SPARK-38907
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>







[jira] [Created] (SPARK-38907) Impl DataFrame.corrwith

2022-04-14 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-38907:


 Summary: Impl DataFrame.corrwith
 Key: SPARK-38907
 URL: https://issues.apache.org/jira/browse/SPARK-38907
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: zhengruifeng









[jira] [Updated] (SPARK-38902) cast as char/varchar result is string, not expect data type

2022-04-14 Thread YuanGuanhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YuanGuanhu updated SPARK-38902:
---
Issue Type: Improvement  (was: Bug)

> cast as char/varchar result is string, not expect data type
> ---
>
> Key: SPARK-38902
> URL: https://issues.apache.org/jira/browse/SPARK-38902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: YuanGuanhu
>Priority: Major
>
> When casting a column to char/varchar type, the result is string, not the 
> expected data type.






[jira] [Commented] (SPARK-38904) Low cost DataFrame schema swap util

2022-04-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522599#comment-17522599
 ] 

Hyukjin Kwon commented on SPARK-38904:
--

I think we should have an API like DataFrame.select(StructType) so we don't 
need to trigger another ser/de via conversion between RDD and DataFrame.

> Low cost DataFrame schema swap util
> ---
>
> Key: SPARK-38904
> URL: https://issues.apache.org/jira/browse/SPARK-38904
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
> Let's assume I have a pyspark DataFrame with a certain schema, and I would 
> like to overwrite that schema with a new schema that I *know* is compatible. I 
> could do:
> {code:python}
> df: DataFrame
> new_schema = ...
> df.rdd.toDF(schema=new_schema)
> {code}
> Unfortunately this triggers computation as described in the link above. Is 
> there a way to do that at the metadata level (or lazily), without eagerly 
> triggering computation or conversions?
> Edit, note:
>  * the schema can be arbitrarily complicated (nested etc)
>  * new schema includes updates to description, nullability and additional 
> metadata (bonus points for updates to the type)
>  * I would like to avoid writing a custom query expression generator, 
> *unless* there's one already built into Spark that can generate query based 
> on the schema/{{{}StructType{}}}
> Copied from: 
> [https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]
> See POC of workaround/util in 
> [https://github.com/ravwojdyla/spark-schema-utils]
> Also posted in 
> [https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]






[jira] [Assigned] (SPARK-38905) Upgrade ORC to 1.6.14

2022-04-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38905:
-

Assignee: Dongjoon Hyun

> Upgrade ORC to 1.6.14
> -
>
> Key: SPARK-38905
> URL: https://issues.apache.org/jira/browse/SPARK-38905
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-38905) Upgrade ORC to 1.6.14

2022-04-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38905.
---
Fix Version/s: 3.2.2
   Resolution: Fixed

Issue resolved by pull request 36204
[https://github.com/apache/spark/pull/36204]

> Upgrade ORC to 1.6.14
> -
>
> Key: SPARK-38905
> URL: https://issues.apache.org/jira/browse/SPARK-38905
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.2
>
>







[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-04-14 Thread Brent (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522589#comment-17522589
 ] 

Brent commented on SPARK-37814:
---

[~kabhwan] [~dongjoon] I happened to notice your conversation about seeing what 
Hadoop does with regards to maintenance versions and I was just looking at 
their GitHub and Jira a little while ago.  They did indeed move to Reload4j for 
their 3.3.x, 3.2.x and 2.10.x release lines (while I believe they're moving to 
Logback for 3.4.x and beyond).

For reference, here is the Jira:  
https://issues.apache.org/jira/browse/HADOOP-18088

And here are the pull requests:
 * Hadoop 2.10.2: [https://github.com/apache/hadoop/pull/4151]
 * Hadoop 3.2.4: [https://github.com/apache/hadoop/pull/4084]
 * Hadoop 3.3.4: [https://github.com/apache/hadoop/pull/4052]

If you think this is a good path forward for the Spark project, I'd be happy to 
make a Jira or GitHub issue for it if no one has yet.

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.






[jira] [Resolved] (SPARK-38823) Incorrect result of dataset reduceGroups in java

2022-04-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38823.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 36183
[https://github.com/apache/spark/pull/36183]

> Incorrect result of dataset reduceGroups in java
> 
>
> Key: SPARK-38823
> URL: https://issues.apache.org/jira/browse/SPARK-38823
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0, 3.4.0
>Reporter: IKozar
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.3.0
>
>
> {code:java}
>   @Data
>   @NoArgsConstructor
>   @AllArgsConstructor
>   public static class Item implements Serializable {
> private String x;
> private String y;
> private int z;
> public Item addZ(int z) {
>   return new Item(x, y, this.z + z);
> }
>   } {code}
> {code:java}
> List<Item> items = List.of(
>  new Item("X1", "Y1", 1),
>  new Item("X2", "Y1", 1),
>  new Item("X1", "Y1", 1),
>  new Item("X2", "Y1", 1),
>  new Item("X3", "Y1", 1),
>  new Item("X1", "Y1", 1),
>  new Item("X1", "Y2", 1),
>  new Item("X2", "Y1", 1)); 
> Dataset<Item> ds = spark.createDataFrame(items, 
> Item.class).as(Encoders.bean(Item.class)); 
> ds.groupByKey((MapFunction<Item, Tuple2<String, String>>) item -> 
> Tuple2.apply(item.getX(), item.getY()),
> Encoders.tuple(Encoders.STRING(), Encoders.STRING())) 
> .reduceGroups((ReduceFunction<Item>) (item1, item2) -> 
>   item1.addZ(item2.getZ()))
>  .show(10);
> {code}
> result is
> {noformat}
> +--------+----------------------------------------------+
> |     key|ReduceAggregator(poc.job.JavaSparkReduce$Item)|
> +--------+----------------------------------------------+
> |{X1, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X2, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X1, Y2}|                                   {X2, Y1, 1}|
> |{X3, Y1}|                                   {X2, Y1, 1}|
> +--------+----------------------------------------------+{noformat}
> note that the key doesn't match the value






[jira] [Assigned] (SPARK-38823) Incorrect result of dataset reduceGroups in java

2022-04-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38823:


Assignee: Bruce Robbins

> Incorrect result of dataset reduceGroups in java
> 
>
> Key: SPARK-38823
> URL: https://issues.apache.org/jira/browse/SPARK-38823
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0, 3.4.0
>Reporter: IKozar
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> {code:java}
>   @Data
>   @NoArgsConstructor
>   @AllArgsConstructor
>   public static class Item implements Serializable {
> private String x;
> private String y;
> private int z;
> public Item addZ(int z) {
>   return new Item(x, y, this.z + z);
> }
>   } {code}
> {code:java}
> List<Item> items = List.of(
>  new Item("X1", "Y1", 1),
>  new Item("X2", "Y1", 1),
>  new Item("X1", "Y1", 1),
>  new Item("X2", "Y1", 1),
>  new Item("X3", "Y1", 1),
>  new Item("X1", "Y1", 1),
>  new Item("X1", "Y2", 1),
>  new Item("X2", "Y1", 1)); 
> Dataset<Item> ds = spark.createDataFrame(items, 
> Item.class).as(Encoders.bean(Item.class)); 
> ds.groupByKey((MapFunction<Item, Tuple2<String, String>>) item -> 
> Tuple2.apply(item.getX(), item.getY()),
> Encoders.tuple(Encoders.STRING(), Encoders.STRING())) 
> .reduceGroups((ReduceFunction<Item>) (item1, item2) -> 
>   item1.addZ(item2.getZ()))
>  .show(10);
> {code}
> result is
> {noformat}
> +--------+----------------------------------------------+
> |     key|ReduceAggregator(poc.job.JavaSparkReduce$Item)|
> +--------+----------------------------------------------+
> |{X1, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X2, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X1, Y2}|                                   {X2, Y1, 1}|
> |{X3, Y1}|                                   {X2, Y1, 1}|
> +--------+----------------------------------------------+{noformat}
> note that the key doesn't match the value






[jira] [Assigned] (SPARK-38898) Failed to build python docker images due to .cache not found

2022-04-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38898:


Assignee: Yikun Jiang

> Failed to build python docker images due to .cache not found
> 
>
> Key: SPARK-38898
> URL: https://issues.apache.org/jira/browse/SPARK-38898
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> rm: cannot remove '/root/.cache': No such file or directory
> Related:
> [https://github.com/volcano-sh/volcano/runs/6020604500?check_suite_focus=true#step:10:2381]






[jira] [Resolved] (SPARK-38898) Failed to build python docker images due to .cache not found

2022-04-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38898.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36198
[https://github.com/apache/spark/pull/36198]

> Failed to build python docker images due to .cache not found
> 
>
> Key: SPARK-38898
> URL: https://issues.apache.org/jira/browse/SPARK-38898
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> rm: cannot remove '/root/.cache': No such file or directory
> Related:
> [https://github.com/volcano-sh/volcano/runs/6020604500?check_suite_focus=true#step:10:2381]






[jira] [Updated] (SPARK-36664) Log time spent waiting for cluster resources

2022-04-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36664:
--
Affects Version/s: 3.4.0
   (was: 3.2.0)
   (was: 3.3.0)

> Log time spent waiting for cluster resources
> 
>
> Key: SPARK-36664
> URL: https://issues.apache.org/jira/browse/SPARK-36664
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Priority: Major
>
> To provide better visibility into why jobs might be running slowly, it would 
> be useful to log when we are waiting for resources and how long we have been 
> waiting, so that the user can be aware of any underlying cluster issue.






[jira] [Updated] (SPARK-36664) Log time spent waiting for cluster resources

2022-04-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36664:
--
Target Version/s:   (was: 3.3.0)

> Log time spent waiting for cluster resources
> 
>
> Key: SPARK-36664
> URL: https://issues.apache.org/jira/browse/SPARK-36664
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Holden Karau
>Priority: Major
>
> To provide better visibility into why jobs might be running slowly, it would 
> be useful to log when we are waiting for cluster resources and how long we 
> have been waiting, so that the user is aware of any underlying cluster issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38906) Support joinWith() with watermarks

2022-04-14 Thread John P (Jira)
John P created SPARK-38906:
--

 Summary: Support joinWith() with watermarks
 Key: SPARK-38906
 URL: https://issues.apache.org/jira/browse/SPARK-38906
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Structured Streaming
Affects Versions: 3.2.0
Reporter: John P


Structured Streaming requires a watermark for stream-stream outer joins. Because 
joinWith wraps each side's columns into a struct (see the plan in the stack 
trace below), the watermarked columns are no longer recognized in the join 
condition, which makes it impossible to use joinWith for such joins.

I've attached a self-contained project example 
[here|https://gist.github.com/jpassaro/886d5febe6e40bced03a3691115c84d5], but 
this is the relevant part:

{code:scala}
streamingDatasetLeft
  .withWatermark("whenA", "1 second")
  .joinWith(
streamingDatasetRight.withWatermark("whenB", "1 second"),
($"idA" === $"idB") && $"whenB".between($"whenA" - interval, $"whenA" + 
interval),
"leftOuter"
  )
{code}


stack trace:

{noformat}
[error] org.apache.spark.sql.AnalysisException: Stream-stream LeftOuter join 
between two streaming DataFrame/Datasets is not supported without a watermark 
in the join keys, or a watermark on the nullable side and an appropriate range 
condition;
[error] Join LeftOuter, ((_1#40.idA = _2#41.idB) AND ((_2#41.whenB >= 
cast(whenA#9-T1000ms - INTERVAL '00.2' SECOND as timestamp)) AND (_2#41.whenB 
<= cast(_1#40.whenA + INTERVAL '00.2' SECOND as timestamp
[error] :- Project [named_struct(idA, idA#8, whenA, whenA#9-T1000ms) AS _1#40]
[error] :  +- EventTimeWatermark whenA#9: timestamp, 1 seconds
[error] : +- SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
knownnotnull(assertnotnull(input[0, example.ExampleA, true])).idA, true, false) 
AS idA#8, staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
TimestampType, fromJavaTimestamp, knownnotnull(assertnotnull(input[0, 
example.ExampleA, true])).whenA, true, false) AS whenA#9]
[error] :+- MapElements 
example.Main$$$Lambda$23191/0x0008061a5840@5af65edf, interface 
org.apache.spark.sql.Row, [StructField(timestamp,TimestampType,true), 
StructField(value,LongType,true)], obj#7: example.ExampleA
[error] :   +- DeserializeToObject createexternalrow(staticinvoke(class 
org.apache.spark.sql.catalyst.util.DateTimeUtils$, ObjectType(class 
java.sql.Timestamp), toJavaTimestamp, timestamp#0, true, false), value#1L, 
StructField(timestamp,TimestampType,true), StructField(value,LongType,true)), 
obj#6: org.apache.spark.sql.Row
[error] :  +- GlobalLimit 1
[error] : +- LocalLimit 1
[error] :+- StreamingRelationV2 
org.apache.spark.sql.execution.streaming.sources.RateStreamProvider@715e4ad8, 
rate, 
org.apache.spark.sql.execution.streaming.sources.RateStreamTable@7963c608, 
[rowsPerSecond=1], [timestamp#0, value#1L]
[error] +- Project [named_struct(idB, idB#22, value, value#23, whenB, 
whenB#24-T1000ms) AS _2#41]
[error]+- EventTimeWatermark whenB#24: timestamp, 1 seconds
[error]   +- SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
knownnotnull(assertnotnull(input[0, example.ExampleB, true])).idB, true, false) 
AS idB#22, knownnotnull(assertnotnull(input[0, example.ExampleB, true])).value 
AS value#23, staticinvoke(class 
org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, 
fromJavaTimestamp, knownnotnull(assertnotnull(input[0, example.ExampleB, 
true])).whenB, true, false) AS whenB#24]
[error]  +- MapElements 
example.Main$$$Lambda$23254/0x0008061da840@2a5a573b, interface 
org.apache.spark.sql.Row, [StructField(timestamp,TimestampType,true), 
StructField(value,LongType,true)], obj#21: example.ExampleB
[error] +- DeserializeToObject createexternalrow(staticinvoke(class 
org.apache.spark.sql.catalyst.util.DateTimeUtils$, ObjectType(class 
java.sql.Timestamp), toJavaTimestamp, timestamp#13, true, false), value#14L, 
StructField(timestamp,TimestampType,true), StructField(value,LongType,true)), 
obj#20: org.apache.spark.sql.Row
[error]+- GlobalLimit 1
[error]   +- LocalLimit 1
[error]  +- StreamingRelationV2 
org.apache.spark.sql.execution.streaming.sources.RateStreamProvider@46bcce3d, 
rate, 
org.apache.spark.sql.execution.streaming.sources.RateStreamTable@25a4c0a2, 
[rowsPerSecond=1], [timestamp#13, value#14L]
[error]
{noformat}
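
For reference, the analyzer does accept the equivalent stream-stream left outer 
join when it is expressed through the plain DataFrame join API, because the 
watermarked event-time columns then appear directly in the join condition. A 
minimal PySpark sketch, assuming rate sources and hypothetical column names 
that mirror the example above:

{code:python}
# Minimal sketch: DataFrame-level stream-stream left outer join that satisfies
# the analyzer's watermark + range-condition requirement. Column names (idA,
# whenA, idB, whenB) are hypothetical and mirror the joinWith example above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

left = (spark.readStream.format("rate").load()
        .select(F.col("value").alias("idA"), F.col("timestamp").alias("whenA"))
        .withWatermark("whenA", "1 second"))

right = (spark.readStream.format("rate").load()
         .select(F.col("value").alias("idB"), F.col("timestamp").alias("whenB"))
         .withWatermark("whenB", "1 second"))

# The watermarks stay visible on whenA/whenB, and the range condition below is
# the "appropriate range condition" the error message asks for.
joined = left.join(
    right,
    (F.col("idA") == F.col("idB"))
    & (F.col("whenB") >= F.col("whenA") - F.expr("INTERVAL 200 MILLISECONDS"))
    & (F.col("whenB") <= F.col("whenA") + F.expr("INTERVAL 200 MILLISECONDS")),
    "leftOuter",
)

# joined.writeStream.format("console").start() would run the query.
{code}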




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37395) Inline type hint files for files in python/pyspark/ml

2022-04-14 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37395.

Fix Version/s: 3.3.0
   Resolution: Fixed

> Inline type hint files for files in python/pyspark/ml
> -
>
> Key: SPARK-37395
> URL: https://issues.apache.org/jira/browse/SPARK-37395
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently there are type hint stub files ({{*.pyi}}) to show the expected 
> types for functions, but we can also take advantage of static type checking 
> within the functions by inlining the type hints.
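
For illustration, a minimal, hypothetical sketch of what the migration looks 
like in general (the class and method names below are made up; the real work 
touches the pyspark.ml modules listed in the sub-tasks):

{code:python}
# Before: the signature lived only in a stub file (example_model.pyi), so the
# function body itself was not type checked:
#
#     def set_threshold(self, value: float) -> "ExampleModel": ...
#
# After: the hints are inlined in the .py file, so tools such as mypy can check
# the implementation as well as the callers.
from typing import Optional


class ExampleModel:
    def __init__(self, threshold: Optional[float] = None) -> None:
        self.threshold = threshold

    def set_threshold(self, value: float) -> "ExampleModel":
        # The body is now covered by static type checking.
        self.threshold = value
        return self
{code}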



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38905) Upgrade ORC to 1.6.14

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38905:


Assignee: Apache Spark

> Upgrade ORC to 1.6.14
> -
>
> Key: SPARK-38905
> URL: https://issues.apache.org/jira/browse/SPARK-38905
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38905) Upgrade ORC to 1.6.14

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38905:


Assignee: (was: Apache Spark)

> Upgrade ORC to 1.6.14
> -
>
> Key: SPARK-38905
> URL: https://issues.apache.org/jira/browse/SPARK-38905
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38905) Upgrade ORC to 1.6.14

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522511#comment-17522511
 ] 

Apache Spark commented on SPARK-38905:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36204

> Upgrade ORC to 1.6.14
> -
>
> Key: SPARK-38905
> URL: https://issues.apache.org/jira/browse/SPARK-38905
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38905) Upgrade ORC to 1.6.14

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522512#comment-17522512
 ] 

Apache Spark commented on SPARK-38905:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36204

> Upgrade ORC to 1.6.14
> -
>
> Key: SPARK-38905
> URL: https://issues.apache.org/jira/browse/SPARK-38905
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38905) Upgrade ORC to 1.6.14

2022-04-14 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-38905:
-

 Summary: Upgrade ORC to 1.6.14
 Key: SPARK-38905
 URL: https://issues.apache.org/jira/browse/SPARK-38905
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.2.1
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522509#comment-17522509
 ] 

Apache Spark commented on SPARK-37405:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/36203

> Inline type hints for python/pyspark/ml/feature.py
> --
>
> Key: SPARK-37405
> URL: https://issues.apache.org/jira/browse/SPARK-37405
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/ml/feature.pyi to 
> python/pyspark/ml/feature.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py

2022-04-14 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37405.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35530
[https://github.com/apache/spark/pull/35530]

> Inline type hints for python/pyspark/ml/feature.py
> --
>
> Key: SPARK-37405
> URL: https://issues.apache.org/jira/browse/SPARK-37405
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/ml/feature.pyi to 
> python/pyspark/ml/feature.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38667) Optimizer generates error when using inner join along with sequence

2022-04-14 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522500#comment-17522500
 ] 

Bruce Robbins commented on SPARK-38667:
---

Strangely, I cannot reproduce on 3.2.1 (or master). Maybe I am missing some 
configuration?

My optimized plan doesn't contain the size check in the {{Join}}:
{noformat}
== Optimized Logical Plan ==
Generate explode(sequence(a2#5, b2#13, Some(1), Some(America/Vancouver))), 
false, [x#25]
+- Join Inner, ((a2#5 < b2#13) AND (a1#4 = b1#12))
   :- LocalRelation [a1#4, a2#5]
   +- LocalRelation [b1#12, b2#13]
{noformat}

> Optimizer generates error when using inner join along with sequence
> ---
>
> Key: SPARK-38667
> URL: https://issues.apache.org/jira/browse/SPARK-38667
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.1
>Reporter: Lars
>Priority: Major
>
> This issue occurred in a more complex scenario, so I've broken it down into a 
> simple case.
> *Steps to reproduce*: Execute the following example. The code should run 
> without errors, but instead a *java.lang.IllegalArgumentException: Illegal 
> sequence boundaries: 4 to 2 by 1* is thrown.
> {code:java}
> package com.example
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> object SparkIssue {
>     def main(args: Array[String]): Unit = {
>         val spark = SparkSession
>             .builder()
>             .master("local[*]")
>             .getOrCreate()
>         val dfA = spark
>             .createDataFrame(Seq((1, 1), (2, 4)))
>             .toDF("a1", "a2")
>         val dfB = spark
>             .createDataFrame(Seq((1, 5), (2, 2)))
>             .toDF("b1", "b2")
>         dfA.join(dfB, dfA("a1") === dfB("b1"), "inner")
>             .where(col("a2") < col("b2"))
>             .withColumn("x", explode(sequence(col("a2"), col("b2"), lit(1
>             .show()
>         spark.stop()
>     }
> }
> {code}
> When I look at the Optimized Logical Plan I can see that the Inner Join and 
> the Filter are brought together, with an additional check for an empty 
> Sequence. The exception is thrown because the Sequence check is executed 
> before the Filter.
> {code:java}
> == Parsed Logical Plan ==
> 'Project [a1#4, a2#5, b1#12, b2#13, explode(sequence('a2, 'b2, Some(1), 
> None)) AS x#24]
> +- Filter (a2#5 < b2#13)
>    +- Join Inner, (a1#4 = b1#12)
>       :- Project [_1#0 AS a1#4, _2#1 AS a2#5]
>       :  +- LocalRelation [_1#0, _2#1]
>       +- Project [_1#8 AS b1#12, _2#9 AS b2#13]
>          +- LocalRelation [_1#8, _2#9]
> == Analyzed Logical Plan ==
> a1: int, a2: int, b1: int, b2: int, x: int
> Project [a1#4, a2#5, b1#12, b2#13, x#25]
> +- Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), 
> false, [x#25]
>    +- Filter (a2#5 < b2#13)
>       +- Join Inner, (a1#4 = b1#12)
>          :- Project [_1#0 AS a1#4, _2#1 AS a2#5]
>          :  +- LocalRelation [_1#0, _2#1]
>          +- Project [_1#8 AS b1#12, _2#9 AS b2#13]
>             +- LocalRelation [_1#8, _2#9]
> == Optimized Logical Plan ==
> Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), false, 
> [x#25]
> +- Join Inner, (((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), 
> true) > 0) AND (a2#5 < b2#13)) AND (a1#4 = b1#12))
>    :- LocalRelation [a1#4, a2#5]
>    +- LocalRelation [b1#12, b2#13]
> == Physical Plan ==
> Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), [a1#4, 
> a2#5, b1#12, b2#13], false, [x#25]
> +- *(1) BroadcastHashJoin [a1#4], [b1#12], Inner, BuildRight, 
> ((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), true) > 0) AND 
> (a2#5 < b2#13)), false
>    :- *(1) LocalTableScan [a1#4, a2#5]
>    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> false] as bigint)),false), [id=#15]
>       +- LocalTableScan [b1#12, b2#13]
> {code}
>  
>  
>  
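
Until the optimizer behaviour is settled, one possible workaround (a sketch 
only, not an official fix) is to guard the sequence boundaries with when(), so 
the sequence expression is never evaluated for rows where a2 >= b2, even if the 
inferred size filter is pushed into the join:

{code:python}
# Sketch of a workaround for the reproducer above: wrap sequence() in when(),
# so its boundaries are only evaluated for rows that satisfy a2 < b2.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lit, sequence, when

spark = SparkSession.builder.master("local[*]").getOrCreate()

df_a = spark.createDataFrame([(1, 1), (2, 4)], ["a1", "a2"])
df_b = spark.createDataFrame([(1, 5), (2, 2)], ["b1", "b2"])

result = (
    df_a.join(df_b, df_a["a1"] == df_b["b1"], "inner")
    .where(col("a2") < col("b2"))
    .withColumn(
        "x",
        explode(when(col("a2") < col("b2"),
                     sequence(col("a2"), col("b2"), lit(1)))),
    )
)
result.show()

spark.stop()
{code}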



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38904) Low cost DataFrame schema swap util

2022-04-14 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-38904:
--
Description: 
This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *know* is compatible, I could 
do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, *unless* 
there's one already built into Spark that can generate query based on the 
schema/{{{}StructType{}}}

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in [https://github.com/ravwojdyla/spark-schema-utils]

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]

  was:
This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *know* is compatible, I could 
do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, *unless* 
there's one already built into Spark that can generate query based on the 
schema/{
Unknown macro: {{StructType}}

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in [https://github.com/ravwojdyla/spark-schema-utils]

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]


> Low cost DataFrame schema swap util
> ---
>
> Key: SPARK-38904
> URL: https://issues.apache.org/jira/browse/SPARK-38904
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
> Let's assume I have a pyspark DataFrame with certain schema, and I would like 
> to overwrite that schema with a new schema that I *know* is compatible, I 
> could do:
> {code:python}
> df: DataFrame
> new_schema = ...
> df.rdd.toDF(schema=new_schema)
> {code}
> Unfortunately this triggers computation as described in the link above. Is 
> there a way to do that at the metadata level (or lazy), without eagerly 
> triggering computation or conversions?
> Edit, note:
>  * the schema can be arbitrarily complicated (nested etc)
>  * new schema includes updates to description, nullability and additional 
> metadata (bonus points for updates to the type)
>  * I would like to avoid writing a custom query expression generator, 
> *unless* there's one already built into Spark that can generate query based 
> on the schema/{{{}StructType{}}}
> Copied from: 
> [https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]
> See POC of workaround/util in 
> [https://github.com/ravwojdyla/spark-schema-utils]
> Also posted in 
> [https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]
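
One metadata-level approach that is sometimes suggested (a sketch only, 
assuming the new schema keeps the same column names and order, changes types 
only in cast-compatible ways, and otherwise updates per-field metadata; it does 
not change nullability) is to rebuild the projection with casts and alias 
metadata instead of going through the RDD:

{code:python}
# Sketch: swap in a compatible schema without df.rdd.toDF(), by projecting each
# column with the target type and attaching the target field metadata. Assumes
# the new schema has the same column names/order as the existing one.
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
from pyspark.sql.types import StructType


def with_schema(df: DataFrame, new_schema: StructType) -> DataFrame:
    return df.select(
        *[
            col(f.name).cast(f.dataType).alias(f.name, metadata=f.metadata)
            for f in new_schema.fields
        ]
    )
{code}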



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38904) Low cost DataFrame schema swap util

2022-04-14 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-38904:
--
Description: 
This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *know* is compatible, I could 
do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, *unless* 
there's one already built into Spark that can generate query based on the 
schema/{
Unknown macro: {{StructType}}

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in [https://github.com/ravwojdyla/spark-schema-utils]

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]

  was:
This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *know* is compatible, I could 
do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, *unless* 
there's one already built into Spark that can generate query based on the 
schema/\{{StructType}}

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in [https://github.com/ravwojdyla/spark-schema-utils]

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]


> Low cost DataFrame schema swap util
> ---
>
> Key: SPARK-38904
> URL: https://issues.apache.org/jira/browse/SPARK-38904
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
> Let's assume I have a pyspark DataFrame with certain schema, and I would like 
> to overwrite that schema with a new schema that I *know* is compatible, I 
> could do:
> {code:python}
> df: DataFrame
> new_schema = ...
> df.rdd.toDF(schema=new_schema)
> {code}
> Unfortunately this triggers computation as described in the link above. Is 
> there a way to do that at the metadata level (or lazy), without eagerly 
> triggering computation or conversions?
> Edit, note:
>  * the schema can be arbitrarily complicated (nested etc)
>  * new schema includes updates to description, nullability and additional 
> metadata (bonus points for updates to the type)
>  * I would like to avoid writing a custom query expression generator, 
> *unless* there's one already built into Spark that can generate query based 
> on the schema/{
> Unknown macro: {{StructType}}
> Copied from: 
> [https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]
> See POC of workaround/util in 
> [https://github.com/ravwojdyla/spark-schema-utils]
> Also posted in 
> [https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38904) Low cost DataFrame schema swap util

2022-04-14 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-38904:
--
Description: 
This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *know* is compatible, I could 
do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, *unless* 
there's one already built into Spark that can generate query based on the 
schema/\{{StructType}}

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in [https://github.com/ravwojdyla/spark-schema-utils]

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]

  was:
This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *know* is compatible, I could 
do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, *unless* 
there's one already built into Spark that can generate query based on the 
schema/`StructType`

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in [https://github.com/ravwojdyla/spark-schema-utils]

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]


> Low cost DataFrame schema swap util
> ---
>
> Key: SPARK-38904
> URL: https://issues.apache.org/jira/browse/SPARK-38904
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
> Let's assume I have a pyspark DataFrame with certain schema, and I would like 
> to overwrite that schema with a new schema that I *know* is compatible, I 
> could do:
> {code:python}
> df: DataFrame
> new_schema = ...
> df.rdd.toDF(schema=new_schema)
> {code}
> Unfortunately this triggers computation as described in the link above. Is 
> there a way to do that at the metadata level (or lazy), without eagerly 
> triggering computation or conversions?
> Edit, note:
>  * the schema can be arbitrarily complicated (nested etc)
>  * new schema includes updates to description, nullability and additional 
> metadata (bonus points for updates to the type)
>  * I would like to avoid writing a custom query expression generator, 
> *unless* there's one already built into Spark that can generate query based 
> on the schema/\{{StructType}}
> Copied from: 
> [https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]
> See POC of workaround/util in 
> [https://github.com/ravwojdyla/spark-schema-utils]
> Also posted in 
> [https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38904) Low cost DataFrame schema swap util

2022-04-14 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-38904:
--
Description: 
This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *know* is compatible, I could 
do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, *unless* 
there's one already built into Spark that can generate query based on the 
schema/`StructType`

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in [https://github.com/ravwojdyla/spark-schema-utils]

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]

  was:
This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *{*}know{*}* is compatible, I 
could do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, *unless* 
there's one already built into Spark that can generate query based on the 
schema/`StructType`

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in https://github.com/ravwojdyla/spark-schema-utils

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]


> Low cost DataFrame schema swap util
> ---
>
> Key: SPARK-38904
> URL: https://issues.apache.org/jira/browse/SPARK-38904
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
> Let's assume I have a pyspark DataFrame with certain schema, and I would like 
> to overwrite that schema with a new schema that I *know* is compatible, I 
> could do:
> {code:python}
> df: DataFrame
> new_schema = ...
> df.rdd.toDF(schema=new_schema)
> {code}
> Unfortunately this triggers computation as described in the link above. Is 
> there a way to do that at the metadata level (or lazy), without eagerly 
> triggering computation or conversions?
> Edit, note:
>  * the schema can be arbitrarily complicated (nested etc)
>  * new schema includes updates to description, nullability and additional 
> metadata (bonus points for updates to the type)
>  * I would like to avoid writing a custom query expression generator, 
> *unless* there's one already built into Spark that can generate query based 
> on the schema/`StructType`
> Copied from: 
> [https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]
> See POC of workaround/util in 
> [https://github.com/ravwojdyla/spark-schema-utils]
> Also posted in 
> [https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38904) Low cost DataFrame schema swap util

2022-04-14 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-38904:
--
Description: 
This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *{*}know{*}* is compatible, I 
could do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, *unless* 
there's one already built into Spark that can generate query based on the 
schema/`StructType`

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in https://github.com/ravwojdyla/spark-schema-utils

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]

  was:
This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *{*}know{*}* is compatible, I 
could do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, 
*{*}unless{*}* there's one already built into Spark that can generate query 
based on the schema/`StructType`

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in https://github.com/ravwojdyla/spark-schema-utils

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]


> Low cost DataFrame schema swap util
> ---
>
> Key: SPARK-38904
> URL: https://issues.apache.org/jira/browse/SPARK-38904
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
> Let's assume I have a pyspark DataFrame with certain schema, and I would like 
> to overwrite that schema with a new schema that I *{*}know{*}* is compatible, 
> I could do:
> {code:python}
> df: DataFrame
> new_schema = ...
> df.rdd.toDF(schema=new_schema)
> {code}
> Unfortunately this triggers computation as described in the link above. Is 
> there a way to do that at the metadata level (or lazy), without eagerly 
> triggering computation or conversions?
> Edit, note:
>  * the schema can be arbitrarily complicated (nested etc)
>  * new schema includes updates to description, nullability and additional 
> metadata (bonus points for updates to the type)
>  * I would like to avoid writing a custom query expression generator, 
> *unless* there's one already built into Spark that can generate query based 
> on the schema/`StructType`
> Copied from: 
> [https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]
> See POC of workaround/util in https://github.com/ravwojdyla/spark-schema-utils
> Also posted in 
> [https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38904) Low cost DataFrame schema swap util

2022-04-14 Thread Rafal Wojdyla (Jira)
Rafal Wojdyla created SPARK-38904:
-

 Summary: Low cost DataFrame schema swap util
 Key: SPARK-38904
 URL: https://issues.apache.org/jira/browse/SPARK-38904
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.1
Reporter: Rafal Wojdyla


This question is related to [https://stackoverflow.com/a/37090151/1661491]. 
Let's assume I have a pyspark DataFrame with certain schema, and I would like 
to overwrite that schema with a new schema that I *{*}know{*}* is compatible, I 
could do:
{code:python}
df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)
{code}
Unfortunately this triggers computation as described in the link above. Is 
there a way to do that at the metadata level (or lazy), without eagerly 
triggering computation or conversions?

Edit, note:
 * the schema can be arbitrarily complicated (nested etc)
 * new schema includes updates to description, nullability and additional 
metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, 
*{*}unless{*}* there's one already built into Spark that can generate query 
based on the schema/`StructType`

Copied from: 
[https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in https://github.com/ravwojdyla/spark-schema-utils

Also posted in 
[https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38891:


Assignee: (was: Apache Spark)

> Skipping allocating vector for repetition & definition levels when possible
> ---
>
> Key: SPARK-38891
> URL: https://issues.apache.org/jira/browse/SPARK-38891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently the vectorized Parquet reader will allocate vectors for repetition 
> and definition levels in all cases. However in certain cases (e.g., when 
> reading primitive types) this is not necessary and should be avoided.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522458#comment-17522458
 ] 

Apache Spark commented on SPARK-38891:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/36202

> Skipping allocating vector for repetition & definition levels when possible
> ---
>
> Key: SPARK-38891
> URL: https://issues.apache.org/jira/browse/SPARK-38891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently the vectorized Parquet reader will allocate vectors for repetition 
> and definition levels in all cases. However in certain cases (e.g., when 
> reading primitive types) this is not necessary and should be avoided.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522460#comment-17522460
 ] 

Apache Spark commented on SPARK-38891:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/36202

> Skipping allocating vector for repetition & definition levels when possible
> ---
>
> Key: SPARK-38891
> URL: https://issues.apache.org/jira/browse/SPARK-38891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently the vectorized Parquet reader will allocate vectors for repetition 
> and definition levels in all cases. However in certain cases (e.g., when 
> reading primitive types) this is not necessary and should be avoided.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38891:


Assignee: Apache Spark

> Skipping allocating vector for repetition & definition levels when possible
> ---
>
> Key: SPARK-38891
> URL: https://issues.apache.org/jira/browse/SPARK-38891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Currently the vectorized Parquet reader will allocate vectors for repetition 
> and definition levels in all cases. However in certain cases (e.g., when 
> reading primitive types) this is not necessary and should be avoided.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522422#comment-17522422
 ] 

Apache Spark commented on SPARK-38903:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36186

> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
> 
>
> Key: SPARK-38903
> URL: https://issues.apache.org/jira/browse/SPARK-38903
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38903:


Assignee: (was: Apache Spark)

> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
> 
>
> Key: SPARK-38903
> URL: https://issues.apache.org/jira/browse/SPARK-38903
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522421#comment-17522421
 ] 

Apache Spark commented on SPARK-38903:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36186

> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
> 
>
> Key: SPARK-38903
> URL: https://issues.apache.org/jira/browse/SPARK-38903
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38903:


Assignee: Apache Spark

> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
> 
>
> Key: SPARK-38903
> URL: https://issues.apache.org/jira/browse/SPARK-38903
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`

2022-04-14 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38903:


 Summary: Implement `ignore_index` of `Series.sort_values` and 
`Series.sort_index`
 Key: SPARK-38903
 URL: https://issues.apache.org/jira/browse/SPARK-38903
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
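
The expected semantics mirror pandas: with ignore_index=True the sorted result 
is relabeled 0, 1, ..., n-1 instead of keeping the original index labels. A 
small pandas-on-Spark sketch of the intended usage, assuming the parameter is 
added with the same name as in pandas:

{code:python}
# Sketch of the intended behaviour, mirroring pandas semantics. Assumes the new
# parameter is named ignore_index, as it is in pandas.
import pyspark.pandas as ps

psser = ps.Series([3, 1, 2], index=[10, 20, 30])

# Keeps the original index labels after sorting by value.
print(psser.sort_values())

# Relabels the result with a default 0, 1, 2 index.
print(psser.sort_values(ignore_index=True))
{code}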



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38462) Use error classes in org.apache.spark.executor

2022-04-14 Thread huangtengfei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522331#comment-17522331
 ] 

huangtengfei commented on SPARK-38462:
--

I am working on this. Thanks [~bozhang]

> Use error classes in org.apache.spark.executor
> --
>
> Key: SPARK-38462
> URL: https://issues.apache.org/jira/browse/SPARK-38462
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs

2022-04-14 Thread Mark Khaitman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Khaitman updated SPARK-38881:
--
Description: 
This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
merged as part of Spark 3.0.0

This change is desirable as it further exposes the metricsLevel config 
parameter that was added for the Scala/Java Spark APIs when working with the 
Kinesis Streaming integration, and makes it available to the PySpark API as 
well.

This change passes all tests. Local testing was done against a development 
Kinesis stream in AWS to confirm that metrics are no longer reported to 
CloudWatch when MetricsLevel.NONE is specified at PySpark Kinesis streaming 
context creation, and that behaviour is unchanged when the MetricsLevel 
parameter is omitted, in which case the default of DETAILED applies and 
CloudWatch metrics appear again.

https://github.com/apache/spark/pull/36201

 

  was:
This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
merged as part of Spark 3.0.0

This change is desirable as it further exposes the metricsLevel config 
parameter that was added for the Scala/Java Spark APIs when working with the 
Kinesis Streaming integration, and makes it available to the PySpark API as 
well.

This change passes all tests, and local testing was done with a development 
Kinesis stream in AWS, in order to confirm that metrics were no longer being 
reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark 
Kinesis streaming context creation, and also worked as it does today when 
leaving the MetricsLevel parameter out, which would result in a default of 
DETAILED, with CloudWatch metrics appearing again.

https://github.com/apache/spark/pull/36166

 


> PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that 
> is already supported in the Scala/Java APIs
> ---
>
> Key: SPARK-38881
> URL: https://issues.apache.org/jira/browse/SPARK-38881
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Input/Output, PySpark
>Affects Versions: 3.2.1
>Reporter: Mark Khaitman
>Priority: Major
>
> This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
> merged as part of Spark 3.0.0
> This change is desirable as it further exposes the metricsLevel config 
> parameter that was added for the Scala/Java Spark APIs when working with the 
> Kinesis Streaming integration, and makes it available to the PySpark API as 
> well.
> This change passes all tests. Local testing was done against a development 
> Kinesis stream in AWS to confirm that metrics are no longer reported to 
> CloudWatch when MetricsLevel.NONE is specified at PySpark Kinesis streaming 
> context creation, and that behaviour is unchanged when the MetricsLevel 
> parameter is omitted, in which case the default of DETAILED applies and 
> CloudWatch metrics appear again.
> https://github.com/apache/spark/pull/36201
>  
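
For illustration only, a sketch of how this could look from PySpark once the 
parameter is exposed. The metricsLevel keyword and its accepted values below 
are assumptions based on the description and the Scala/Java API, not the merged 
signature; see the linked PR for the actual change:

{code:python}
# Hypothetical usage sketch: creating a PySpark Kinesis stream with CloudWatch
# metrics disabled. The metricsLevel keyword is an assumption; everything else
# is the existing KinesisUtils.createStream API. Requires the
# spark-streaming-kinesis-asl package on the classpath.
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import InitialPositionInStream, KinesisUtils

sc = SparkContext(appName="kinesis-metrics-example")
ssc = StreamingContext(sc, batchDuration=10)

stream = KinesisUtils.createStream(
    ssc,
    kinesisAppName="my-app",                 # hypothetical application name
    streamName="my-stream",                  # hypothetical Kinesis stream
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,
    storageLevel=StorageLevel.MEMORY_AND_DISK_2,
    metricsLevel="NONE",                     # assumed keyword/value, per the PR
)
{code}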



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522308#comment-17522308
 ] 

Apache Spark commented on SPARK-38881:
--

User 'mkman84' has created a pull request for this issue:
https://github.com/apache/spark/pull/36201

> PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that 
> is already supported in the Scala/Java APIs
> ---
>
> Key: SPARK-38881
> URL: https://issues.apache.org/jira/browse/SPARK-38881
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Input/Output, PySpark
>Affects Versions: 3.2.1
>Reporter: Mark Khaitman
>Priority: Major
>
> This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
> merged as part of Spark 3.0.0
> This change is desirable as it further exposes the metricsLevel config 
> parameter that was added for the Scala/Java Spark APIs when working with the 
> Kinesis Streaming integration, and makes it available to the PySpark API as 
> well.
> This change passes all tests, and local testing was done with a development 
> Kinesis stream in AWS, in order to confirm that metrics were no longer being 
> reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark 
> Kinesis streaming context creation, and also worked as it does today when 
> leaving the MetricsLevel parameter out, which would result in a default of 
> DETAILED, with CloudWatch metrics appearing again.
> https://github.com/apache/spark/pull/36166
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522307#comment-17522307
 ] 

Apache Spark commented on SPARK-38881:
--

User 'mkman84' has created a pull request for this issue:
https://github.com/apache/spark/pull/36201

> PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that 
> is already supported in the Scala/Java APIs
> ---
>
> Key: SPARK-38881
> URL: https://issues.apache.org/jira/browse/SPARK-38881
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Input/Output, PySpark
>Affects Versions: 3.2.1
>Reporter: Mark Khaitman
>Priority: Major
>
> This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
> merged as part of Spark 3.0.0
> This change is desirable as it further exposes the metricsLevel config 
> parameter that was added for the Scala/Java Spark APIs when working with the 
> Kinesis Streaming integration, and makes it available to the PySpark API as 
> well.
> This change passes all tests, and local testing was done with a development 
> Kinesis stream in AWS, in order to confirm that metrics were no longer being 
> reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark 
> Kinesis streaming context creation, and also worked as it does today when 
> leaving the MetricsLevel parameter out, which would result in a default of 
> DETAILED, with CloudWatch metrics appearing again.
> https://github.com/apache/spark/pull/36166
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`

2022-04-14 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522294#comment-17522294
 ] 

Yang Jie commented on SPARK-38888:
--

I have provided a [draft pr|https://github.com/apache/spark/pull/36200] to 
refactor the `LevelDB` usage in this scenario. That pr is pre-work for this 
jira; the refactoring will make it easier to extend the code to `RocksDB`.

[~dongjoon] If you have time, please check whether this refactoring work is 
acceptable, thanks ~

> Add `RocksDBProvider` similar to `LevelDBProvider`
> --
>
> Key: SPARK-38888
> URL: https://issues.apache.org/jira/browse/SPARK-38888
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and 
> `YarnShuffleService`; a corresponding `RocksDB` implementation should be added
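
As a rough illustration of what such a provider could look like, here is a
hedged Scala sketch that mirrors LevelDBProvider's open-or-create behaviour on
top of the rocksdbjni API. The object and method names are assumptions for this
sketch, not the actual Spark implementation.

{code:scala}
// Hypothetical sketch of a RocksDB-backed analogue of LevelDBProvider.
import java.io.File

import org.rocksdb.{Options, RocksDB, RocksDBException}

object RocksDBProviderSketch {
  RocksDB.loadLibrary()

  /** Open (or create) a RocksDB store at the given path, much as
   *  LevelDBProvider does for LevelDB when the shuffle service needs a
   *  local key-value store. */
  def initRocksDB(dbFile: File): RocksDB = {
    val options = new Options().setCreateIfMissing(true)
    try {
      RocksDB.open(options, dbFile.getAbsolutePath)
    } catch {
      case e: RocksDBException =>
        // The real provider would likely recover from a corrupted store
        // (delete and recreate); this sketch simply fails fast.
        throw new RuntimeException(s"Unable to open RocksDB at $dbFile", e)
    }
  }
}
{code}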



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38753) Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite

2022-04-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38753.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36196
[https://github.com/apache/spark/pull/36196]

> Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite
> -
>
> Key: SPARK-38753
> URL: https://issues.apache.org/jira/browse/SPARK-38753
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Move tests for the error class *WRITING_JOB_ABORTED* from DataSourceV2Suite 
> to QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38753) Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite

2022-04-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38753:


Assignee: Max Gekk

> Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite
> -
>
> Key: SPARK-38753
> URL: https://issues.apache.org/jira/browse/SPARK-38753
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Move tests for the error class *WRITING_JOB_ABORTED* from DataSourceV2Suite 
> to QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38722) Test the error class: CAST_CAUSES_OVERFLOW

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522263#comment-17522263
 ] 

Apache Spark commented on SPARK-38722:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36199

> Test the error class: CAST_CAUSES_OVERFLOW
> --
>
> Key: SPARK-38722
> URL: https://issues.apache.org/jira/browse/SPARK-38722
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *CAST_CAUSES_OVERFLOW* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def castingCauseOverflowError(t: Any, dataType: DataType): 
> ArithmeticException = {
> new SparkArithmeticException(errorClass = "CAST_CAUSES_OVERFLOW",
>   messageParameters = Array(t.toString, dataType.catalogString, 
> SQLConf.ANSI_ENABLED.key))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class
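
A hedged sketch of what such a test in QueryExecutionErrorsSuite might look
like (not a merged implementation). It assumes the suite's usual QueryTest +
SharedSparkSession helpers (sql, withSQLConf) are in scope; the exact SQL,
full expected message, and sqlState should be taken from the actual error
text and error-classes.json.

{code:scala}
import org.apache.spark.SparkArithmeticException
import org.apache.spark.sql.internal.SQLConf

test("CAST_CAUSES_OVERFLOW: casting a too-large BIGINT to INT under ANSI mode") {
  withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") {
    val e = intercept[SparkArithmeticException] {
      sql("SELECT CAST(9223372036854775807L AS INT)").collect()
    }
    assert(e.getErrorClass === "CAST_CAUSES_OVERFLOW")
    // Per the checklist above, also verify the entire message and the
    // sqlState declared for this class in error-classes.json.
    assert(e.getMessage.contains("overflow"))
  }
}
{code}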



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38722) Test the error class: CAST_CAUSES_OVERFLOW

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38722:


Assignee: Apache Spark

> Test the error class: CAST_CAUSES_OVERFLOW
> --
>
> Key: SPARK-38722
> URL: https://issues.apache.org/jira/browse/SPARK-38722
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *CAST_CAUSES_OVERFLOW* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def castingCauseOverflowError(t: Any, dataType: DataType): 
> ArithmeticException = {
> new SparkArithmeticException(errorClass = "CAST_CAUSES_OVERFLOW",
>   messageParameters = Array(t.toString, dataType.catalogString, 
> SQLConf.ANSI_ENABLED.key))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38722) Test the error class: CAST_CAUSES_OVERFLOW

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38722:


Assignee: (was: Apache Spark)

> Test the error class: CAST_CAUSES_OVERFLOW
> --
>
> Key: SPARK-38722
> URL: https://issues.apache.org/jira/browse/SPARK-38722
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *CAST_CAUSES_OVERFLOW* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def castingCauseOverflowError(t: Any, dataType: DataType): 
> ArithmeticException = {
> new SparkArithmeticException(errorClass = "CAST_CAUSES_OVERFLOW",
>   messageParameters = Array(t.toString, dataType.catalogString, 
> SQLConf.ANSI_ENABLED.key))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38902) cast as char/varchar result is string, not the expected data type

2022-04-14 Thread YuanGuanhu (Jira)
YuanGuanhu created SPARK-38902:
--

 Summary: cast as char/varchar result is string, not the expected 
data type
 Key: SPARK-38902
 URL: https://issues.apache.org/jira/browse/SPARK-38902
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.3.0
Reporter: YuanGuanhu


When casting a column to char/varchar type, the result is string, not the expected data type.
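
A minimal repro sketch of the reported behaviour, assuming an active
SparkSession named `spark`:

{code:scala}
// The cast itself succeeds, but the schema reports plain string,
// not the requested varchar(10).
val df = spark.sql("SELECT CAST('abc' AS VARCHAR(10)) AS c")
df.printSchema()
// column c is reported as string, not varchar(10)
{code}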



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38901) Support for more misc functions to push down to data source

2022-04-14 Thread Zhixiong Chen (Jira)
Zhixiong Chen created SPARK-38901:
-

 Summary: Support for more misc functions to push down to data 
source
 Key: SPARK-38901
 URL: https://issues.apache.org/jira/browse/SPARK-38901
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Zhixiong Chen






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38900) Support for more collection functions to push down to data source

2022-04-14 Thread Zhixiong Chen (Jira)
Zhixiong Chen created SPARK-38900:
-

 Summary: Support for more collection functions to push down to 
data source
 Key: SPARK-38900
 URL: https://issues.apache.org/jira/browse/SPARK-38900
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Zhixiong Chen






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38899) Support for more datetime functions to push down to data source

2022-04-14 Thread Zhixiong Chen (Jira)
Zhixiong Chen created SPARK-38899:
-

 Summary: Support for more datetime functions to push down to data 
source
 Key: SPARK-38899
 URL: https://issues.apache.org/jira/browse/SPARK-38899
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Zhixiong Chen






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38898) Failed to build python docker images due to .cache not found

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38898:


Assignee: Apache Spark

> Failed to build python docker images due to .cache not found
> 
>
> Key: SPARK-38898
> URL: https://issues.apache.org/jira/browse/SPARK-38898
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> rm: cannot remove '/root/.cache': No such file or directory
> Related:
> [https://github.com/volcano-sh/volcano/runs/6020604500?check_suite_focus=true#step:10:2381]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38898) Failed to build python docker images due to .cache not found

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522203#comment-17522203
 ] 

Apache Spark commented on SPARK-38898:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36198

> Failed to build python docker images due to .cache not found
> 
>
> Key: SPARK-38898
> URL: https://issues.apache.org/jira/browse/SPARK-38898
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> rm: cannot remove '/root/.cache': No such file or directory
> Related:
> [https://github.com/volcano-sh/volcano/runs/6020604500?check_suite_focus=true#step:10:2381]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38898) Failed to build python docker images due to .cache not found

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38898:


Assignee: (was: Apache Spark)

> Failed to build python docker images due to .cache not found
> 
>
> Key: SPARK-38898
> URL: https://issues.apache.org/jira/browse/SPARK-38898
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> rm: cannot remove '/root/.cache': No such file or directory
> Related:
> [https://github.com/volcano-sh/volcano/runs/6020604500?check_suite_focus=true#step:10:2381]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38898) Failed to build python docker images due to .cache not found

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522201#comment-17522201
 ] 

Apache Spark commented on SPARK-38898:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36198

> Failed to build python docker images due to .cache not found
> 
>
> Key: SPARK-38898
> URL: https://issues.apache.org/jira/browse/SPARK-38898
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> rm: cannot remove '/root/.cache': No such file or directory
> Related:
> [https://github.com/volcano-sh/volcano/runs/6020604500?check_suite_focus=true#step:10:2381]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38898) Failed to build python docker images due to .cache not found

2022-04-14 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-38898:
---

 Summary: Failed to build python docker images due to .cache not 
found
 Key: SPARK-38898
 URL: https://issues.apache.org/jira/browse/SPARK-38898
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Yikun Jiang


rm: cannot remove '/root/.cache': No such file or directory

Related:

[https://github.com/volcano-sh/volcano/runs/6020604500?check_suite_focus=true#step:10:2381]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38897) Support for more string functions to push down to data source

2022-04-14 Thread Zhixiong Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhixiong Chen updated SPARK-38897:
--
Parent: SPARK-38852
Issue Type: Sub-task  (was: Improvement)

> Support for more string functions to push down to data source
> -
>
> Key: SPARK-38897
> URL: https://issues.apache.org/jira/browse/SPARK-38897
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Zhixiong Chen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38897) Support for more string functions to push down to data source

2022-04-14 Thread Zhixiong Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhixiong Chen updated SPARK-38897:
--
Summary: Support for more string functions to push down to data source  
(was: More string functions support to pushdown to data source)

> Support for more string functions to push down to data source
> -
>
> Key: SPARK-38897
> URL: https://issues.apache.org/jira/browse/SPARK-38897
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Zhixiong Chen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38897) More string functions support to pushdown to data source

2022-04-14 Thread Zhixiong Chen (Jira)
Zhixiong Chen created SPARK-38897:
-

 Summary: More string functions support to pushdown to data source
 Key: SPARK-38897
 URL: https://issues.apache.org/jira/browse/SPARK-38897
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Zhixiong Chen






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38438) Can't update spark.jars.packages on existing global/default context

2022-04-14 Thread Rafal Wojdyla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522176#comment-17522176
 ] 

Rafal Wojdyla commented on SPARK-38438:
---

For posterity see the context in 
https://lists.apache.org/thread/42rsmbyqc5p1zfv956rwz4wk9lhj4s6w. 

[~srowen] thanks for the comment, feel free to close this issue if you believe 
there's no chance of getting this one in.

> Can't update spark.jars.packages on existing global/default context
> ---
>
> Key: SPARK-38438
> URL: https://issues.apache.org/jira/browse/SPARK-38438
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: py: 3.9
> spark: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Minor
>
> Reproduction:
> {code:python}
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> # later on we want to update jars.packages, here's e.g. spark-hats
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # line below returns None, the config was not propagated:
> s._sc._conf.get("spark.jars.packages")
> {code}
> Stopping the context doesn't help; in fact it's even more confusing, because 
> the configuration is updated but doesn't take effect:
> {code:python}
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> s.stop()
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # now this line returns 'za.co.absa:spark-hats_2.12:0.2.2', but the context
> # doesn't download the jar/package, as it would if there was no global context
> # thus the extra package is unusable. It's not downloaded, or added to the
> # classpath.
> s._sc._conf.get("spark.jars.packages")
> {code}
> One workaround is to stop the context AND kill the JVM gateway, which seems 
> to be a kind of hard reset:
> {code:python}
> from pyspark import SparkContext
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> # Hard reset:
> s.stop()
> s._sc._gateway.shutdown()
> s._sc._gateway.proc.stdin.close()
> SparkContext._gateway = None
> SparkContext._jvm = None
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # Now we are guaranteed there's a new spark session, and packages
> # are downloaded, added to the classpath etc.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38747) Move the tests for `PARSE_SYNTAX_ERROR` to QueryParsingErrorsSuite

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38747:


Assignee: Apache Spark

> Move the tests for `PARSE_SYNTAX_ERROR` to QueryParsingErrorsSuite
> --
>
> Key: SPARK-38747
> URL: https://issues.apache.org/jira/browse/SPARK-38747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Move tests for the error class *PARSE_SYNTAX_ERROR* from ErrorParserSuite to 
> QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38747) Move the tests for `PARSE_SYNTAX_ERROR` to QueryParsingErrorsSuite

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38747:


Assignee: (was: Apache Spark)

> Move the tests for `PARSE_SYNTAX_ERROR` to QueryParsingErrorsSuite
> --
>
> Key: SPARK-38747
> URL: https://issues.apache.org/jira/browse/SPARK-38747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Move tests for the error class *PARSE_SYNTAX_ERROR* from ErrorParserSuite to 
> QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38747) Move the tests for `PARSE_SYNTAX_ERROR` to QueryParsingErrorsSuite

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522169#comment-17522169
 ] 

Apache Spark commented on SPARK-38747:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36197

> Move the tests for `PARSE_SYNTAX_ERROR` to QueryParsingErrorsSuite
> --
>
> Key: SPARK-38747
> URL: https://issues.apache.org/jira/browse/SPARK-38747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Move tests for the error class *PARSE_SYNTAX_ERROR* from ErrorParserSuite to 
> QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38894) Exclude pyspark.cloudpickle in test coverage report

2022-04-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38894:


Assignee: Hyukjin Kwon

> Exclude pyspark.cloudpickle in test coverage report
> ---
>
> Key: SPARK-38894
> URL: https://issues.apache.org/jira/browse/SPARK-38894
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> pyspark.cloudpickle is actually a verbatim copy of cloudpickle. We don't need 
> to check its test coverage separately here.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38894) Exclude pyspark.cloudpickle in test coverage report

2022-04-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38894.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36191
[https://github.com/apache/spark/pull/36191]

> Exclude pyspark.cloudpickle in test coverage report
> ---
>
> Key: SPARK-38894
> URL: https://issues.apache.org/jira/browse/SPARK-38894
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> pyspark.cloudpickle is actually a verbatim copy of cloudpickle. We don't need 
> to check its test coverage separately here.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38893) Test SourceProgress in PySpark

2022-04-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38893:


Assignee: Hyukjin Kwon

> Test SourceProgress in PySpark
> --
>
> Key: SPARK-38893
> URL: https://issues.apache.org/jira/browse/SPARK-38893
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> There was a mistake and we're not testing SourceProgress (see 
> https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/sql/streaming/listener.py)
>  We should probably test it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38893) Test SourceProgress in PySpark

2022-04-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38893.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36190
[https://github.com/apache/spark/pull/36190]

> Test SourceProgress in PySpark
> --
>
> Key: SPARK-38893
> URL: https://issues.apache.org/jira/browse/SPARK-38893
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> There was a mistake and we're not testing SourceProgress (see 
> https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/sql/streaming/listener.py)
>  We should probably test it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38753) Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522145#comment-17522145
 ] 

Apache Spark commented on SPARK-38753:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/36196

> Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite
> -
>
> Key: SPARK-38753
> URL: https://issues.apache.org/jira/browse/SPARK-38753
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Move tests for the error class *WRITING_JOB_ABORTED* from DataSourceV2Suite 
> to QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38753) Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522144#comment-17522144
 ] 

Apache Spark commented on SPARK-38753:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/36196

> Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite
> -
>
> Key: SPARK-38753
> URL: https://issues.apache.org/jira/browse/SPARK-38753
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Move tests for the error class *WRITING_JOB_ABORTED* from DataSourceV2Suite 
> to QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38753) Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38753:


Assignee: Apache Spark

> Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite
> -
>
> Key: SPARK-38753
> URL: https://issues.apache.org/jira/browse/SPARK-38753
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Move tests for the error class *WRITING_JOB_ABORTED* from DataSourceV2Suite 
> to QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38753) Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38753:


Assignee: (was: Apache Spark)

> Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite
> -
>
> Key: SPARK-38753
> URL: https://issues.apache.org/jira/browse/SPARK-38753
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Move tests for the error class *WRITING_JOB_ABORTED* from DataSourceV2Suite 
> to QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38896) Use tryWithResource to recycle KVStoreIterator and remove finalize() from LevelDB/RocksDB

2022-04-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38896:


Assignee: Apache Spark

> Use tryWithResource to recycle KVStoreIterator and remove finalize() from 
> LevelDB/RocksDB
> ---
>
> Key: SPARK-38896
> URL: https://issues.apache.org/jira/browse/SPARK-38896
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Use `Utils.tryWithResource` to recycle every `KVStoreIterator` opened by 
> RocksDB/LevelDB and remove the `finalize()` method from LevelDB/RocksDB
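
As an illustration of the pattern (not the actual patch), a hedged Scala
sketch follows: the iterator is closed deterministically via
`Utils.tryWithResource` instead of relying on `finalize()`. Here `db` is
assumed to be an already-opened `KVStore` backed by LevelDB or RocksDB, and
`countEntries` is a made-up helper for the example.

{code:scala}
import org.apache.spark.util.Utils
import org.apache.spark.util.kvstore.KVStore

def countEntries[T](db: KVStore, klass: Class[T]): Long = {
  // closeableIterator() returns a KVStoreIterator, which is java.io.Closeable,
  // so tryWithResource guarantees close() even if the body throws.
  Utils.tryWithResource(db.view(klass).closeableIterator()) { it =>
    var n = 0L
    while (it.hasNext) { it.next(); n += 1 }
    n
  }
}
{code}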



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38896) Use tryWithResource to recycle KVStoreIterator and remove finalize() from LevelDB/RocksDB

2022-04-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522123#comment-17522123
 ] 

Apache Spark commented on SPARK-38896:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36195

> Use tryWithResource to recycle KVStoreIterator and remove finalize() from 
> LevelDB/RocksDB
> ---
>
> Key: SPARK-38896
> URL: https://issues.apache.org/jira/browse/SPARK-38896
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Use `Utils.tryWithResource` to recycle every `KVStoreIterator` opened by 
> RocksDB/LevelDB and remove the `finalize()` method from LevelDB/RocksDB



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


