[jira] [Commented] (SPARK-37265) Support Java 17 in `dev/test-dependencies.sh`

2021-11-09 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441523#comment-17441523
 ] 

Dongjoon Hyun commented on SPARK-37265:
---

Shall we close this, [~sarutak]?

> Support Java 17 in `dev/test-dependencies.sh`
> -
>
> Key: SPARK-37265
> URL: https://issues.apache.org/jira/browse/SPARK-37265
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37264) [SPARK-37264][BUILD] Exclude hadoop-client-api transitive dependency from orc-core

2021-11-09 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-37264.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/34541

> [SPARK-37264][BUILD] Exclude hadoop-client-api transitive dependency from 
> orc-core
> --
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.3.0
>
>
> Like hadoop-common and hadoop-hdfs, this PR proposes to exclude the 
> hadoop-client-api transitive dependency from orc-core.
> Why are the changes needed?
> Because Apache Hadoop 2.7 doesn't work on Java 17, Apache ORC depends on 
> Hadoop 3.3.1.
> This causes a test-dependencies.sh failure on Java 17. As a result, 
> run-tests.py also fails.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37264) [SPARK-37264][BUILD] Exclude hadoop-client-api transitive dependency from orc-core

2021-11-09 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37264:
---
Description: 
Like hadoop-common and hadoop-hdfs, this PR proposes to exclude the 
hadoop-client-api transitive dependency from orc-core.
Why are the changes needed?

Because Apache Hadoop 2.7 doesn't work on Java 17, Apache ORC depends on 
Hadoop 3.3.1.
This causes a test-dependencies.sh failure on Java 17. As a result, 
run-tests.py also fails.
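
As a rough illustration of the exclusion described above (an assumption-laden 
sketch: Spark's build is Maven, so the sbt-style coordinates and version below 
are illustrative only, not the actual change in the PR):

{code:java}
// Hypothetical sbt sketch: depend on orc-core but drop its hadoop-client-api
// transitive dependency, mirroring the exclusions already applied to
// hadoop-common and hadoop-hdfs. Coordinates and version are illustrative.
libraryDependencies += ("org.apache.orc" % "orc-core" % "1.7.1")
  .exclude("org.apache.hadoop", "hadoop-client-api")
{code}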

  was:
In the current master, `run-tests.py` fails on Java 17 because 
`test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
dependency on hadoop-client-api:3.3.1 only for Java 17.
Hadoop 2.7 doesn't support Java 17, so let's cut this transitive dependency.


> [SPARK-37264][BUILD] Exclude hadoop-client-api transitive dependency from 
> orc-core
> --
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Like hadoop-common and hadoop-hdfs, this PR proposes to exclude the 
> hadoop-client-api transitive dependency from orc-core.
> Why are the changes needed?
> Because Apache Hadoop 2.7 doesn't work on Java 17, Apache ORC depends on 
> Hadoop 3.3.1.
> This causes a test-dependencies.sh failure on Java 17. As a result, 
> run-tests.py also fails.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37264) Cut the transitive dependency on hadoop-client-api which orc-shims depends on only for Java 17 with hadoop-2.7

2021-11-09 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37264:
---
Description: 
In the current master, `run-tests.py` fails on Java 17 because 
`test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
dependency on hadoop-client-api:3.3.1 only for Java 17.
Hadoop 2.7 doesn't support Java 17, so let's cut this transitive dependency.

  was:
In the current master, `run-tests.py` fails on Java 17 because 
`test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
dependency on hadoop-client-api:3.3.1 only for Java 17.
Currently, we don't maintain the dependency manifests for Java 17 yet, so let's 
skip it temporarily.


> Cut the transitive dependency on hadoop-client-api which orc-shims depends on 
> only for Java 17 with hadoop-2.7
> --
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, `run-tests.py` fails on Java 17 because 
> `test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
> dependency on hadoop-client-api:3.3.1 only for Java 17.
> Hadoop 2.7 doesn't support Java 17, so let's cut this transitive dependency.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37264) [SPARK-37264][BUILD] Exclude hadoop-client-api transitive dependency from orc-core

2021-11-09 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37264:
---
Summary: [SPARK-37264][BUILD] Exclude hadoop-client-api transitive 
dependency from orc-core  (was: Cut the transitive dependency on 
hadoop-client-api which orc-shims depends on only for Java 17 with hadoop-2.7)

> [SPARK-37264][BUILD] Exclude hadoop-client-api transitive dependency from 
> orc-core
> --
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, `run-tests.py` fails on Java 17 because 
> `test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
> dependency on hadoop-client-api:3.3.1 only for Java 17.
> Hadoop 2.7 doesn't support Java 17, so let's cut this transitive dependency.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37264) Cut the transitive dependency on hadoop-client-api which orc-shims depends on only for Java 17 with hadoop-2.7

2021-11-09 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37264:
---
Summary: Cut the transitive dependency on hadoop-client-api which orc-shims 
depends on only for Java 17 with hadoop-2.7  (was: Skip dependency testing on 
Java 17 temporarily)

> Cut the transitive dependency on hadoop-client-api which orc-shims depends on 
> only for Java 17 with hadoop-2.7
> --
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, `run-tests.py` fails on Java 17 because 
> `test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
> dependency on hadoop-client-api:3.3.1 only for Java 17.
> Currently, we don't maintain the dependency manifests for Java 17 yet, so 
> let's skip it temporarily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36575) Executor lost may cause spark stage to hang

2021-11-09 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi reassigned SPARK-36575:


Assignee: hujiahua

> Executor lost may cause spark stage to hang
> ---
>
> Key: SPARK-36575
> URL: https://issues.apache.org/jira/browse/SPARK-36575
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.3
>Reporter: hujiahua
>Assignee: hujiahua
>Priority: Major
> Fix For: 3.3.0
>
>
> When an executor finishes a task of some stage, the driver receives a 
> `StatusUpdate` event to handle it. At the same time the driver may find that 
> the executor's heartbeat has timed out, so the driver also needs to handle an 
> ExecutorLost event simultaneously. There is a race condition here which can 
> leave the task never rescheduled and the stage hung.
>  The problem is that `TaskResultGetter.enqueueSuccessfulTask` uses an 
> asynchronous thread to handle the successful task, which means the 
> synchronized lock of `TaskSchedulerImpl` is released prematurely midway 
> through 
> [https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61].
>  So `TaskSchedulerImpl` may handle executorLost first, and the asynchronous 
> thread then goes on to handle the successful task. This leaves 
> `TaskSetManager.successful` and `TaskSetManager.tasksSuccessful` with wrong 
> results.
> Then `HeartbeatReceiver.expireDeadHosts` executed `killAndReplaceExecutor`, 
> which made `TaskSchedulerImpl.executorLost` run twice. `copiesRunning(index) 
> -= 1` is processed in `executorLost`, so two `executorLost` calls drive 
> `copiesRunning(index)` to -1, which leaves the stage hanging.
> Related log from when the issue occurred:
>  21/08/05 02:58:14,784 INFO [dispatcher-event-loop-8] TaskSetManager: 
> Starting task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 
> 366724, partition 4004, ANY, 7994 bytes)
>  21/08/05 03:00:24,126 ERROR [dispatcher-event-loop-4] TaskSchedulerImpl: 
> Lost executor 366724 on 10.109.89.3: Executor heartbeat timed out after 
> 140830 ms
>  21/08/05 03:00:24,218 WARN [dispatcher-event-loop-4] TaskSetManager: Lost 
> task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724): 
> ExecutorLostFailure (executor 366724 exited caused by one of the running 
> tasks) Reason: Executor heartbeat timed out after 140830 ms
>  21/08/05 03:00:24,542 INFO [task-result-getter-2] TaskSetManager: Finished 
> task 4004.0 in stage 1328625.0 (TID 347212402) in 129758 ms on 10.109.89.3 
> (executor 366724) (3047/5400)
> 21/08/05 03:00:34,621 INFO [dispatcher-event-loop-8] TaskSchedulerImpl: 
> Executor 366724 on 10.109.89.3 killed by driver.
>  21/08/05 03:00:34,771 INFO [spark-listener-group-executorManagement] 
> ExecutorMonitor: Executor 366724 removed (new total is 793)
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417416)
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(366724, 
> 10.109.89.3, 43402, None)
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417416)
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417473)
>  21/08/05 03:00:44,584 INFO [dispatcher-event-loop-15] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417473)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34230) Let AQE determine the right parallelism in DistributionAndOrderingUtils

2021-11-09 Thread Anton Okolnychyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441509#comment-17441509
 ] 

Anton Okolnychyi commented on SPARK-34230:
--

We should also double check what impact SPARK-36315 had on this.

> Let AQE determine the right parallelism in DistributionAndOrderingUtils
> ---
>
> Key: SPARK-34230
> URL: https://issues.apache.org/jira/browse/SPARK-34230
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We should let AQE determine the right parallelism in 
> \{{DistributionAndOrderingUtils}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

2021-11-09 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441507#comment-17441507
 ] 

angerszhu commented on SPARK-37201:
---

[~Kotlov] For case two, it is caused by SchemaPruning being executed before 
PredicatePushDown.

> Spark SQL reads unnecessary nested fields (filter after explode)
> 
>
> Key: SPARK-37201
> URL: https://issues.apache.org/jira/browse/SPARK-37201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sergey Kotlov
>Priority: Major
>
> In this example, reading unnecessary nested fields still happens.
> Data preparation:
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, array: Seq[String])
> Seq(
>   Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
> ).toDF().write.mode("overwrite").saveAsTable("table")
> {code}
>  v2 and v3 columns aren't needed here, but still exist in the physical plan.
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>  
> == Physical Plan ==
> ... ReadSchema: 
> struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>
> {code}
> If you just remove _filter_ or move _explode_ to the second _select_, 
> everything is fine:
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   //.filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> spark.table("table")
>   .select($"struct.v1", $"array")
>   .select($"v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> {code}
>  
> *Yet another example: left_anti join after double select:*
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, field1: String, field2: String)
> Seq(
>   Event(Struct("v1","v2","v3"), "fld1", "fld2")
> ).toDF().write.mode("overwrite").saveAsTable("table")
> val joinDf = Seq("id1").toDF("id")
> spark.table("table")
>   .select("struct", "field1")
>   .select($"struct.v1", $"field1")
>   .join(joinDf, $"field1" === joinDf("id"), "left_anti")
>   .explain(true)
> // ===> ReadSchema: 
> struct<struct:struct<v1:string,v2:string,v3:string>,field1:string>
> {code}
> Instead of the first select, other kinds of manipulation of the original df, 
> for example {color:#00875a}.withColumn("field3", lit("f3")){color} or 
> {color:#00875a}.drop("field2"){color}, will also lead to reading unnecessary 
> nested fields from _struct_.
> But if you just remove the first select or change the type of join, reading 
> nested fields will be correct:
> {code:java}
> // .select("struct", "field1")
> ===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
> .join(joinDf, $"field1" === joinDf("id"), "left")
> ===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
> {code}
> PS: The first select might look strange in the context of this example, but 
> in a real system it might be part of a common API that other parts of the 
> system use with their own expressions on top of it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37264) Skip dependency testing on Java 17 temporarily

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37264:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Skip dependency testing on Java 17 temporarily
> --
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, `run-tests.py` fails on Java 17 because 
> `test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
> dependency on hadoop-client-api:3.3.1 only for Java 17.
> Currently, we don't maintain the dependency manifests for Java 17 yet, so 
> let's skip it temporarily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37264) Skip dependency testing on Java 17 temporarily

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441505#comment-17441505
 ] 

Apache Spark commented on SPARK-37264:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34541

> Skip dependency testing on Java 17 temporarily
> --
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, `run-tests.py` fails on Java 17 because 
> `test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
> dependency on hadoop-client-api:3.3.1 only for Java 17.
> Currently, we don't maintain the dependency manifests for Java 17 yet, so 
> let's skip it temporarily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37264) Skip dependency testing on Java 17 temporarily

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441504#comment-17441504
 ] 

Apache Spark commented on SPARK-37264:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34541

> Skip dependency testing on Java 17 temporarily
> --
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, `run-tests.py` fails on Java 17 because 
> `test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
> dependency on hadoop-client-api:3.3.1 only for Java 17.
> Currently, we don't maintain the dependency manifests for Java 17 yet, so 
> let's skip it temporarily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37264) Skip dependency testing on Java 17 temporarily

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37264:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Skip dependency testing on Java 17 temporarily
> --
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In the current master, `run-tests.py` fails on Java 17 because 
> `test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
> dependency on hadoop-client-api:3.3.1 only for Java 17.
> Currently, we don't maintain the dependency manifests for Java 17 yet, so 
> let's skip it temporarily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37264) Skip dependency testing on Java 17 temporarily

2021-11-09 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37264:
---
Description: 
In the current master, `run-tests.py` fails on Java 17 because 
`test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
dependency on hadoop-client-api:3.3.1 only for Java 17.
Currently, we don't maintain the dependency manifests for Java 17 yet, so let's 
skip it temporarily.

  was:
In the current master, test-dependencies.sh fails on Java 17 because 
orc-shims:1.7.1 has a compile dependency on hadoop-client-api:3.3.1 only for 
Java 17.

Currently, we don't maintain the dependency manifests for Java 17 yet so let's 
skip it temporarily.


> Skip dependency testing on Java 17 temporarily
> --
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, `run-tests.py` fails on Java 17 because 
> `test-dependencies.sh` fails. The cause is that orc-shims:1.7.1 has a compile 
> dependency on hadoop-client-api:3.3.1 only for Java 17.
> Currently, we don't maintain the dependency manifests for Java 17 yet, so 
> let's skip it temporarily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37265) Support Java 17 in `dev/test-dependencies.sh`

2021-11-09 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-37265:
--

 Summary: Support Java 17 in `dev/test-dependencies.sh`
 Key: SPARK-37265
 URL: https://issues.apache.org/jira/browse/SPARK-37265
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.3.0
Reporter: Kousuke Saruta






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37264) Skip dependency testing on Java 17 temporarily

2021-11-09 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-37264:
--

 Summary: Skip dependency testing on Java 17 temporarily
 Key: SPARK-37264
 URL: https://issues.apache.org/jira/browse/SPARK-37264
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In the current master, test-dependencies.sh fails on Java 17 because 
orc-shims:1.7.1 has a compile dependency on hadoop-client-api:3.3.1 only for 
Java 17.

Currently, we don't maintain the dependency manifests for Java 17 yet so let's 
skip it temporarily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid

2021-11-09 Thread dch nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441494#comment-17441494
 ] 

dch nguyen edited comment on SPARK-37260 at 11/10/21, 4:03 AM:
---

ping [~hyukjin.kwon] , is this issue resolved by 
[#34475|https://github.com/apache/spark/pull/34475]?


was (Author: dchvn):
[~hyukjin.kwon] , is this issue resolved by 
[#34475|https://github.com/apache/spark/pull/34475]?

> PYSPARK Arrow 3.2.0 docs link invalid
> -
>
> Key: SPARK-37260
> URL: https://issues.apache.org/jira/browse/SPARK-37260
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Thomas Graves
>Priority: Major
>
> [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html]
> links to:
> [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html]
> which links to:
> [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst]
> But that is an invalid link.
> I assume it's supposed to point to:
> https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid

2021-11-09 Thread dch nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441494#comment-17441494
 ] 

dch nguyen commented on SPARK-37260:


[~hyukjin.kwon] , is this issue resolved by 
[#34475|https://github.com/apache/spark/pull/34475]?

> PYSPARK Arrow 3.2.0 docs link invalid
> -
>
> Key: SPARK-37260
> URL: https://issues.apache.org/jira/browse/SPARK-37260
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Thomas Graves
>Priority: Major
>
> [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html]
> links to:
> [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html]
> which links to:
> [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst]
> But that is an invalid link.
> I assume it's supposed to point to:
> https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36575) Executor lost may cause spark stage to hang

2021-11-09 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi reassigned SPARK-36575:


Assignee: (was: wuyi)

> Executor lost may cause spark stage to hang
> ---
>
> Key: SPARK-36575
> URL: https://issues.apache.org/jira/browse/SPARK-36575
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.3
>Reporter: hujiahua
>Priority: Major
> Fix For: 3.3.0
>
>
> When an executor finishes a task of some stage, the driver receives a 
> `StatusUpdate` event to handle it. At the same time the driver may find that 
> the executor's heartbeat has timed out, so the driver also needs to handle an 
> ExecutorLost event simultaneously. There is a race condition here which can 
> leave the task never rescheduled and the stage hung.
>  The problem is that `TaskResultGetter.enqueueSuccessfulTask` uses an 
> asynchronous thread to handle the successful task, which means the 
> synchronized lock of `TaskSchedulerImpl` is released prematurely midway 
> through 
> [https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61].
>  So `TaskSchedulerImpl` may handle executorLost first, and the asynchronous 
> thread then goes on to handle the successful task. This leaves 
> `TaskSetManager.successful` and `TaskSetManager.tasksSuccessful` with wrong 
> results.
> Then `HeartbeatReceiver.expireDeadHosts` executed `killAndReplaceExecutor`, 
> which made `TaskSchedulerImpl.executorLost` run twice. `copiesRunning(index) 
> -= 1` is processed in `executorLost`, so two `executorLost` calls drive 
> `copiesRunning(index)` to -1, which leaves the stage hanging.
> Related log from when the issue occurred:
>  21/08/05 02:58:14,784 INFO [dispatcher-event-loop-8] TaskSetManager: 
> Starting task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 
> 366724, partition 4004, ANY, 7994 bytes)
>  21/08/05 03:00:24,126 ERROR [dispatcher-event-loop-4] TaskSchedulerImpl: 
> Lost executor 366724 on 10.109.89.3: Executor heartbeat timed out after 
> 140830 ms
>  21/08/05 03:00:24,218 WARN [dispatcher-event-loop-4] TaskSetManager: Lost 
> task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724): 
> ExecutorLostFailure (executor 366724 exited caused by one of the running 
> tasks) Reason: Executor heartbeat timed out after 140830 ms
>  21/08/05 03:00:24,542 INFO [task-result-getter-2] TaskSetManager: Finished 
> task 4004.0 in stage 1328625.0 (TID 347212402) in 129758 ms on 10.109.89.3 
> (executor 366724) (3047/5400)
> 21/08/05 03:00:34,621 INFO [dispatcher-event-loop-8] TaskSchedulerImpl: 
> Executor 366724 on 10.109.89.3 killed by driver.
>  21/08/05 03:00:34,771 INFO [spark-listener-group-executorManagement] 
> ExecutorMonitor: Executor 366724 removed (new total is 793)
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417416)
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(366724, 
> 10.109.89.3, 43402, None)
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417416)
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417473)
>  21/08/05 03:00:44,584 INFO [dispatcher-event-loop-15] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417473)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36575) Executor lost may cause spark stage to hang

2021-11-09 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441486#comment-17441486
 ] 

wuyi commented on SPARK-36575:
--

Issue resolved by https://github.com/apache/spark/pull/33872.

To clarify, the change isn't really a fix for the hang issue but an 
improvement. The hang issue doesn't exist in the master branch, only in 2.3 
(confirmed).
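
As a rough illustration of the double executorLost handling described in the 
quoted report below, here is a hypothetical, self-contained Scala sketch (the 
names mirror TaskSetManager, but this is not Spark's actual code):

{code:java}
// Hypothetical sketch: executorLost runs twice for the same task index, so the
// running-copy counter goes negative and the task is never rescheduled.
object DoubleExecutorLostSketch extends App {
  val copiesRunning = scala.collection.mutable.ArrayBuffer(1) // task 0: one running copy

  def executorLost(index: Int): Unit =
    copiesRunning(index) -= 1 // same effect as `copiesRunning(index) -= 1` in TaskSetManager

  executorLost(0) // heartbeat-timeout path
  executorLost(0) // killAndReplaceExecutor path triggers executorLost again
  // copiesRunning(0) is now -1, so a condition expecting it to drop to exactly 0
  // before resubmitting the task never fires, and the stage appears to hang.
  println(copiesRunning(0)) // prints -1
}
{code}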

> Executor lost may cause spark stage to hang
> ---
>
> Key: SPARK-36575
> URL: https://issues.apache.org/jira/browse/SPARK-36575
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.3
>Reporter: hujiahua
>Assignee: wuyi
>Priority: Major
>
> When an executor finishes a task of some stage, the driver receives a 
> `StatusUpdate` event to handle it. At the same time the driver may find that 
> the executor's heartbeat has timed out, so the driver also needs to handle an 
> ExecutorLost event simultaneously. There is a race condition here which can 
> leave the task never rescheduled and the stage hung.
>  The problem is that `TaskResultGetter.enqueueSuccessfulTask` uses an 
> asynchronous thread to handle the successful task, which means the 
> synchronized lock of `TaskSchedulerImpl` is released prematurely midway 
> through 
> [https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61].
>  So `TaskSchedulerImpl` may handle executorLost first, and the asynchronous 
> thread then goes on to handle the successful task. This leaves 
> `TaskSetManager.successful` and `TaskSetManager.tasksSuccessful` with wrong 
> results.
> Then `HeartbeatReceiver.expireDeadHosts` executed `killAndReplaceExecutor`, 
> which made `TaskSchedulerImpl.executorLost` run twice. `copiesRunning(index) 
> -= 1` is processed in `executorLost`, so two `executorLost` calls drive 
> `copiesRunning(index)` to -1, which leaves the stage hanging.
> Related log from when the issue occurred:
>  21/08/05 02:58:14,784 INFO [dispatcher-event-loop-8] TaskSetManager: 
> Starting task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 
> 366724, partition 4004, ANY, 7994 bytes)
>  21/08/05 03:00:24,126 ERROR [dispatcher-event-loop-4] TaskSchedulerImpl: 
> Lost executor 366724 on 10.109.89.3: Executor heartbeat timed out after 
> 140830 ms
>  21/08/05 03:00:24,218 WARN [dispatcher-event-loop-4] TaskSetManager: Lost 
> task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724): 
> ExecutorLostFailure (executor 366724 exited caused by one of the running 
> tasks) Reason: Executor heartbeat timed out after 140830 ms
>  21/08/05 03:00:24,542 INFO [task-result-getter-2] TaskSetManager: Finished 
> task 4004.0 in stage 1328625.0 (TID 347212402) in 129758 ms on 10.109.89.3 
> (executor 366724) (3047/5400)
> 21/08/05 03:00:34,621 INFO [dispatcher-event-loop-8] TaskSchedulerImpl: 
> Executor 366724 on 10.109.89.3 killed by driver.
>  21/08/05 03:00:34,771 INFO [spark-listener-group-executorManagement] 
> ExecutorMonitor: Executor 366724 removed (new total is 793)
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417416)
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(366724, 
> 10.109.89.3, 43402, None)
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417416)
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417473)
>  21/08/05 03:00:44,584 INFO [dispatcher-event-loop-15] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417473)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36575) Executor lost may cause spark stage to hang

2021-11-09 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-36575:
-
Issue Type: Improvement  (was: Bug)

> Executor lost may cause spark stage to hang
> ---
>
> Key: SPARK-36575
> URL: https://issues.apache.org/jira/browse/SPARK-36575
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.3
>Reporter: hujiahua
>Assignee: wuyi
>Priority: Major
>
> When an executor finishes a task of some stage, the driver receives a 
> `StatusUpdate` event to handle it. At the same time the driver may find that 
> the executor's heartbeat has timed out, so the driver also needs to handle an 
> ExecutorLost event simultaneously. There is a race condition here which can 
> leave the task never rescheduled and the stage hung.
>  The problem is that `TaskResultGetter.enqueueSuccessfulTask` uses an 
> asynchronous thread to handle the successful task, which means the 
> synchronized lock of `TaskSchedulerImpl` is released prematurely midway 
> through 
> [https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61].
>  So `TaskSchedulerImpl` may handle executorLost first, and the asynchronous 
> thread then goes on to handle the successful task. This leaves 
> `TaskSetManager.successful` and `TaskSetManager.tasksSuccessful` with wrong 
> results.
> Then `HeartbeatReceiver.expireDeadHosts` executed `killAndReplaceExecutor`, 
> which made `TaskSchedulerImpl.executorLost` run twice. `copiesRunning(index) 
> -= 1` is processed in `executorLost`, so two `executorLost` calls drive 
> `copiesRunning(index)` to -1, which leaves the stage hanging.
> Related log from when the issue occurred:
>  21/08/05 02:58:14,784 INFO [dispatcher-event-loop-8] TaskSetManager: 
> Starting task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 
> 366724, partition 4004, ANY, 7994 bytes)
>  21/08/05 03:00:24,126 ERROR [dispatcher-event-loop-4] TaskSchedulerImpl: 
> Lost executor 366724 on 10.109.89.3: Executor heartbeat timed out after 
> 140830 ms
>  21/08/05 03:00:24,218 WARN [dispatcher-event-loop-4] TaskSetManager: Lost 
> task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724): 
> ExecutorLostFailure (executor 366724 exited caused by one of the running 
> tasks) Reason: Executor heartbeat timed out after 140830 ms
>  21/08/05 03:00:24,542 INFO [task-result-getter-2] TaskSetManager: Finished 
> task 4004.0 in stage 1328625.0 (TID 347212402) in 129758 ms on 10.109.89.3 
> (executor 366724) (3047/5400)
> 21/08/05 03:00:34,621 INFO [dispatcher-event-loop-8] TaskSchedulerImpl: 
> Executor 366724 on 10.109.89.3 killed by driver.
>  21/08/05 03:00:34,771 INFO [spark-listener-group-executorManagement] 
> ExecutorMonitor: Executor 366724 removed (new total is 793)
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417416)
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(366724, 
> 10.109.89.3, 43402, None)
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417416)
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417473)
>  21/08/05 03:00:44,584 INFO [dispatcher-event-loop-15] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417473)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36575) Executor lost may cause spark stage to hang

2021-11-09 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-36575:
-
Fix Version/s: 3.3.0

> Executor lost may cause spark stage to hang
> ---
>
> Key: SPARK-36575
> URL: https://issues.apache.org/jira/browse/SPARK-36575
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.3
>Reporter: hujiahua
>Assignee: wuyi
>Priority: Major
> Fix For: 3.3.0
>
>
> When an executor finishes a task of some stage, the driver receives a 
> `StatusUpdate` event to handle it. At the same time the driver may find that 
> the executor's heartbeat has timed out, so the driver also needs to handle an 
> ExecutorLost event simultaneously. There is a race condition here which can 
> leave the task never rescheduled and the stage hung.
>  The problem is that `TaskResultGetter.enqueueSuccessfulTask` uses an 
> asynchronous thread to handle the successful task, which means the 
> synchronized lock of `TaskSchedulerImpl` is released prematurely midway 
> through 
> [https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61].
>  So `TaskSchedulerImpl` may handle executorLost first, and the asynchronous 
> thread then goes on to handle the successful task. This leaves 
> `TaskSetManager.successful` and `TaskSetManager.tasksSuccessful` with wrong 
> results.
> Then `HeartbeatReceiver.expireDeadHosts` executed `killAndReplaceExecutor`, 
> which made `TaskSchedulerImpl.executorLost` run twice. `copiesRunning(index) 
> -= 1` is processed in `executorLost`, so two `executorLost` calls drive 
> `copiesRunning(index)` to -1, which leaves the stage hanging.
> Related log from when the issue occurred:
>  21/08/05 02:58:14,784 INFO [dispatcher-event-loop-8] TaskSetManager: 
> Starting task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 
> 366724, partition 4004, ANY, 7994 bytes)
>  21/08/05 03:00:24,126 ERROR [dispatcher-event-loop-4] TaskSchedulerImpl: 
> Lost executor 366724 on 10.109.89.3: Executor heartbeat timed out after 
> 140830 ms
>  21/08/05 03:00:24,218 WARN [dispatcher-event-loop-4] TaskSetManager: Lost 
> task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724): 
> ExecutorLostFailure (executor 366724 exited caused by one of the running 
> tasks) Reason: Executor heartbeat timed out after 140830 ms
>  21/08/05 03:00:24,542 INFO [task-result-getter-2] TaskSetManager: Finished 
> task 4004.0 in stage 1328625.0 (TID 347212402) in 129758 ms on 10.109.89.3 
> (executor 366724) (3047/5400)
> 21/08/05 03:00:34,621 INFO [dispatcher-event-loop-8] TaskSchedulerImpl: 
> Executor 366724 on 10.109.89.3 killed by driver.
>  21/08/05 03:00:34,771 INFO [spark-listener-group-executorManagement] 
> ExecutorMonitor: Executor 366724 removed (new total is 793)
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417416)
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(366724, 
> 10.109.89.3, 43402, None)
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417416)
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417473)
>  21/08/05 03:00:44,584 INFO [dispatcher-event-loop-15] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417473)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36575) Executor lost may cause spark stage to hang

2021-11-09 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi reassigned SPARK-36575:


Assignee: wuyi

> Executor lost may cause spark stage to hang
> ---
>
> Key: SPARK-36575
> URL: https://issues.apache.org/jira/browse/SPARK-36575
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.3
>Reporter: hujiahua
>Assignee: wuyi
>Priority: Major
>
> When an executor finishes a task of some stage, the driver receives a 
> `StatusUpdate` event to handle it. At the same time the driver may find that 
> the executor's heartbeat has timed out, so the driver also needs to handle an 
> ExecutorLost event simultaneously. There is a race condition here which can 
> leave the task never rescheduled and the stage hung.
>  The problem is that `TaskResultGetter.enqueueSuccessfulTask` uses an 
> asynchronous thread to handle the successful task, which means the 
> synchronized lock of `TaskSchedulerImpl` is released prematurely midway 
> through 
> [https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61].
>  So `TaskSchedulerImpl` may handle executorLost first, and the asynchronous 
> thread then goes on to handle the successful task. This leaves 
> `TaskSetManager.successful` and `TaskSetManager.tasksSuccessful` with wrong 
> results.
> Then `HeartbeatReceiver.expireDeadHosts` executed `killAndReplaceExecutor`, 
> which made `TaskSchedulerImpl.executorLost` run twice. `copiesRunning(index) 
> -= 1` is processed in `executorLost`, so two `executorLost` calls drive 
> `copiesRunning(index)` to -1, which leaves the stage hanging.
> Related log from when the issue occurred:
>  21/08/05 02:58:14,784 INFO [dispatcher-event-loop-8] TaskSetManager: 
> Starting task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 
> 366724, partition 4004, ANY, 7994 bytes)
>  21/08/05 03:00:24,126 ERROR [dispatcher-event-loop-4] TaskSchedulerImpl: 
> Lost executor 366724 on 10.109.89.3: Executor heartbeat timed out after 
> 140830 ms
>  21/08/05 03:00:24,218 WARN [dispatcher-event-loop-4] TaskSetManager: Lost 
> task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724): 
> ExecutorLostFailure (executor 366724 exited caused by one of the running 
> tasks) Reason: Executor heartbeat timed out after 140830 ms
>  21/08/05 03:00:24,542 INFO [task-result-getter-2] TaskSetManager: Finished 
> task 4004.0 in stage 1328625.0 (TID 347212402) in 129758 ms on 10.109.89.3 
> (executor 366724) (3047/5400)
> 21/08/05 03:00:34,621 INFO [dispatcher-event-loop-8] TaskSchedulerImpl: 
> Executor 366724 on 10.109.89.3 killed by driver.
>  21/08/05 03:00:34,771 INFO [spark-listener-group-executorManagement] 
> ExecutorMonitor: Executor 366724 removed (new total is 793)
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417416)
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(366724, 
> 10.109.89.3, 43402, None)
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417416)
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417473)
>  21/08/05 03:00:44,584 INFO [dispatcher-event-loop-15] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417473)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36575) Executor lost may cause spark stage to hang

2021-11-09 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi resolved SPARK-36575.
--
Resolution: Fixed

> Executor lost may cause spark stage to hang
> ---
>
> Key: SPARK-36575
> URL: https://issues.apache.org/jira/browse/SPARK-36575
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.3
>Reporter: hujiahua
>Priority: Major
>
> When an executor finishes a task of some stage, the driver receives a 
> `StatusUpdate` event to handle it. At the same time the driver may find that 
> the executor's heartbeat has timed out, so the driver also needs to handle an 
> ExecutorLost event simultaneously. There is a race condition here which can 
> leave the task never rescheduled and the stage hung.
>  The problem is that `TaskResultGetter.enqueueSuccessfulTask` uses an 
> asynchronous thread to handle the successful task, which means the 
> synchronized lock of `TaskSchedulerImpl` is released prematurely midway 
> through 
> [https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61].
>  So `TaskSchedulerImpl` may handle executorLost first, and the asynchronous 
> thread then goes on to handle the successful task. This leaves 
> `TaskSetManager.successful` and `TaskSetManager.tasksSuccessful` with wrong 
> results.
> Then `HeartbeatReceiver.expireDeadHosts` executed `killAndReplaceExecutor`, 
> which made `TaskSchedulerImpl.executorLost` run twice. `copiesRunning(index) 
> -= 1` is processed in `executorLost`, so two `executorLost` calls drive 
> `copiesRunning(index)` to -1, which leaves the stage hanging.
> Related log from when the issue occurred:
>  21/08/05 02:58:14,784 INFO [dispatcher-event-loop-8] TaskSetManager: 
> Starting task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 
> 366724, partition 4004, ANY, 7994 bytes)
>  21/08/05 03:00:24,126 ERROR [dispatcher-event-loop-4] TaskSchedulerImpl: 
> Lost executor 366724 on 10.109.89.3: Executor heartbeat timed out after 
> 140830 ms
>  21/08/05 03:00:24,218 WARN [dispatcher-event-loop-4] TaskSetManager: Lost 
> task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724): 
> ExecutorLostFailure (executor 366724 exited caused by one of the running 
> tasks) Reason: Executor heartbeat timed out after 140830 ms
>  21/08/05 03:00:24,542 INFO [task-result-getter-2] TaskSetManager: Finished 
> task 4004.0 in stage 1328625.0 (TID 347212402) in 129758 ms on 10.109.89.3 
> (executor 366724) (3047/5400)
> 21/08/05 03:00:34,621 INFO [dispatcher-event-loop-8] TaskSchedulerImpl: 
> Executor 366724 on 10.109.89.3 killed by driver.
>  21/08/05 03:00:34,771 INFO [spark-listener-group-executorManagement] 
> ExecutorMonitor: Executor 366724 removed (new total is 793)
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417416)
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(366724, 
> 10.109.89.3, 43402, None)
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417416)
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417473)
>  21/08/05 03:00:44,584 INFO [dispatcher-event-loop-15] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417473)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37263) Reduce pandas-on-Spark warning for internal usage.

2021-11-09 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-37263:
---

 Summary: Reduce pandas-on-Spark warning for internal usage.
 Key: SPARK-37263
 URL: https://issues.apache.org/jira/browse/SPARK-37263
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Haejoon Lee


Raised from comment 
https://github.com/apache/spark/pull/34389#discussion_r741733023.

The advice warning that the pandas API on Spark issues for expensive APIs 
(https://github.com/apache/spark/pull/34389#discussion_r741733023) is now 
emitted too often, since the warning is also issued when the APIs are used 
internally.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37262) Not log empty aggregate and group by in JDBCScan

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37262:


Assignee: (was: Apache Spark)

> Not log empty aggregate and group by in JDBCScan
> 
>
> Key: SPARK-37262
> URL: https://issues.apache.org/jira/browse/SPARK-37262
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Address the comment 
> https://github.com/apache/spark/pull/34451#discussion_r740220800
> Current behavior:
> {code:java}
> Scan 
> org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
> [NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), 
> GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: 
> struct
> {code}
> After the fix, it will be
> {code:java}
> Scan 
> org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
> [NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), 
> GreaterThan(SALARY,100.00)], ReadSchema: 
> struct
> {code}
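
A minimal sketch of the idea behind the change quoted above (a hypothetical 
helper, not the actual JDBCScan code): only include the pushed-aggregate and 
group-by fragments in the scan description when something was actually pushed 
down.

{code:java}
// Hypothetical sketch: build a scan description string, omitting empty pushdown sections.
object ScanDescriptionSketch extends App {
  def describe(pushedAggregates: Seq[String],
               pushedFilters: Seq[String],
               pushedGroupBy: Seq[String]): String = {
    val parts = Seq(
      if (pushedAggregates.nonEmpty) Some(s"PushedAggregates: [${pushedAggregates.mkString(", ")}]") else None,
      Some(s"PushedFilters: [${pushedFilters.mkString(", ")}]"),
      if (pushedGroupBy.nonEmpty) Some(s"PushedGroupby: [${pushedGroupBy.mkString(", ")}]") else None
    ).flatten
    parts.mkString(", ")
  }

  // With no aggregate or group-by pushdown, the empty sections are simply not printed:
  println(describe(Nil, Seq("IsNotNull(SALARY)", "GreaterThan(SALARY,100.00)"), Nil))
  // PushedFilters: [IsNotNull(SALARY), GreaterThan(SALARY,100.00)]
}
{code}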



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37262) Not log empty aggregate and group by in JDBCScan

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37262:


Assignee: Apache Spark

> Not log empty aggregate and group by in JDBCScan
> 
>
> Key: SPARK-37262
> URL: https://issues.apache.org/jira/browse/SPARK-37262
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Address the comment 
> https://github.com/apache/spark/pull/34451#discussion_r740220800
> Current behavior:
> {code:java}
> Scan 
> org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
> [NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), 
> GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: 
> struct
> {code}
> After the fix, it will be
> {code:java}
> Scan 
> org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
> [NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), 
> GreaterThan(SALARY,100.00)], ReadSchema: 
> struct
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37262) Not log empty aggregate and group by in JDBCScan

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441429#comment-17441429
 ] 

Apache Spark commented on SPARK-37262:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34540

> Not log empty aggregate and group by in JDBCScan
> 
>
> Key: SPARK-37262
> URL: https://issues.apache.org/jira/browse/SPARK-37262
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Address the comment 
> https://github.com/apache/spark/pull/34451#discussion_r740220800
> Current behavior:
> {code:java}
> Scan 
> org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
> [NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), 
> GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: 
> struct
> {code}
> After the fix, it will be
> {code:java}
> Scan 
> org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
> [NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), 
> GreaterThan(SALARY,100.00)], ReadSchema: 
> struct
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37120) Add Java17 GitHub Action build and test job

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441425#comment-17441425
 ] 

Apache Spark commented on SPARK-37120:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34539

> Add Java17 GitHub Action build and test job
> ---
>
> Key: SPARK-37120
> URL: https://issues.apache.org/jira/browse/SPARK-37120
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>
> Now run
> {code:java}
> build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
> {code}
> to build and test the whole project (HEAD is 
> 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the 
> UTs have passed.
>  
> {code:java}
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  1.971 
> s]
> [INFO] Spark Project Tags . SUCCESS [  2.170 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 14.008 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  2.466 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 49.650 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  7.095 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  1.826 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  1.851 
> s]
> [INFO] Spark Project Core . SUCCESS [24:40 
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [01:27 
> min]
> [INFO] Spark Project Streaming  SUCCESS [04:57 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [07:56 
> min]
> [INFO] Spark Project SQL .. SUCCESS [  01:01 
> h]
> [INFO] Spark Project ML Library ... SUCCESS [16:46 
> min]
> [INFO] Spark Project Tools  SUCCESS [  0.748 
> s]
> [INFO] Spark Project Hive . SUCCESS [  01:11 
> h]
> [INFO] Spark Project REPL . SUCCESS [01:26 
> min]
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [  0.967 
> s]
> [INFO] Spark Project YARN . SUCCESS [06:54 
> min]
> [INFO] Spark Project Mesos  SUCCESS [ 46.913 
> s]
> [INFO] Spark Project Kubernetes ... SUCCESS [01:08 
> min]
> [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 
> min]
> [INFO] Spark Ganglia Integration .. SUCCESS [  4.610 
> s]
> [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 
> s]
> [INFO] Spark Project Assembly . SUCCESS [  2.496 
> s]
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 
> s]
> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 
> min]
> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [35:06 
> min]
> [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 
> s]
> [INFO] Spark Project Examples . SUCCESS [ 32.189 
> s]
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [  0.949 
> s]
> [INFO] Spark Avro . SUCCESS [01:55 
> min]
> [INFO] Spark Project Kinesis Assembly . SUCCESS [  1.104 
> s]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time:  04:19 h
> [INFO] Finished at: 2021-10-26T20:02:56+08:00
> [INFO] 
> 
> {code}
> So should we add a Jenkins build and test job for Java 17?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37262) Not log empty aggregate and group by in JDBCScan

2021-11-09 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-37262:
---
Description: 
Address the comment 
https://github.com/apache/spark/pull/34451#discussion_r740220800

Current behavior:

{code:java}
Scan 
org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
[NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), 
GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: 
struct
{code}


After the fix, it will be

{code:java}
Scan 
org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
[NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), 
GreaterThan(SALARY,100.00)], ReadSchema: 
struct
{code}



  was:
Address the comment 
https://github.com/apache/spark/pull/34451#discussion_r740220800

Current behavior:
Scan 
org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
[NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), 
GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: 
struct
After the fix, it will be
Scan 
org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
[NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), 
GreaterThan(SALARY,100.00)], ReadSchema: 
struct



> Not log empty aggregate and group by in JDBCScan
> 
>
> Key: SPARK-37262
> URL: https://issues.apache.org/jira/browse/SPARK-37262
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Address the comment 
> https://github.com/apache/spark/pull/34451#discussion_r740220800
> Current behavior:
> {code:java}
> Scan 
> org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
> [NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), 
> GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: 
> struct
> {code}
> After the fix, it will be
> {code:java}
> Scan 
> org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
> [NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), 
> GreaterThan(SALARY,100.00)], ReadSchema: 
> struct
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37262) Not log empty aggregate and group by in JDBCScan

2021-11-09 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-37262:
--

 Summary: Not log empty aggregate and group by in JDBCScan
 Key: SPARK-37262
 URL: https://issues.apache.org/jira/browse/SPARK-37262
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0, 3.3.0
Reporter: Huaxin Gao


Address the comment 
https://github.com/apache/spark/pull/34451#discussion_r740220800

Current behavior:
Scan 
org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
[NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), 
GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: 
struct
After the fix, it will be
Scan 
org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@72e75786 
[NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), 
GreaterThan(SALARY,100.00)], ReadSchema: 
struct
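
As a rough illustration of the intended change (the method and parameter names below are made up for this sketch and are not the actual JDBCScan code), the description string would only mention pushed-down clauses that are non-empty:

{code:java}
// Illustrative sketch only: build the scan description from pushed-down
// clauses, omitting the ones that are empty.
def scanDescription(
    pushedAggregates: Seq[String],
    pushedGroupBy: Seq[String],
    pushedFilters: Seq[String]): String = {
  def clause(name: String, items: Seq[String]): Option[String] =
    if (items.isEmpty) None else Some(s"$name: [${items.mkString(", ")}]")

  Seq(
    clause("PushedAggregates", pushedAggregates),
    clause("PushedGroupby", pushedGroupBy),
    clause("PushedFilters", pushedFilters)
  ).flatten.mkString(", ")
}

// scanDescription(Nil, Nil, Seq("IsNotNull(SALARY)", "GreaterThan(SALARY,100.00)"))
// returns: PushedFilters: [IsNotNull(SALARY), GreaterThan(SALARY,100.00)]
{code}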




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37221) The collect-like API in SparkPlan should support columnar output

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441339#comment-17441339
 ] 

Apache Spark commented on SPARK-37221:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/34538

> The collect-like API in SparkPlan should support columnar output
> 
>
> Key: SPARK-37221
> URL: https://issues.apache.org/jira/browse/SPARK-37221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, the collect-like APIs in SparkPlan, e.g. executeCollect, do not 
> work if the plan uses columnar execution.
> We should extend the API coverage to all execution modes.
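
A minimal sketch of the idea (not the actual change in the linked pull request; it assumes the plan tree is otherwise ready to execute) is to route collect through a columnar-to-row conversion when the plan only supports columnar execution:

{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.{ColumnarToRowExec, SparkPlan}

// Sketch: wrap a columnar plan in ColumnarToRowExec before collecting rows,
// so a row-based collect works regardless of the plan's execution mode.
def collectRows(plan: SparkPlan): Array[InternalRow] = {
  val rowBased = if (plan.supportsColumnar) ColumnarToRowExec(plan) else plan
  rowBased.executeCollect()
}
{code}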



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37261) Check adding partitions with ANSI intervals

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37261:


Assignee: Apache Spark  (was: Max Gekk)

> Check adding partitions with ANSI intervals
> ---
>
> Key: SPARK-37261
> URL: https://issues.apache.org/jira/browse/SPARK-37261
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add tests that should check adding partitions with ANSI intervals via the 
> ALTER TABLE command.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37261) Check adding partitions with ANSI intervals

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441313#comment-17441313
 ] 

Apache Spark commented on SPARK-37261:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/34537

> Check adding partitions with ANSI intervals
> ---
>
> Key: SPARK-37261
> URL: https://issues.apache.org/jira/browse/SPARK-37261
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add tests that should check adding partitions with ANSI intervals via the 
> ALTER TABLE command.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37261) Check adding partitions with ANSI intervals

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37261:


Assignee: Max Gekk  (was: Apache Spark)

> Check adding partitions with ANSI intervals
> ---
>
> Key: SPARK-37261
> URL: https://issues.apache.org/jira/browse/SPARK-37261
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add tests that should check adding partitions with ANSI intervals via the 
> ALTER TABLE command.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37261) Check adding partitions with ANSI intervals

2021-11-09 Thread Max Gekk (Jira)
Max Gekk created SPARK-37261:


 Summary: Check adding partitions with ANSI intervals
 Key: SPARK-37261
 URL: https://issues.apache.org/jira/browse/SPARK-37261
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk
Assignee: Max Gekk


Add tests that should check adding partitions with ANSI intervals via the ALTER 
TABLE command.
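
A hedged sketch of the kind of coverage this asks for, assuming a running SparkSession `spark` (the table and column names are made up, and whether each step succeeds is exactly what the new tests should verify):

{code:java}
// Partition a table by an ANSI year-month interval column and add a
// partition through ALTER TABLE.
spark.sql("CREATE TABLE t (id INT, ym INTERVAL YEAR TO MONTH) USING parquet PARTITIONED BY (ym)")
spark.sql("ALTER TABLE t ADD PARTITION (ym = INTERVAL '1-1' YEAR TO MONTH)")
spark.sql("SHOW PARTITIONS t").show(truncate = false)
{code}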



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid

2021-11-09 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-37260:
-

 Summary: PYSPARK Arrow 3.2.0 docs link invalid
 Key: SPARK-37260
 URL: https://issues.apache.org/jira/browse/SPARK-37260
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 3.2.0
Reporter: Thomas Graves


[http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html]

links to:

[https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html]

which links to:

[https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst]

But that is an invalid link.

I assume it's supposed to point to:

https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37246) Run PyArrow tests on Python 3.10

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37246:
--
Issue Type: Bug  (was: Umbrella)

> Run PyArrow tests on Python 3.10
> 
>
> Key: SPARK-37246
> URL: https://issues.apache.org/jira/browse/SPARK-37246
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Though PyArrow doesn’t support Python 3.10 yet 
> ([https://pypi.org/project/pyarrow/]), we want to adjust the PySpark code 
> gradually to support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37246) Run PyArrow tests on Python 3.10

2021-11-09 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441257#comment-17441257
 ] 

Dongjoon Hyun commented on SPARK-37246:
---

Never mind. This is also valuable. I converted this umbrella JIRA into a new 
issue `Run PyArrow tests on Python 3.10` to reuse the JIRA ID.

> Run PyArrow tests on Python 3.10
> 
>
> Key: SPARK-37246
> URL: https://issues.apache.org/jira/browse/SPARK-37246
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Though PyArrow doesn’t support Python 3.10 yet 
> ([https://pypi.org/project/pyarrow/]), we want to adjust the PySpark code 
> gradually to support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-37246) Run PyArrow tests on Python 3.10

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-37246:
---

> Run PyArrow tests on Python 3.10
> 
>
> Key: SPARK-37246
> URL: https://issues.apache.org/jira/browse/SPARK-37246
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Though PyArrow doesn’t support Python 3.10 yet 
> ([https://pypi.org/project/pyarrow/]), we want to adjust the PySpark code 
> gradually to support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-37247) Failed test_create_nan_decimal_dataframe (pyspark.sql.tests.test_dataframe.DataFrameTests)

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-37247.
-

> Failed test_create_nan_decimal_dataframe 
> (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
>
> Key: SPARK-37247
> URL: https://issues.apache.org/jira/browse/SPARK-37247
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {code:java}
>   File 
> "/Users/xinrong.meng/spark/python/pyspark/sql/tests/test_dataframe.py", line 
> 957, in test_create_nan_decimal_dataframe
>     self.spark.createDataFrame(data=[Decimal('NaN')], 
> schema='decimal').collect(),
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/dataframe.py", line 751, 
> in collect
>     sock_info = self._jdf.collectToPython()
>   File 
> "/Users/xinrong.meng/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py",
>  line 1309, in __call__
>     return_value = get_return_value(
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/utils.py", line 178, in 
> deco
>     return f(*a, **kw)
>   File 
> "/Users/xinrong.meng/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py",
>  line 326, in get_return_value
>     raise Py4JJavaError(
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o135.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 
> in stage 2.0 failed 1 times, most recent failure: Lost task 3.0 in stage 2.0 
> (TID 7) (172.16.203.223 executor driver): 
> net.razorvine.pickle.PickleException: problem construction object: 
> java.lang.reflect.InvocationTargetException
> ...{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37246) Run PyArrow tests on Python 3.10

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37246:
--
Summary: Run PyArrow tests on Python 3.10  (was: Support Python 3.10 in 
PySpark)

> Run PyArrow tests on Python 3.10
> 
>
> Key: SPARK-37246
> URL: https://issues.apache.org/jira/browse/SPARK-37246
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Though PyArrow doesn’t support Python 3.10 yet 
> ([https://pypi.org/project/pyarrow/]), we want to adjust the PySpark code 
> gradually to support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37247) Failed test_create_nan_decimal_dataframe (pyspark.sql.tests.test_dataframe.DataFrameTests)

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37247:
--
Parent: (was: SPARK-37246)
Issue Type: Bug  (was: Sub-task)

> Failed test_create_nan_decimal_dataframe 
> (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
>
> Key: SPARK-37247
> URL: https://issues.apache.org/jira/browse/SPARK-37247
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {code:java}
>   File 
> "/Users/xinrong.meng/spark/python/pyspark/sql/tests/test_dataframe.py", line 
> 957, in test_create_nan_decimal_dataframe
>     self.spark.createDataFrame(data=[Decimal('NaN')], 
> schema='decimal').collect(),
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/dataframe.py", line 751, 
> in collect
>     sock_info = self._jdf.collectToPython()
>   File 
> "/Users/xinrong.meng/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py",
>  line 1309, in __call__
>     return_value = get_return_value(
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/utils.py", line 178, in 
> deco
>     return f(*a, **kw)
>   File 
> "/Users/xinrong.meng/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py",
>  line 326, in get_return_value
>     raise Py4JJavaError(
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o135.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 
> in stage 2.0 failed 1 times, most recent failure: Lost task 3.0 in stage 2.0 
> (TID 7) (172.16.203.223 executor driver): 
> net.razorvine.pickle.PickleException: problem construction object: 
> java.lang.reflect.InvocationTargetException
> ...{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37248) Failed test_make_date (pyspark.sql.tests.test_functions.FunctionsTests)

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37248:
--
Parent: (was: SPARK-37246)
Issue Type: Bug  (was: Sub-task)

> Failed test_make_date (pyspark.sql.tests.test_functions.FunctionsTests)
> ---
>
> Key: SPARK-37248
> URL: https://issues.apache.org/jira/browse/SPARK-37248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/xinrong.meng/spark/python/pyspark/sql/tests/test_functions.py", line 
> 247, in test_make_date
>     row_from_col = df.select(make_date(df.Y, df.M, df.D)).first()
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/functions.py", line 
> 2159, in make_date
>     jc = sc._jvm.functions.make_date(year_col, month_col, day_col)
>   File 
> "/Users/xinrong.meng/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py",
>  line 1535, in __getattr__
>     raise Py4JError(
> py4j.protocol.Py4JError: org.apache.spark.sql.functions.make_date does not 
> exist in the JVM
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-37248) Failed test_make_date (pyspark.sql.tests.test_functions.FunctionsTests)

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-37248.
-

> Failed test_make_date (pyspark.sql.tests.test_functions.FunctionsTests)
> ---
>
> Key: SPARK-37248
> URL: https://issues.apache.org/jira/browse/SPARK-37248
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/xinrong.meng/spark/python/pyspark/sql/tests/test_functions.py", line 
> 247, in test_make_date
>     row_from_col = df.select(make_date(df.Y, df.M, df.D)).first()
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/functions.py", line 
> 2159, in make_date
>     jc = sc._jvm.functions.make_date(year_col, month_col, day_col)
>   File 
> "/Users/xinrong.meng/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py",
>  line 1535, in __getattr__
>     raise Py4JError(
> py4j.protocol.Py4JError: org.apache.spark.sql.functions.make_date does not 
> exist in the JVM
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37249) ImportError: cannot import name 'Callable' from 'collections'

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37249:
--
Parent: (was: SPARK-37246)
Issue Type: Bug  (was: Sub-task)

> ImportError: cannot import name 'Callable' from 'collections'
> -
>
> Key: SPARK-37249
> URL: https://issues.apache.org/jira/browse/SPARK-37249
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37250) Failed test_capture_user_friendly_exception (pyspark.sql.tests.test_utils.UtilsTests)

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37250:
--
Parent: (was: SPARK-37246)
Issue Type: Bug  (was: Sub-task)

> Failed test_capture_user_friendly_exception 
> (pyspark.sql.tests.test_utils.UtilsTests)
> -
>
> Key: SPARK-37250
> URL: https://issues.apache.org/jira/browse/SPARK-37250
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/tests/test_utils.py", 
> line 34, in test_capture_user_friendly_exception
>     self.spark.sql("select `中文字段`")
> pyspark.sql.utils.AnalysisException: cannot resolve '`中文字段`' given input 
> columns: []; line 1 pos 7;
> 'Project ['中文字段]
> +- OneRowRelation {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-37251) Failed _joinAsOf doctest

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-37251.
-

> Failed _joinAsOf doctest
> 
>
> Key: SPARK-37251
> URL: https://issues.apache.org/jira/browse/SPARK-37251
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {code:java}
> File "/Users/xinrong.meng/spark/python/pyspark/sql/dataframe.py", line 1523, 
> in pyspark.sql.dataframe.DataFrame._joinAsOf
> Failed example:
>     left._joinAsOf(
>         right, leftAsOfColumn="a", rightAsOfColumn="a", how="left", 
> tolerance=F.lit(1)
>     ).select(left.a, 'left_val', 'right_val').sort("a").collect()
> Exception raised:
>     Traceback (most recent call last):
>       File "/opt/miniconda3/envs/py10/lib/python3.10/doctest.py", line 1348, 
> in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, 
> in 
>         left._joinAsOf(
>       File "/Users/xinrong.meng/spark/python/pyspark/sql/dataframe.py", line 
> 1578, in _joinAsOf
>         jdf = self._jdf.joinAsOf(
>       File 
> "/Users/xinrong.meng/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py",
>  line 1309, in __call__
>         return_value = get_return_value(
>       File "/Users/xinrong.meng/spark/python/pyspark/sql/utils.py", line 178, 
> in deco
>         return f(*a, **kw)
>       File 
> "/Users/xinrong.meng/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py",
>  line 330, in get_return_value
>         raise Py4JError(
>     py4j.protocol.Py4JError: An error occurred while calling o283.joinAsOf. 
> Trace:
>     py4j.Py4JException: Method joinAsOf([class org.apache.spark.sql.Dataset, 
> class org.apache.spark.sql.Column, class org.apache.spark.sql.Column, null, 
> class java.lang.String, class org.apache.spark.sql.Column, class 
> java.lang.Boolean, class java.lang.String]) does not exist
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37251) Failed _joinAsOf doctest

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37251:
--
Parent: (was: SPARK-37246)
Issue Type: Bug  (was: Sub-task)

> Failed _joinAsOf doctest
> 
>
> Key: SPARK-37251
> URL: https://issues.apache.org/jira/browse/SPARK-37251
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {code:java}
> File "/Users/xinrong.meng/spark/python/pyspark/sql/dataframe.py", line 1523, 
> in pyspark.sql.dataframe.DataFrame._joinAsOf
> Failed example:
>     left._joinAsOf(
>         right, leftAsOfColumn="a", rightAsOfColumn="a", how="left", 
> tolerance=F.lit(1)
>     ).select(left.a, 'left_val', 'right_val').sort("a").collect()
> Exception raised:
>     Traceback (most recent call last):
>       File "/opt/miniconda3/envs/py10/lib/python3.10/doctest.py", line 1348, 
> in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, 
> in 
>         left._joinAsOf(
>       File "/Users/xinrong.meng/spark/python/pyspark/sql/dataframe.py", line 
> 1578, in _joinAsOf
>         jdf = self._jdf.joinAsOf(
>       File 
> "/Users/xinrong.meng/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py",
>  line 1309, in __call__
>         return_value = get_return_value(
>       File "/Users/xinrong.meng/spark/python/pyspark/sql/utils.py", line 178, 
> in deco
>         return f(*a, **kw)
>       File 
> "/Users/xinrong.meng/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py",
>  line 330, in get_return_value
>         raise Py4JError(
>     py4j.protocol.Py4JError: An error occurred while calling o283.joinAsOf. 
> Trace:
>     py4j.Py4JException: Method joinAsOf([class org.apache.spark.sql.Dataset, 
> class org.apache.spark.sql.Column, class org.apache.spark.sql.Column, null, 
> class java.lang.String, class org.apache.spark.sql.Column, class 
> java.lang.Boolean, class java.lang.String]) does not exist
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-37250) Failed test_capture_user_friendly_exception (pyspark.sql.tests.test_utils.UtilsTests)

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-37250.
-

> Failed test_capture_user_friendly_exception 
> (pyspark.sql.tests.test_utils.UtilsTests)
> -
>
> Key: SPARK-37250
> URL: https://issues.apache.org/jira/browse/SPARK-37250
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/tests/test_utils.py", 
> line 34, in test_capture_user_friendly_exception
>     self.spark.sql("select `中文字段`")
> pyspark.sql.utils.AnalysisException: cannot resolve '`中文字段`' given input 
> columns: []; line 1 pos 7;
> 'Project ['中文字段]
> +- OneRowRelation {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37257) Update setup.py for Python 3.10

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37257.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34533
[https://github.com/apache/spark/pull/34533]

> Update setup.py for Python 3.10
> ---
>
> Key: SPARK-37257
> URL: https://issues.apache.org/jira/browse/SPARK-37257
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.3.0
>
>
> SPARK-37244 makes sure that PySpark works with Python 3.10. We should update 
> setup.py to officially support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37257) Update setup.py for Python 3.10

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37257:
-

Assignee: Hyukjin Kwon

> Update setup.py for Python 3.10
> ---
>
> Key: SPARK-37257
> URL: https://issues.apache.org/jira/browse/SPARK-37257
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> SPARK-37244 makes sure that PySpark works with Python 3.10. We should update 
> setup.py to officially support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35011) Avoid Block Manager registerations when StopExecutor msg is in-flight.

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441242#comment-17441242
 ] 

Apache Spark commented on SPARK-35011:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/34536

> Avoid Block Manager registerations when StopExecutor msg is in-flight.
> --
>
> Key: SPARK-35011
> URL: https://issues.apache.org/jira/browse/SPARK-35011
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Sumeet
>Priority: Major
>  Labels: BlockManager, core
>
> *Note:* This is a follow-up on SPARK-34949; even after the heartbeat fix, the 
> driver reports dead executors as alive.
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
> so, when the executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor 
> pods being removed from K8s, however, under the "Executors" tab in SparkUI, I 
> could see some executors listed as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues async "StopExecutor" on 
> executorEndpoint
>  * "CoarseGrainedSchedulerBackend" removes that executor from Driver's 
> internal data structures and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus".
>  * Executor has still not processed "StopExecutor" from the Driver
>  * Driver receives heartbeat from the Executor, since it cannot find the 
> "executorId" in its data structures, it responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * "BlockManager" on the Executor reregisters with the "BlockManagerMaster" 
> and "SparkListenerBlockManagerAdded" is published on the "listenerBus"
>  * Executor starts processing the "StopExecutor" and exits
>  * "AppStatusListener" picks the "SparkListenerBlockManagerAdded" event and 
> updates "AppStatusStore"
>  * "statusTracker.getExecutorInfos" refers "AppStatusStore" to get the list 
> of executors which returns the dead executor as alive.
>  
> *Proposed Solution:*
> Maintain a Cache of recently removed executors on Driver. During the 
> registration in BlockManagerMasterEndpoint if the BlockManager belongs to a 
> recently removed executor, return None indicating the registration is ignored 
> since the executor will be shutting down soon.
> On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed 
> executor, return true indicating the driver knows about it, thereby 
> preventing re-registration.
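
A minimal, self-contained sketch of the proposed cache (the class and method names are invented for illustration; the real change would live in BlockManagerMasterEndpoint and reuse the driver's existing config and clock):

{code:java}
import java.util.concurrent.ConcurrentHashMap

// Remember executors that were recently told to stop, so that late block
// manager registrations or heartbeats from them can be ignored.
class RecentlyRemovedExecutors(ttlMs: Long) {
  private val removedAt = new ConcurrentHashMap[String, java.lang.Long]()

  def markRemoved(executorId: String): Unit =
    removedAt.put(executorId, System.currentTimeMillis())

  // True if this executor was removed within the last ttlMs milliseconds.
  def shouldIgnore(executorId: String): Boolean = {
    val ts = removedAt.get(executorId)
    if (ts == null) {
      false
    } else if (System.currentTimeMillis() - ts <= ttlMs) {
      true
    } else {
      removedAt.remove(executorId)
      false
    }
  }
}
{code}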



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37259) JDBC read is always going to wrap the query in a select statement

2021-11-09 Thread Kevin Appel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Appel updated SPARK-37259:

Description: 
The JDBC read wraps the query it sends to the database server inside a 
select statement, and there is currently no way to override this.

Initially I ran into this issue when trying to run a CTE query against SQL 
Server, where it fails; the details of the failures are in these issues:

[https://github.com/microsoft/mssql-jdbc/issues/1340]

[https://github.com/microsoft/mssql-jdbc/issues/1657]

[https://github.com/microsoft/sql-spark-connector/issues/147]

https://issues.apache.org/jira/browse/SPARK-32825

https://issues.apache.org/jira/browse/SPARK-34928

I started to patch the code to get the query to run and ran into a few 
different items. If there were a way to add these features to allow this code 
path to run, it would be extremely helpful for running these types of edge-case 
queries. These are basic examples; the actual queries are much more complex and 
would require significant time to rewrite.

Inside JDBCOptions.scala the query is set to one of the following; using 
dbtable allows the query to be passed without modification:

 
{code:java}
name.trim
or
s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}"
{code}
 

Inside JDBCRelation.scala this is going to try to get the schema for this 
query, and this ends up running dialect.getSchemaQuery which is doing:
{code:java}
s"SELECT * FROM $table WHERE 1=0"{code}
Overriding the dialect here to simply return $table gets past this check and 
leads to the next issue, which is in the compute function in JDBCRDD.scala:

 
{code:java}
val sqlText = s"SELECT $columnList FROM ${options.tableOrQuery} 
$myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause"
 
{code}
 

For these two queries, a CTE query and one using temp tables, finding out the 
schema is difficult without actually running the query. For the temp table 
case, running the query during the schema check creates the table, so the 
actual query then fails because the table already exists.

 

The way I patched these is by doing these two items:

JDBCRDD.scala (compute)

 
{code:java}
    val runQueryAsIs = options.parameters.getOrElse("runQueryAsIs", 
"false").toBoolean
    val sqlText = if (runQueryAsIs) {
      s"${options.tableOrQuery}"
    } else {
      s"SELECT $columnList FROM ${options.tableOrQuery} $myWhereClause"
    }

{code}
JDBCRelation.scala (getSchema)
{code:java}
val useCustomSchema = jdbcOptions.parameters.getOrElse("useCustomSchema", 
"false").toBoolean
    if (useCustomSchema) {
      val myCustomSchema = jdbcOptions.parameters.getOrElse("customSchema", 
"").toString
      val newSchema = CatalystSqlParser.parseTableSchema(myCustomSchema)
      logInfo(s"Going to return the new $newSchema because useCustomSchema is 
$useCustomSchema and passed in $myCustomSchema")
      newSchema
    } else {
      val tableSchema = JDBCRDD.resolveTable(jdbcOptions)
      jdbcOptions.customSchema match {
      case Some(customSchema) => JdbcUtils.getCustomSchema(
        tableSchema, customSchema, resolver)
      case None => tableSchema
      }
    }{code}
 

This allows the query to run as is, by using the dbtable option and then 
providing a custom schema that bypasses the dialect schema check.

 

Test queries

 
{code:java}
query1 = """ 
SELECT 1 as DummyCOL
"""
query2 = """ 
WITH DummyCTE AS
(
SELECT 1 as DummyCOL
)
SELECT *
FROM DummyCTE
"""
query3 = """
(SELECT *
INTO #Temp1a
FROM
(SELECT @@VERSION as version) data
)
(SELECT *
FROM
#Temp1a)
"""
{code}
 

Test schema

 
{code:java}
schema1 = """
DummyXCOL INT
"""
schema2 = """
DummyXCOL STRING
"""
{code}
 

Test code

 
{code:java}
jdbcDFWorking = (
    spark.read.format("jdbc")
    .option("url", f"jdbc:sqlserver://{server}:{port};databaseName={database};")
    .option("user", user)
    .option("password", password)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", queryx)
    .option("customSchema", schemax)
    .option("useCustomSchema", "true")
    .option("runQueryAsIs", "true")
    .load()
)
 
{code}
 

Currently we ran into this on these two special SQL Server queries, but we 
aren't sure whether other DBs we use will hit this type of issue as well. 
Without going through this I didn't realize the query is always wrapped in a 
SELECT no matter what you do.

This is on Spark 3.1.2, using PySpark with Python 3.7.11.

Thank you for your consideration and assistance in finding a way to fix this.

Kevin

 

 

 

  was:
The read jdbc is wrapping the query it sends to the database server inside a 
select statement and there is no way to override this currently.

Initially I ran into this issue when trying to run a CTE query against SQL 
server and it fails, the details of the failure is in these cases:


[jira] [Commented] (SPARK-35557) Adapt uses of JDK 17 Internal APIs

2021-11-09 Thread Olivier Peyrusse (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441227#comment-17441227
 ] 

Olivier Peyrusse commented on SPARK-35557:
--

Hello, everyone,

I see a similar issue when creating a basic local session 
{{SparkSession.builder().appName("abc").config("spark.master", 
"local").getOrCreate()}}, throwing because
bq. class org.apache.spark.storage.StorageUtils$ cannot access class 
sun.nio.ch.DirectBuffer
Do you want me to open a specific ticket for this?

Link to the particular code: 
https://github.com/apache/spark/blob/8ae88d01b46d581367d0047b50fcfb65078ab972/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala#L206-L223
From what I see, on JDK 17 we could directly call {{Unsafe#invokeCleaner}}, as 
it accepts a ByteBuffer and internally casts it to a DirectBuffer. And this is 
the case since we call it with a {{MappedByteBuffer}} [2]. 
But it would require more tricks to make it work for JDK 8 (certainly something 
like loading the class by its name, using reflection to access the cleaner, 
etc.).

[2] 
https://github.com/apache/spark/blob/8ae88d01b46d581367d0047b50fcfb65078ab972/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala#L234-L237
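
For illustration, a minimal sketch of the JDK 17 friendly path described above (this assumes sun.misc.Unsafe is reachable through the jdk.unsupported module and is not the actual StorageUtils code):

{code:java}
import java.nio.{ByteBuffer, MappedByteBuffer}

// Obtain sun.misc.Unsafe reflectively and call its public
// invokeCleaner(ByteBuffer) method (available since JDK 9) to free a
// memory-mapped buffer, without constructing DirectByteBuffer reflectively.
def freeDirectBuffer(buffer: MappedByteBuffer): Unit = {
  val unsafeClass = Class.forName("sun.misc.Unsafe")
  val theUnsafeField = unsafeClass.getDeclaredField("theUnsafe")
  theUnsafeField.setAccessible(true)
  val unsafe = theUnsafeField.get(null)
  val invokeCleaner = unsafeClass.getMethod("invokeCleaner", classOf[ByteBuffer])
  invokeCleaner.invoke(unsafe, buffer)
}
{code}

As noted, a JDK 8 build would still need a different fallback, since invokeCleaner only exists on JDK 9 and later.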

> Adapt uses of JDK 17 Internal APIs
> --
>
> Key: SPARK-35557
> URL: https://issues.apache.org/jira/browse/SPARK-35557
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Ismaël Mejía
>Priority: Major
>
> I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with 
> Scala 2.12.4 on Java 17 and I found this exception:
> {code:java}
> java.lang.ExceptionInInitializerError
>  at org.apache.spark.unsafe.array.ByteArrayMethods. 
> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$. (package.scala:1149)
>  at org.apache.spark.SparkConf$. (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
> ...
> Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make 
> private java.nio.DirectByteBuffer(long,int) accessible: module java.base does 
> not "opens java.nio" to unnamed module @110df513
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
> (AccessibleObject.java:357)
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
> (AccessibleObject.java:297)
>  at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188)
>  at java.lang.reflect.Constructor.setAccessible (Constructor.java:181)
>  at org.apache.spark.unsafe.Platform. (Platform.java:56)
>  at org.apache.spark.unsafe.array.ByteArrayMethods. 
> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$. (package.scala:1149)
>  at org.apache.spark.SparkConf$. (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)}}
> {code}
> It seems that Java 17 will be more strict about uses of JDK Internals 
> [https://openjdk.java.net/jeps/403]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37259) JDBC read is always going to wrap the query in a select statement

2021-11-09 Thread Kevin Appel (Jira)
Kevin Appel created SPARK-37259:
---

 Summary: JDBC read is always going to wrap the query in a select 
statement
 Key: SPARK-37259
 URL: https://issues.apache.org/jira/browse/SPARK-37259
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: Kevin Appel


The JDBC read wraps the query it sends to the database server inside a 
select statement, and there is currently no way to override this.

Initially I ran into this issue when trying to run a CTE query against SQL 
Server, where it fails; the details of the failures are in these issues:

[https://github.com/microsoft/mssql-jdbc/issues/1340]

[https://github.com/microsoft/mssql-jdbc/issues/1657]

[https://github.com/microsoft/sql-spark-connector/issues/147]

https://issues.apache.org/jira/browse/SPARK-32825

https://issues.apache.org/jira/browse/SPARK-34928

I started to patch the code to get the query to run and ran into a few 
different items. If there were a way to add these features to allow this code 
path to run, it would be extremely helpful for running these types of edge-case 
queries. These are basic examples; the actual queries are much more complex and 
would require significant time to rewrite.

Inside JDBCOptions.scala the query is set to one of the following; using 
dbtable allows the query to be passed without modification:

 
{code:java}
name.trim
or
s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}"
{code}
 

Inside JDBCRelation.scala this is going to try to get the schema for this 
query, and this ends up running dialect.getSchemaQuery which is doing:
{code:java}
s"SELECT * FROM $table WHERE 1=0"{code}
Overriding the dialect here to simply return $table gets past this check and 
leads to the next issue, which is in the compute function in JDBCRDD.scala:

 
{code:java}
val sqlText = s"SELECT $columnList FROM ${options.tableOrQuery} 
$myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause"
 
{code}
 

For these two queries, a CTE query and one using temp tables, finding out the 
schema is difficult without actually running the query. For the temp table 
case, running the query during the schema check creates the table, so the 
actual query then fails because the table already exists.

 

The way I patched these is by doing these two items:

JDBCRDD.scala (compute)

 
{code:java}
    val runQueryAsIs = options.parameters.getOrElse("runQueryAsIs", 
"false").toBoolean
    val sqlText = if (runQueryAsIs) {
      s"${options.tableOrQuery}"
    } else {
      s"SELECT $columnList FROM ${options.tableOrQuery} $myWhereClause"
    }
{code}
JDBCRelation.scala (getSchema)
{code:java}
val useCustomSchema = jdbcOptions.parameters.getOrElse("useCustomSchema", 
"false").toBoolean
    if (useCustomSchema) {
      val myCustomSchema = jdbcOptions.parameters.getOrElse("customSchema", 
"").toString
      val newSchema = CatalystSqlParser.parseTableSchema(myCustomSchema)
      logInfo(s"Going to return the new $newSchema because useCustomSchema is 
$useCustomSchema and passed in $myCustomSchema")
      newSchema
    } else {
      val tableSchema = JDBCRDD.resolveTable(jdbcOptions)
      jdbcOptions.customSchema match {
      case Some(customSchema) => JdbcUtils.getCustomSchema(
        tableSchema, customSchema, resolver)
      case None => tableSchema
      }
    }{code}
 

This allows the query to run as is, by using the dbtable option and then 
providing a custom schema that bypasses the dialect schema check.

 

Test queries

 
{code:java}
query1 = """ 
SELECT 1 as DummyCOL
"""
query2 = """ 
WITH DummyCTE AS
(
SELECT 1 as DummyCOL
)
SELECT *
FROM DummyCTE
"""
query3 = """
(SELECT *
INTO #Temp1a
FROM
(SELECT @@VERSION as version) data
)
(SELECT *
FROM
#Temp1a)
"""
{code}
 

Test schema

 
{code:java}
schema1 = """
DummyXCOL INT
"""
schema2 = """
DummyXCOL STRING
"""
{code}
 

Test code

 
{code:java}
jdbcDFWorking = (
    spark.read.format("jdbc")
    .option("url", f"jdbc:sqlserver://{server}:{port};databaseName={database};")
    .option("user", user)
    .option("password", password)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", queryx)
    .option("customSchema", schemax)
    .option("useCustomSchema", "true")
    .option("runQueryAsIs", "true")
    .load()
)
 
{code}
 

Currently we ran into this on these two special SQL Server queries, but we 
aren't sure whether other DBs we use will hit this type of issue as well. 
Without going through this I didn't realize the query is always wrapped in a 
SELECT no matter what you do.

This is on Spark 3.1.2, using PySpark with Python 3.7.11.

Thank you for your consideration and assistance in finding a way to fix this.

Kevin

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441205#comment-17441205
 ] 

Apache Spark commented on SPARK-37201:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34535

> Spark SQL reads unnecessary nested fields (filter after explode)
> 
>
> Key: SPARK-37201
> URL: https://issues.apache.org/jira/browse/SPARK-37201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sergey Kotlov
>Priority: Major
>
> In this example, reading unnecessary nested fields still happens.
> Data preparation:
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, array: Seq[String])
> Seq(
>   Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
> ).toDF().write.mode("overwrite").saveAsTable("table")
> {code}
>  v2 and v3 columns aren't needed here, but still exist in the physical plan.
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>  
> == Physical Plan ==
> ... ReadSchema: 
> struct,array:array>
> {code}
> If you just remove _filter_ or move _explode_ to second _select_, everything 
> is fine:
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   //.filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct,array:array>
> spark.table("table")
>   .select($"struct.v1", $"array")
>   .select($"v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct,array:array>
> {code}
>  
> *Yet another example: left_anti join after double select:*
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, field1: String, field2: String)
> Seq(
>   Event(Struct("v1","v2","v3"), "fld1", "fld2")
> ).toDF().write.mode("overwrite").saveAsTable("table")
> val joinDf = Seq("id1").toDF("id")
> spark.table("table")
>   .select("struct", "field1")
>   .select($"struct.v1", $"field1")
>   .join(joinDf, $"field1" === joinDf("id"), "left_anti")
>   .explain(true)
> // ===> ReadSchema: 
> struct,field1:string>
> {code}
> Instead of the first select, it can be other types of manipulations with the 
> original df, for example {color:#00875a}.withColumn("field3", 
> lit("f3")){color} or {color:#00875a}.drop("field2"){color}, which will also 
> lead to reading unnecessary nested fields from _struct_.
> But if you just remove the first select or change type of join, reading 
> nested fields will be correct:
> {code:java}
> // .select("struct", "field1")
> ===> ReadSchema: struct,field1:string>
> .join(joinDf, $"field1" === joinDf("id"), "left")
> ===> ReadSchema: struct,field1:string>
> {code}
> PS: The first select might look strange in the context of this example, but 
> in a real system it might be part of a common API that other parts of the 
> system use with their own expressions on top of this API.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37201:


Assignee: (was: Apache Spark)

> Spark SQL reads unnecessary nested fields (filter after explode)
> 
>
> Key: SPARK-37201
> URL: https://issues.apache.org/jira/browse/SPARK-37201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sergey Kotlov
>Priority: Major
>
> In this example, reading unnecessary nested fields still happens.
> Data preparation:
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, array: Seq[String])
> Seq(
>   Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
> ).toDF().write.mode("overwrite").saveAsTable("table")
> {code}
>  v2 and v3 columns aren't needed here, but still exist in the physical plan.
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>  
> == Physical Plan ==
> ... ReadSchema: 
> struct,array:array>
> {code}
> If you just remove _filter_ or move _explode_ to second _select_, everything 
> is fine:
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   //.filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct,array:array>
> spark.table("table")
>   .select($"struct.v1", $"array")
>   .select($"v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> {code}
>  
> *Yet another example: left_anti join after double select:*
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, field1: String, field2: String)
> Seq(
>   Event(Struct("v1","v2","v3"), "fld1", "fld2")
> ).toDF().write.mode("overwrite").saveAsTable("table")
> val joinDf = Seq("id1").toDF("id")
> spark.table("table")
>   .select("struct", "field1")
>   .select($"struct.v1", $"field1")
>   .join(joinDf, $"field1" === joinDf("id"), "left_anti")
>   .explain(true)
> // ===> ReadSchema: 
> struct<struct:struct<v1:string,v2:string,v3:string>,field1:string>
> {code}
> Instead of the first select, other kinds of manipulations of the original df, 
> for example {color:#00875a}.withColumn("field3", lit("f3")){color} or 
> {color:#00875a}.drop("field2"){color}, will also lead to reading unnecessary 
> nested fields from _struct_.
> But if you just remove the first select or change the type of join, the 
> nested fields are read correctly:
> {code:java}
> // .select("struct", "field1")
> ===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
> .join(joinDf, $"field1" === joinDf("id"), "left")
> ===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
> {code}
> PS: The first select might look strange in the context of this example, but 
> in a real system it might be part of a common API that other parts of the 
> system use, with their own expressions on top of this API.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37201:


Assignee: Apache Spark

> Spark SQL reads unnecessary nested fields (filter after explode)
> 
>
> Key: SPARK-37201
> URL: https://issues.apache.org/jira/browse/SPARK-37201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sergey Kotlov
>Assignee: Apache Spark
>Priority: Major
>
> In this example, reading unnecessary nested fields still happens.
> Data preparation:
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, array: Seq[String])
> Seq(
>   Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
> ).toDF().write.mode("overwrite").saveAsTable("table")
> {code}
>  v2 and v3 columns aren't needed here, but still exist in the physical plan.
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>  
> == Physical Plan ==
> ... ReadSchema: 
> struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>
> {code}
> If you just remove _filter_ or move _explode_ to the second _select_, 
> everything is fine:
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   //.filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> spark.table("table")
>   .select($"struct.v1", $"array")
>   .select($"v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> {code}
>  
> *Yet another example: left_anti join after double select:*
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, field1: String, field2: String)
> Seq(
>   Event(Struct("v1","v2","v3"), "fld1", "fld2")
> ).toDF().write.mode("overwrite").saveAsTable("table")
> val joinDf = Seq("id1").toDF("id")
> spark.table("table")
>   .select("struct", "field1")
>   .select($"struct.v1", $"field1")
>   .join(joinDf, $"field1" === joinDf("id"), "left_anti")
>   .explain(true)
> // ===> ReadSchema: 
> struct<struct:struct<v1:string,v2:string,v3:string>,field1:string>
> {code}
> Instead of the first select, other kinds of manipulations of the original df, 
> for example {color:#00875a}.withColumn("field3", lit("f3")){color} or 
> {color:#00875a}.drop("field2"){color}, will also lead to reading unnecessary 
> nested fields from _struct_.
> But if you just remove the first select or change the type of join, the 
> nested fields are read correctly:
> {code:java}
> // .select("struct", "field1")
> ===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
> .join(joinDf, $"field1" === joinDf("id"), "left")
> ===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
> {code}
> PS: The first select might look strange in the context of this example, but 
> in a real system it might be part of a common API that other parts of the 
> system use, with their own expressions on top of this API.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37258) Add Volcano support in kubernetes-client

2021-11-09 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-37258:

Description: 
We need to add a Volcano extension in [1].

There are some comments on the k8s-client repo that explain how to add support 
for an extension [2]. 

 

And then bump the k8s-client to the latest version.

[1] [https://github.com/fabric8io/kubernetes-client/tree/master/extensions]

[2] 
[https://github.com/fabric8io/kubernetes-client/issues/2565#issuecomment-718150266]

 

  was:
We need to add a volcano in [1].

There were some comments on k8s-client repo to tell us how to add support of 
extension [2]. 

 

[1] [https://github.com/fabric8io/kubernetes-client/tree/master/extensions]

[2] 
[https://github.com/fabric8io/kubernetes-client/issues/2565#issuecomment-718150266]

 


> Add Volcano support in kubernetes-client
> 
>
> Key: SPARK-37258
> URL: https://issues.apache.org/jira/browse/SPARK-37258
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
> We need to add a Volcano extension in [1].
> There are some comments on the k8s-client repo that explain how to add support 
> for an extension [2]. 
>  
> And then bump the k8s-client to the latest version.
> [1] [https://github.com/fabric8io/kubernetes-client/tree/master/extensions]
> [2] 
> [https://github.com/fabric8io/kubernetes-client/issues/2565#issuecomment-718150266]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37258) Add Volcano support in kubernetes-client

2021-11-09 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-37258:
---

 Summary: Add Volcano support in kubernetes-client
 Key: SPARK-37258
 URL: https://issues.apache.org/jira/browse/SPARK-37258
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Yikun Jiang


We need to add a Volcano extension in [1].

There are some comments on the k8s-client repo that explain how to add support 
for an extension [2]. 

 

[1] [https://github.com/fabric8io/kubernetes-client/tree/master/extensions]

[2] 
[https://github.com/fabric8io/kubernetes-client/issues/2565#issuecomment-718150266]

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

2021-11-09 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441189#comment-17441189
 ] 

angerszhu commented on SPARK-37201:
---

Working on this

> Spark SQL reads unnecessary nested fields (filter after explode)
> 
>
> Key: SPARK-37201
> URL: https://issues.apache.org/jira/browse/SPARK-37201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sergey Kotlov
>Priority: Major
>
> In this example, reading unnecessary nested fields still happens.
> Data preparation:
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, array: Seq[String])
> Seq(
>   Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
> ).toDF().write.mode("overwrite").saveAsTable("table")
> {code}
>  v2 and v3 columns aren't needed here, but still exist in the physical plan.
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>  
> == Physical Plan ==
> ... ReadSchema: 
> struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>
> {code}
> If you just remove _filter_ or move _explode_ to the second _select_, 
> everything is fine:
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   //.filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> spark.table("table")
>   .select($"struct.v1", $"array")
>   .select($"v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> {code}
>  
> *Yet another example: left_anti join after double select:*
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, field1: String, field2: String)
> Seq(
>   Event(Struct("v1","v2","v3"), "fld1", "fld2")
> ).toDF().write.mode("overwrite").saveAsTable("table")
> val joinDf = Seq("id1").toDF("id")
> spark.table("table")
>   .select("struct", "field1")
>   .select($"struct.v1", $"field1")
>   .join(joinDf, $"field1" === joinDf("id"), "left_anti")
>   .explain(true)
> // ===> ReadSchema: 
> struct<struct:struct<v1:string,v2:string,v3:string>,field1:string>
> {code}
> Instead of the first select, other kinds of manipulations of the original df, 
> for example {color:#00875a}.withColumn("field3", lit("f3")){color} or 
> {color:#00875a}.drop("field2"){color}, will also lead to reading unnecessary 
> nested fields from _struct_.
> But if you just remove the first select or change the type of join, the 
> nested fields are read correctly:
> {code:java}
> // .select("struct", "field1")
> ===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
> .join(joinDf, $"field1" === joinDf("id"), "left")
> ===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
> {code}
> PS: The first select might look strange in the context of this example, but 
> in a real system it might be part of a common API that other parts of the 
> system use, with their own expressions on top of this API.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37257) Update setup.py for Python 3.10

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441180#comment-17441180
 ] 

Apache Spark commented on SPARK-37257:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34533

> Update setup.py for Python 3.10
> ---
>
> Key: SPARK-37257
> URL: https://issues.apache.org/jira/browse/SPARK-37257
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-37244 makes sure that PySpark works with Python 3.10. We should update 
> setup.py to officially support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37257) Update setup.py for Python 3.10

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37257:


Assignee: Apache Spark

> Update setup.py for Python 3.10
> ---
>
> Key: SPARK-37257
> URL: https://issues.apache.org/jira/browse/SPARK-37257
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-37244 makes sure that PySpark works with Python 3.10. We should update 
> setup.py to officially support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37257) Update setup.py for Python 3.10

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441178#comment-17441178
 ] 

Apache Spark commented on SPARK-37257:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34533

> Update setup.py for Python 3.10
> ---
>
> Key: SPARK-37257
> URL: https://issues.apache.org/jira/browse/SPARK-37257
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-37244 makes sure that PySpark works with Python 3.10. We should update 
> setup.py to officially support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37257) Update setup.py for Python 3.10

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37257:


Assignee: (was: Apache Spark)

> Update setup.py for Python 3.10
> ---
>
> Key: SPARK-37257
> URL: https://issues.apache.org/jira/browse/SPARK-37257
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-37244 makes sure that PySpark works with Python 3.10. We should update 
> setup.py to officially support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37257) Update setup.py for Python 3.10

2021-11-09 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37257:


 Summary: Update setup.py for Python 3.10
 Key: SPARK-37257
 URL: https://issues.apache.org/jira/browse/SPARK-37257
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


SPARK-37244 makes sure that PySpark works with Python 3.10. We should update 
setup.py to officially support Python 3.10.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37256) Replace `ScalaObjectMapper` with `ClassTagExtensions` to fix compilation warnings

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37256:


Assignee: Apache Spark

> Replace `ScalaObjectMapper` with `ClassTagExtensions` to fix compilation 
> warnings
> -
>
> Key: SPARK-37256
> URL: https://issues.apache.org/jira/browse/SPARK-37256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> There are some compilation warnings, as follows:
> {code:java}
> [WARNING] [Warn] 
> /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala:268:
>  [deprecation @ 
> org.apache.spark.sql.catalyst.util.RebaseDateTime.loadRebaseRecords.mapper.$anon
>  | origin=com.fasterxml.jackson.module.scala.ScalaObjectMapper | 
> version=2.12.1] trait ScalaObjectMapper in package scala is deprecated (since 
> 2.12.1): ScalaObjectMapper is deprecated because Manifests are not supported 
> in Scala3 {code}
> We can follow the recommendation from `jackson-module-scala` for the fix:
> {code:java}
> ScalaObjectMapper is deprecated because Manifests are not supported in 
> Scala3, you might want to use ClassTagExtensions as a replacement {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37256) Replace `ScalaObjectMapper` with `ClassTagExtensions` to fix compilation warnings

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37256:


Assignee: (was: Apache Spark)

> Replace `ScalaObjectMapper` with `ClassTagExtensions` to fix compilation 
> warnings
> -
>
> Key: SPARK-37256
> URL: https://issues.apache.org/jira/browse/SPARK-37256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are some compilation warnings, as follows:
> {code:java}
> [WARNING] [Warn] 
> /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala:268:
>  [deprecation @ 
> org.apache.spark.sql.catalyst.util.RebaseDateTime.loadRebaseRecords.mapper.$anon
>  | origin=com.fasterxml.jackson.module.scala.ScalaObjectMapper | 
> version=2.12.1] trait ScalaObjectMapper in package scala is deprecated (since 
> 2.12.1): ScalaObjectMapper is deprecated because Manifests are not supported 
> in Scala3 {code}
> We can follow the recommendation from `jackson-module-scala` for the fix:
> {code:java}
> ScalaObjectMapper is deprecated because Manifests are not supported in 
> Scala3, you might want to use ClassTagExtensions as a replacement {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37256) Replace `ScalaObjectMapper` with `ClassTagExtensions` to fix compilation warnings

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441142#comment-17441142
 ] 

Apache Spark commented on SPARK-37256:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34532

> Replace `ScalaObjectMapper` with `ClassTagExtensions` to fix compilation 
> warnings
> -
>
> Key: SPARK-37256
> URL: https://issues.apache.org/jira/browse/SPARK-37256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are some compilation warnings, as follows:
> {code:java}
> [WARNING] [Warn] 
> /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala:268:
>  [deprecation @ 
> org.apache.spark.sql.catalyst.util.RebaseDateTime.loadRebaseRecords.mapper.$anon
>  | origin=com.fasterxml.jackson.module.scala.ScalaObjectMapper | 
> version=2.12.1] trait ScalaObjectMapper in package scala is deprecated (since 
> 2.12.1): ScalaObjectMapper is deprecated because Manifests are not supported 
> in Scala3 {code}
> We can follow the recommendation from `jackson-module-scala` for the fix:
> {code:java}
> ScalaObjectMapper is deprecated because Manifests are not supported in 
> Scala3, you might want to use ClassTagExtensions as a replacement {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37256) Replace `ScalaObjectMapper` with `ClassTagExtensions` to fix compilation warnings

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441141#comment-17441141
 ] 

Apache Spark commented on SPARK-37256:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34532

> Replace `ScalaObjectMapper` with `ClassTagExtensions` to fix compilation 
> warnings
> -
>
> Key: SPARK-37256
> URL: https://issues.apache.org/jira/browse/SPARK-37256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are some compilation warnings, as follows:
> {code:java}
> [WARNING] [Warn] 
> /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala:268:
>  [deprecation @ 
> org.apache.spark.sql.catalyst.util.RebaseDateTime.loadRebaseRecords.mapper.$anon
>  | origin=com.fasterxml.jackson.module.scala.ScalaObjectMapper | 
> version=2.12.1] trait ScalaObjectMapper in package scala is deprecated (since 
> 2.12.1): ScalaObjectMapper is deprecated because Manifests are not supported 
> in Scala3 {code}
> We can follow the recommendation from `jackson-module-scala` for the fix:
> {code:java}
> ScalaObjectMapper is deprecated because Manifests are not supported in 
> Scala3, you might want to use ClassTagExtensions as a replacement {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37256) Replace `ScalaObjectMapper` with `ClassTagExtensions` to fix compilation warnings

2021-11-09 Thread Yang Jie (Jira)
Yang Jie created SPARK-37256:


 Summary: Replace `ScalaObjectMapper` with `ClassTagExtensions` to 
fix compilation warnings
 Key: SPARK-37256
 URL: https://issues.apache.org/jira/browse/SPARK-37256
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yang Jie


There are some compilation warnings, as follows:
{code:java}
[WARNING] [Warn] 
/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala:268:
 [deprecation @ 
org.apache.spark.sql.catalyst.util.RebaseDateTime.loadRebaseRecords.mapper.$anon
 | origin=com.fasterxml.jackson.module.scala.ScalaObjectMapper | 
version=2.12.1] trait ScalaObjectMapper in package scala is deprecated (since 
2.12.1): ScalaObjectMapper is deprecated because Manifests are not supported in 
Scala3 {code}
We can follow the recommendation from `jackson-module-scala` for the fix:
{code:java}
ScalaObjectMapper is deprecated because Manifests are not supported in Scala3, 
you might want to use ClassTagExtensions as a replacement {code}
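For illustration, here is a small sketch of what the suggested replacement looks like with 
jackson-module-scala. The record class and JSON below are made-up examples, not the actual 
Spark code that needs to change:

{code:java}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.{ClassTagExtensions, DefaultScalaModule}

// Before (deprecated): val mapper = new ObjectMapper() with ScalaObjectMapper
// After: mix in ClassTagExtensions, which relies on ClassTags instead of Manifests.
val mapper = new ObjectMapper() with ClassTagExtensions
mapper.registerModule(DefaultScalaModule)

// Made-up record and JSON, only to show that readValue[T] keeps the same call shape.
case class Record(name: String, values: Seq[Long])
val parsed = mapper.readValue[Record]("""{"name":"a","values":[1,2,3]}""")
{code}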
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25224) Improvement of Spark SQL ThriftServer memory management

2021-11-09 Thread ramakrishna chilaka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441131#comment-17441131
 ] 

ramakrishna chilaka commented on SPARK-25224:
-

Can anyone please confirm whether there are any plans to revive this? Thanks.

> Improvement of Spark SQL ThriftServer memory management
> ---
>
> Key: SPARK-25224
> URL: https://issues.apache.org/jira/browse/SPARK-25224
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dooyoung Hwang
>Priority: Major
>  Labels: bulk-closed
>
> Spark SQL has just two options for managing Thrift Server memory - enabling 
> spark.sql.thriftServer.incrementalCollect or not.
> *1. The case of enabling spark.sql.thriftServer.incrementalCollect*
> *1) Pros :* the Thrift Server can handle large output without OOM.
> *2) Cons*
>  * Performance degradation because tasks are executed partition by partition.
>  * Queries with a count limit are handled inefficiently because all partitions 
> are executed. (executeTake stops scanning after collecting the count limit.)
>  * Cannot cache the result for FETCH_FIRST.
> *2. The case of disabling spark.sql.thriftServer.incrementalCollect*
> *1) Pros :* Good performance for small output.
> *2) Cons*
>  * Peak memory usage is too large because decompressed & deserialized rows are 
> allocated in a "batch" manner, and OOM could occur for large output.
>  * It is difficult to measure the peak memory usage of a query, so configuring 
> spark.driver.maxResultSize is very difficult.
>  * If decompressed & deserialized rows fill up the eden area of the JVM heap, 
> they move to the old gen and could increase the possibility of a "Full GC" 
> that stops the world.
>  
> The improvement idea is below:
>  # *DataSet does not decompress & deserialize the result, and just returns the 
> total row count & an iterator to the SQL executor.* By doing that, only the 
> compressed data resides in memory, so the memory usage is not only much lower 
> than before but is also configurable via spark.driver.maxResultSize.
>  # *After the SQL executor gets the total row count & iterator from the 
> DataSet, it can decide, based on the returned row count, whether to collect 
> the rows in a batch manner (appropriate for a small row count) or to 
> deserialize and send them iteratively (appropriate for a large row count).*
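To make point 2 of the proposal a bit more concrete, here is a rough, purely hypothetical 
Scala sketch of the "decide by row count" idea. None of these names come from the actual 
Spark Thrift Server code, and the threshold is an arbitrary placeholder:

{code:java}
// Hypothetical sketch: batch for small results, streaming for large ones,
// so that peak driver memory stays bounded.
object ResultServingSketch {
  val smallResultThreshold = 10000L // placeholder, not a real Spark config

  def sendBatch(rows: Array[String]): Unit = println(s"sent batch of ${rows.length} rows")
  def sendOne(row: String): Unit = println(s"sent row: $row")

  def serveResult(totalRowCount: Long, rows: Iterator[String]): Unit = {
    if (totalRowCount <= smallResultThreshold) {
      // Small output: materialize and send everything at once (fast path).
      sendBatch(rows.toArray)
    } else {
      // Large output: deserialize and send row by row (bounded memory).
      rows.foreach(sendOne)
    }
  }
}
{code}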



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37253) try_simplify_traceback should not fail when tb_frame.f_lineno is None

2021-11-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37253.
--
Fix Version/s: 3.3.0
   3.2.1
   3.1.3
   Resolution: Fixed

Issue resolved by pull request 34530
[https://github.com/apache/spark/pull/34530]

> try_simplify_traceback should not fail when tb_frame.f_lineno is None
> -
>
> Key: SPARK-37253
> URL: https://issues.apache.org/jira/browse/SPARK-37253
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
>
> {code}
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/worker.py",
>  line 630, in main
> tb = try_simplify_traceback(sys.exc_info()[-1])
>   File 
> "/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/util.py", 
> line 217, in try_simplify_traceback
> new_tb = types.TracebackType(
> TypeError: 'NoneType' object cannot be interpreted as an integer
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37253) try_simplify_traceback should not fail when tb_frame.f_lineno is None

2021-11-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37253:


Assignee: Apache Spark

> try_simplify_traceback should not fail when tb_frame.f_lineno is None
> -
>
> Key: SPARK-37253
> URL: https://issues.apache.org/jira/browse/SPARK-37253
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/worker.py",
>  line 630, in main
> tb = try_simplify_traceback(sys.exc_info()[-1])
>   File 
> "/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/util.py", 
> line 217, in try_simplify_traceback
> new_tb = types.TracebackType(
> TypeError: 'NoneType' object cannot be interpreted as an integer
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37255) When Used with PyHive (by dropbox) query timeout doesn't result in propagation to the UI

2021-11-09 Thread ramakrishna chilaka (Jira)
ramakrishna chilaka created SPARK-37255:
---

 Summary: When Used with PyHive (by dropbox) query timeout doesn't 
result in propagation to the UI
 Key: SPARK-37255
 URL: https://issues.apache.org/jira/browse/SPARK-37255
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: ramakrishna chilaka


When we run a large query and it is timed out and cancelled by the Spark Thrift 
Server, PyHive doesn't show that the query was cancelled. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37254) 100% CPU usage on Spark Thrift Server.

2021-11-09 Thread ramakrishna chilaka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramakrishna chilaka updated SPARK-37254:

Description: 
We are trying to use the Spark Thrift Server as a distributed SQL query engine. 
Queries work when the resident memory occupied by the Spark Thrift Server (as 
observed through htop) is comparatively lower than the driver memory. The same 
queries go to 100% CPU usage when the resident memory occupied by the Spark 
Thrift Server is greater than the configured driver memory, and they keep 
running at 100% CPU usage. I am using incremental collect set to false, as I 
need faster responses for exploratory queries. I am trying to understand the 
following points:
 * Why isn't the Spark Thrift Server releasing memory back when there are no 
queries?
 * What is causing the Spark Thrift Server to go to 100% CPU usage on all 
cores when its memory is greater than the driver memory (usually by about 
10%), and why are the queries just stuck?

  was:
We are trying to use Spark thrift server as a distributed sql query engine, the 
queries work when the resident memory occupied by Spark thrift server 
identified through HTOP is comparatively less than the driver memory. The same 
queries result in 100% cpu usage when the resident memory occupied by spark 
thrift server is greater than the configured driver memory. I am using 
incremental collect as false, as i need faster responses for exploratory 
queries. I am trying to understand the following points
 * Why isn't spark thrift server releasing back the memory, when there are no 
queries. 
 * What is causing spark thrift server to go into 100% cpu usage on all the 
cores and why are queries just stuck.


> 100% CPU usage on Spark Thrift Server.
> --
>
> Key: SPARK-37254
> URL: https://issues.apache.org/jira/browse/SPARK-37254
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: ramakrishna chilaka
>Priority: Major
>
> We are trying to use the Spark Thrift Server as a distributed SQL query engine. 
> Queries work when the resident memory occupied by the Spark Thrift Server (as 
> observed through htop) is comparatively lower than the driver memory. The same 
> queries go to 100% CPU usage when the resident memory occupied by the Spark 
> Thrift Server is greater than the configured driver memory, and they keep 
> running at 100% CPU usage. I am using incremental collect set to false, as I 
> need faster responses for exploratory queries. I am trying to understand the 
> following points:
>  * Why isn't the Spark Thrift Server releasing memory back when there are no 
> queries?
>  * What is causing the Spark Thrift Server to go to 100% CPU usage on all 
> cores when its memory is greater than the driver memory (usually by about 
> 10%), and why are the queries just stuck?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37254) 100% CPU usage on Spark Thrift Server.

2021-11-09 Thread ramakrishna chilaka (Jira)
ramakrishna chilaka created SPARK-37254:
---

 Summary: 100% CPU usage on Spark Thrift Server.
 Key: SPARK-37254
 URL: https://issues.apache.org/jira/browse/SPARK-37254
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: ramakrishna chilaka


We are trying to use the Spark Thrift Server as a distributed SQL query engine. 
Queries work when the resident memory occupied by the Spark Thrift Server (as 
observed through htop) is comparatively lower than the driver memory. The same 
queries go to 100% CPU usage when the resident memory occupied by the Spark 
Thrift Server is greater than the configured driver memory. I am using 
incremental collect set to false, as I need faster responses for exploratory 
queries. I am trying to understand the following points:
 * Why isn't the Spark Thrift Server releasing memory back when there are no 
queries?
 * What is causing the Spark Thrift Server to go to 100% CPU usage on all 
cores, and why are the queries just stuck?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37239) Avoid unnecessary `setReplication` in Yarn mode

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441038#comment-17441038
 ] 

Apache Spark commented on SPARK-37239:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34531

> Avoid unnecessary `setReplication` in Yarn mode
> ---
>
> Key: SPARK-37239
> URL: https://issues.apache.org/jira/browse/SPARK-37239
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.1.2
>Reporter: wang-zhun
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.3.0
>
>
> We found a large number of replication log messages on the HDFS server:
> ```
> 2021-11-04,17:22:13,065 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains 
> unchanged at 3 for 
> xxx/.sparkStaging/application_1635470728320_1144379/__spark_libs__303253482044663796.zip
> 2021-11-04,17:22:13,069 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains 
> unchanged at 3 for 
> xxx/.sparkStaging/application_1635470728320_1144383/__spark_libs__4747402134564993861.zip
> 2021-11-04,17:22:13,070 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains 
> unchanged at 3 for 
> xxx/.sparkStaging/application_1635470728320_1144373/__spark_libs__4377509773730188331.zip
> ```
> https://github.com/apache/hadoop/blob/6f7b965808f71f44e2617c50d366a6375fdfbbfa/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2439
>   
> `setReplication` needs to acquire a write lock, so we should avoid this 
> unnecessary operation.
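A minimal sketch of the kind of guard this issue suggests, using the standard Hadoop 
FileSystem API. This is only an illustration of the idea (with a generic example path), 
not the actual change in Spark's Yarn client:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Only call setReplication when the file's replication factor actually differs,
// so the NameNode does not take a write lock for a no-op request.
def ensureReplication(fs: FileSystem, path: Path, desired: Short): Unit = {
  if (fs.getFileStatus(path).getReplication != desired) {
    fs.setReplication(path, desired)
  }
}

val fs = FileSystem.get(new Configuration())
ensureReplication(fs, new Path("/tmp/example-file"), 3.toShort)
{code}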



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37253) try_simplify_traceback should not fail when tb_frame.f_lineno is None

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37253:
--
Description: 
{code}
Traceback (most recent call last):
  File 
"/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/worker.py", 
line 630, in main
tb = try_simplify_traceback(sys.exc_info()[-1])
  File 
"/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/util.py", 
line 217, in try_simplify_traceback
new_tb = types.TracebackType(
TypeError: 'NoneType' object cannot be interpreted as an integer
{code}

  was:
{code}
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/worker.py", 
line 630, in main
tb = try_simplify_traceback(sys.exc_info()[-1])
  File 
"/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/util.py", 
line 217, in try_simplify_traceback
new_tb = types.TracebackType(
TypeError: 'NoneType' object cannot be interpreted as an integer
{code}


> try_simplify_traceback should not fail when tb_frame.f_lineno is None
> -
>
> Key: SPARK-37253
> URL: https://issues.apache.org/jira/browse/SPARK-37253
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/worker.py",
>  line 630, in main
> tb = try_simplify_traceback(sys.exc_info()[-1])
>   File 
> "/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/util.py", 
> line 217, in try_simplify_traceback
> new_tb = types.TracebackType(
> TypeError: 'NoneType' object cannot be interpreted as an integer
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37253) try_simplify_traceback should not fail when tb_frame.f_lineno is None

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37253:
--
Description: 
{code}
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/worker.py", 
line 630, in main
tb = try_simplify_traceback(sys.exc_info()[-1])
  File 
"/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/util.py", 
line 217, in try_simplify_traceback
new_tb = types.TracebackType(
TypeError: 'NoneType' object cannot be interpreted as an integer
{code}

> try_simplify_traceback should not fail when tb_frame.f_lineno is None
> -
>
> Key: SPARK-37253
> URL: https://issues.apache.org/jira/browse/SPARK-37253
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/worker.py",
>  line 630, in main
> tb = try_simplify_traceback(sys.exc_info()[-1])
>   File 
> "/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/util.py", 
> line 217, in try_simplify_traceback
> new_tb = types.TracebackType(
> TypeError: 'NoneType' object cannot be interpreted as an integer
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37253) try_simplify_traceback should not fail when tb_frame.f_lineno is None

2021-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37253:
--
Affects Version/s: 3.2.0
   3.1.2

> try_simplify_traceback should not fail when tb_frame.f_lineno is None
> -
>
> Key: SPARK-37253
> URL: https://issues.apache.org/jira/browse/SPARK-37253
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37253) try_simplify_traceback should not fail when tb_frame.f_lineno is None

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440998#comment-17440998
 ] 

Apache Spark commented on SPARK-37253:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34530

> try_simplify_traceback should not fail when tb_frame.f_lineno is None
> -
>
> Key: SPARK-37253
> URL: https://issues.apache.org/jira/browse/SPARK-37253
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37253) try_simplify_traceback should not fail when tb_frame.f_lineno is None

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37253:


Assignee: (was: Apache Spark)

> try_simplify_traceback should not fail when tb_frame.f_lineno is None
> -
>
> Key: SPARK-37253
> URL: https://issues.apache.org/jira/browse/SPARK-37253
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37253) try_simplify_traceback should not fail when tb_frame.f_lineno is None

2021-11-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37253:


Assignee: Apache Spark

> try_simplify_traceback should not fail when tb_frame.f_lineno is None
> -
>
> Key: SPARK-37253
> URL: https://issues.apache.org/jira/browse/SPARK-37253
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37253) try_simplify_traceback should not fail when tb_frame.f_lineno is None

2021-11-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440996#comment-17440996
 ] 

Apache Spark commented on SPARK-37253:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34530

> try_simplify_traceback should not fail when tb_frame.f_lineno is None
> -
>
> Key: SPARK-37253
> URL: https://issues.apache.org/jira/browse/SPARK-37253
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37253) try_simplify_traceback should not fail when tb_frame.f_lineno is None

2021-11-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37253:
-

 Summary: try_simplify_traceback should not fail when 
tb_frame.f_lineno is None
 Key: SPARK-37253
 URL: https://issues.apache.org/jira/browse/SPARK-37253
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org