kuwii commented on PR #39190:
URL: https://github.com/apache/spark/pull/39190#issuecomment-1398579998

   I'm not familiar with how Spark creates and runs jobs and stages for a query, but I think it may be related to this case. I can reproduce it locally in Spark on YARN mode with this code:
   
   ```python
   from pyspark import SparkConf, SparkContext
   from pyspark.sql import SQLContext

   conf = SparkConf().setAppName('test')
   sc = SparkContext(conf=conf)
   spark = SQLContext(sc).sparkSession

   spark.range(1, 100).count()
   ```
   
   The execution of `count` creates 2 jobs: job 0 with stage 0, and job 1 with stages 1 and 2.
   
   
![image](https://user-images.githubusercontent.com/10705175/213734447-2b1748e2-f073-4d68-b2b0-7793fbd80ca0.png)
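
   For reference, this layout can also be checked programmatically. A minimal sketch using PySpark's status tracker, assuming the job IDs are 0 and 1 as in the screenshot and that the jobs are still retained by the UI:

   ```python
   # Inspect which stages belong to each job via the driver's status tracker.
   tracker = sc.statusTracker()
   for job_id in (0, 1):  # assumed job IDs, matching the screenshot above
       info = tracker.getJobInfo(job_id)  # may be None if the job was evicted
       if info is not None:
           print(job_id, info.stageIds, info.status)
   ```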
   
   Because its shuffle map output is already available (it was computed by stage 0 in job 0), stage 1 will always be skipped and never even submitted.
   
   
![image](https://user-images.githubusercontent.com/10705175/213736105-c5d0eedc-ed0a-4f23-933b-eebe34244db5.png)
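
   The skipped stage is also visible through the driver's REST API. A minimal sketch, assuming the driver UI is reachable directly at port 4040 (on YARN it normally sits behind the resource manager proxy):

   ```python
   import json
   from urllib.request import urlopen

   # List all stages of this application; stage 1 should report status SKIPPED.
   app_id = sc.applicationId
   url = f"http://localhost:4040/api/v1/applications/{app_id}/stages"
   with urlopen(url) as resp:
       for stage in json.load(resp):
           print(stage["stageId"], stage["status"])
   ```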
   
   This is the case mentioned in the PR's description. And because of the incorrect logic for updating `numActiveStages`, the value ends up as `-1` in the jobs API. This PR fixes it.
   
   
![image](https://user-images.githubusercontent.com/10705175/213740564-47b6e6eb-8d09-4eca-a340-3a98c912c69a.png)
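
   The field can be checked the same way through the jobs endpoint (same assumption about the UI address as above). Without the fix, `numActiveStages` goes negative for the job whose stage was skipped without being submitted:

   ```python
   import json
   from urllib.request import urlopen

   # Print numActiveStages per job; it should never go below 0 once fixed.
   app_id = sc.applicationId
   url = f"http://localhost:4040/api/v1/applications/{app_id}/jobs"
   with urlopen(url) as resp:
       for job in json.load(resp):
           print(job["jobId"], job["numActiveStages"], job["status"])
   ```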

