[jira] [Created] (SPARK-42834) Divided by zero occurs in PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse

2023-03-17 Thread Li Ying (Jira)
Li Ying created SPARK-42834:
---

 Summary: Divided by zero occurs in 
PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse
 Key: SPARK-42834
 URL: https://issues.apache.org/jira/browse/SPARK-42834
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.2.0
Reporter: Li Ying


Sometimes when running a SQL job with push-based shuffle, the exception below 
occurs. It seems that there is no element in the bitmaps array, which stores 
the merged chunk metadata. See 
org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse.

Is this a bug, i.e. should createChunkBlockInfos not be called when bitmaps is 
empty, or should the bitmaps never be empty here?
 
{code:java}
Caused by: java.lang.ArithmeticException: / by zero
at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:117)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:980)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
{code}
Related code:
{code:java}
def createChunkBlockInfosFromMetaResponse(
    shuffleId: Int,
    shuffleMergeId: Int,
    reduceId: Int,
    blockSize: Long,
    bitmaps: Array[RoaringBitmap]): ArrayBuffer[(BlockId, Long, Int)] = {
  val approxChunkSize = blockSize / bitmaps.length
  val blocksToFetch = new ArrayBuffer[(BlockId, Long, Int)]()
  for (i <- bitmaps.indices) {
    val blockChunkId = ShuffleBlockChunkId(shuffleId, shuffleMergeId, reduceId, i)
    chunksMetaMap.put(blockChunkId, bitmaps(i))
    logDebug(s"adding block chunk $blockChunkId of size $approxChunkSize")
    blocksToFetch += ((blockChunkId, approxChunkSize, SHUFFLE_PUSH_MAP_ID))
  }
  blocksToFetch
}
{code}
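
For illustration only, below is a minimal, self-contained sketch of the guard being asked about: it evaluates the per-chunk size only when at least one bitmap is present, so blockSize / 0 never happens. This is not the actual Spark code or an agreed fix; the Spark-internal pieces (RoaringBitmap, BlockId, chunksMetaMap, logDebug, the real SHUFFLE_PUSH_MAP_ID value) are replaced with simple stand-ins so the snippet compiles on its own, and bitmapCount stands in for bitmaps.length.
{code:java}
import scala.collection.mutable.ArrayBuffer

// Stand-in for the Spark-internal block id type, assumed here only for the sketch.
case class ShuffleBlockChunkId(shuffleId: Int, shuffleMergeId: Int, reduceId: Int, chunkId: Int)

object EmptyBitmapGuardSketch {
  // Placeholder constant; the real value lives inside Spark's shuffle code.
  val SHUFFLE_PUSH_MAP_ID: Int = -1

  // Hypothetical guarded variant of the logic above: when bitmapCount is zero,
  // simply return an empty result instead of dividing by zero.
  def createChunkBlockInfos(
      shuffleId: Int,
      shuffleMergeId: Int,
      reduceId: Int,
      blockSize: Long,
      bitmapCount: Int): ArrayBuffer[(ShuffleBlockChunkId, Long, Int)] = {
    val blocksToFetch = new ArrayBuffer[(ShuffleBlockChunkId, Long, Int)]()
    if (bitmapCount > 0) {
      // The division is only evaluated when there is at least one bitmap.
      val approxChunkSize = blockSize / bitmapCount
      for (i <- 0 until bitmapCount) {
        blocksToFetch += ((ShuffleBlockChunkId(shuffleId, shuffleMergeId, reduceId, i),
          approxChunkSize, SHUFFLE_PUSH_MAP_ID))
      }
    }
    // When bitmapCount == 0 this stays empty: nothing to fetch, no ArithmeticException.
    blocksToFetch
  }

  def main(args: Array[String]): Unit = {
    // With zero bitmaps the guarded version returns an empty buffer rather than
    // throwing java.lang.ArithmeticException: / by zero.
    println(createChunkBlockInfos(1, 0, 3, 1024L, bitmapCount = 0))
    println(createChunkBlockInfos(1, 0, 3, 1024L, bitmapCount = 4))
  }
}
{code}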



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42834) Divided by zero occurs in PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse

2023-03-17 Thread Li Ying (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Ying updated SPARK-42834:

Description: 
Sometimes when running a SQL job with push-based shuffle, the exception below 
occurs. It seems that there is no element in the bitmaps array, which stores 
the merged chunk metadata.

Is this a bug, i.e. should createChunkBlockInfos not be called when bitmaps is 
empty, or should the bitmaps never be empty here?
 
{code:java}
Caused by: java.lang.ArithmeticException: / by zero
at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:117)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:980)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
{code}
Related code:
{code:java}
def createChunkBlockInfosFromMetaResponse(
    shuffleId: Int,
    shuffleMergeId: Int,
    reduceId: Int,
    blockSize: Long,
    bitmaps: Array[RoaringBitmap]): ArrayBuffer[(BlockId, Long, Int)] = {
  val approxChunkSize = blockSize / bitmaps.length
  val blocksToFetch = new ArrayBuffer[(BlockId, Long, Int)]()
  for (i <- bitmaps.indices) {
    val blockChunkId = ShuffleBlockChunkId(shuffleId, shuffleMergeId, reduceId, i)
    chunksMetaMap.put(blockChunkId, bitmaps(i))
    logDebug(s"adding block chunk $blockChunkId of size $approxChunkSize")
    blocksToFetch += ((blockChunkId, approxChunkSize, SHUFFLE_PUSH_MAP_ID))
  }
  blocksToFetch
}
{code}

  was:
Sometimes when running a SQL job with push-based shuffle, the exception below 
occurs. It seems that there is no element in the bitmaps array, which stores 
the merged chunk metadata. See 
org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse.

Is this a bug, i.e. should createChunkBlockInfos not be called when bitmaps is 
empty, or should the bitmaps never be empty here?
 
{code:java}
Caused by: java.lang.ArithmeticException: / by zero
at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:117)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:980)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
{code}
Related code:
{code:java}
def createChunkBlockInfosFromMetaResponse(
    shuffleId: Int,
    shuffleMergeId: Int,
    reduceId: Int,
    blockSize: Long,
    bitmaps: Array[RoaringBitmap]): ArrayBuffer[(BlockId, Long, Int)] = {
  val approxChunkSize = blockSize / bitmaps.length
  val blocksToFetch = new ArrayBuffer[(BlockId, Long, Int)]()
  for (i <- bitmaps.indices) {
    val blockChunkId = ShuffleBlockChunkId(shuffleId, shuffleMergeId, reduceId, i)
    chunksMetaMap.put(blockChunkId, bitmaps(i))
    logDebug(s"adding block chunk $blockChunkId of size $approxChunkSize")
    blocksToFetch += ((blockChunkId, approxChunkSize, SHUFFLE_PUSH_MAP_ID))
  }
  blocksToFetch
}
{code}


> Divided by zero occurs in 
> PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse
> 
>
> Key: SPARK-42834
> URL: https://issues.apache.org/jira/browse/SPARK-42834
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Li Ying
>Priority: Major
>
> Sometimes when running a SQL job with push-based shuffle, the exception 
> below occurs. It seems that there is no element in the bitmaps array, which 
> stores the merged chunk metadata.
> Is this a bug, i.e. should createChunkBlockInfos not be called when bitmaps 
> is empty, or should the bitmaps never be empty here?
>  
> {code:java}
> Caused by: java.lang.ArithmeticException: / by zero
> at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:117)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:980)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
> {code}
> Related code:
> {code:java}
> def createChunkBlockInfosFromMetaResponse(
>     shuffleId: Int,
>     shuffleMergeId: Int,
>     reduceId: Int,
>     blockSize: Long,
>     bitmaps: Array[RoaringBitmap]): ArrayBuffer[(BlockId, Long, Int)] = {
>   val approxChunkSize = blockSize / bitmaps.length
>   val blocksToFetch = new ArrayBuffer[(BlockId, Long, Int)]()
>   for (i <- bitmaps.indices) {
>     val blockChunkId = ShuffleBlockChunkId(shuffleId, shuffleMergeId, reduceId, i)
>     chunksMetaMap.put(blockChunkId, bitmaps(i))
>     logDebug(s"adding block chunk $blockChunkId of size $approxChunkSize")
>     blocksToFetch += ((blockChunkId, approxChunkSize, SHUFFLE_PUSH_MAP_ID))
>   }
>   blocksToFetch
> }
> {code}

[jira] [Commented] (SPARK-42834) Divided by zero occurs in PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse

2023-03-17 Thread Li Ying (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701610#comment-17701610
 ] 

Li Ying commented on SPARK-42834:
-

[~csingh] Could you please help confirm this?

> Divided by zero occurs in 
> PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse
> 
>
> Key: SPARK-42834
> URL: https://issues.apache.org/jira/browse/SPARK-42834
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Li Ying
>Priority: Major
>
> Sometimes when running a SQL job with push-based shuffle, the exception 
> below occurs. It seems that there is no element in the bitmaps array, which 
> stores the merged chunk metadata.
> Is this a bug, i.e. should createChunkBlockInfos not be called when bitmaps 
> is empty, or should the bitmaps never be empty here?
>  
> {code:java}
> Caused by: java.lang.ArithmeticException: / by zero
> at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:117)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:980)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
> {code}
> Related code:
> {code:java}
> def createChunkBlockInfosFromMetaResponse(
>     shuffleId: Int,
>     shuffleMergeId: Int,
>     reduceId: Int,
>     blockSize: Long,
>     bitmaps: Array[RoaringBitmap]): ArrayBuffer[(BlockId, Long, Int)] = {
>   val approxChunkSize = blockSize / bitmaps.length
>   val blocksToFetch = new ArrayBuffer[(BlockId, Long, Int)]()
>   for (i <- bitmaps.indices) {
>     val blockChunkId = ShuffleBlockChunkId(shuffleId, shuffleMergeId, reduceId, i)
>     chunksMetaMap.put(blockChunkId, bitmaps(i))
>     logDebug(s"adding block chunk $blockChunkId of size $approxChunkSize")
>     blocksToFetch += ((blockChunkId, approxChunkSize, SHUFFLE_PUSH_MAP_ID))
>   }
>   blocksToFetch
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42834) Divided by zero occurs in PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse

2023-03-17 Thread Li Ying (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702050#comment-17702050
 ] 

Li Ying commented on SPARK-42834:
-

[~csingh] Thanks for the help. I will take this fix :)

> Divided by zero occurs in 
> PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse
> 
>
> Key: SPARK-42834
> URL: https://issues.apache.org/jira/browse/SPARK-42834
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Li Ying
>Priority: Major
>
> Sometimes when running a SQL job with push-based shuffle, the exception 
> below occurs. It seems that there is no element in the bitmaps array, which 
> stores the merged chunk metadata.
> Is this a bug, i.e. should createChunkBlockInfos not be called when bitmaps 
> is empty, or should the bitmaps never be empty here?
>  
> {code:java}
> Caused by: java.lang.ArithmeticException: / by zero
> at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:117)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:980)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
> {code}
> Related code:
> {code:java}
> def createChunkBlockInfosFromMetaResponse(
>     shuffleId: Int,
>     shuffleMergeId: Int,
>     reduceId: Int,
>     blockSize: Long,
>     bitmaps: Array[RoaringBitmap]): ArrayBuffer[(BlockId, Long, Int)] = {
>   val approxChunkSize = blockSize / bitmaps.length
>   val blocksToFetch = new ArrayBuffer[(BlockId, Long, Int)]()
>   for (i <- bitmaps.indices) {
>     val blockChunkId = ShuffleBlockChunkId(shuffleId, shuffleMergeId, reduceId, i)
>     chunksMetaMap.put(blockChunkId, bitmaps(i))
>     logDebug(s"adding block chunk $blockChunkId of size $approxChunkSize")
>     blocksToFetch += ((blockChunkId, approxChunkSize, SHUFFLE_PUSH_MAP_ID))
>   }
>   blocksToFetch
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-42834) Divided by zero occurs in PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse

2023-03-17 Thread Li Ying (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Ying closed SPARK-42834.
---

> Divided by zero occurs in 
> PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse
> 
>
> Key: SPARK-42834
> URL: https://issues.apache.org/jira/browse/SPARK-42834
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Li Ying
>Priority: Major
>
> Sometimes when running a SQL job with push-based shuffle, the exception 
> below occurs. It seems that there is no element in the bitmaps array, which 
> stores the merged chunk metadata.
> Is this a bug, i.e. should createChunkBlockInfos not be called when bitmaps 
> is empty, or should the bitmaps never be empty here?
>  
> {code:java}
> Caused by: java.lang.ArithmeticException: / by zero
> at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:117)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:980)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
> {code}
> Related code:
> {code:java}
> def createChunkBlockInfosFromMetaResponse(
>     shuffleId: Int,
>     shuffleMergeId: Int,
>     reduceId: Int,
>     blockSize: Long,
>     bitmaps: Array[RoaringBitmap]): ArrayBuffer[(BlockId, Long, Int)] = {
>   val approxChunkSize = blockSize / bitmaps.length
>   val blocksToFetch = new ArrayBuffer[(BlockId, Long, Int)]()
>   for (i <- bitmaps.indices) {
>     val blockChunkId = ShuffleBlockChunkId(shuffleId, shuffleMergeId, reduceId, i)
>     chunksMetaMap.put(blockChunkId, bitmaps(i))
>     logDebug(s"adding block chunk $blockChunkId of size $approxChunkSize")
>     blocksToFetch += ((blockChunkId, approxChunkSize, SHUFFLE_PUSH_MAP_ID))
>   }
>   blocksToFetch
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42834) Divided by zero occurs in PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse

2023-03-17 Thread Li Ying (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Ying resolved SPARK-42834.
-
Resolution: Won't Do

> Divided by zero occurs in 
> PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse
> 
>
> Key: SPARK-42834
> URL: https://issues.apache.org/jira/browse/SPARK-42834
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Li Ying
>Priority: Major
>
> Sometimes when running a SQL job with push-based shuffle, the exception 
> below occurs. It seems that there is no element in the bitmaps array, which 
> stores the merged chunk metadata.
> Is this a bug, i.e. should createChunkBlockInfos not be called when bitmaps 
> is empty, or should the bitmaps never be empty here?
>  
> {code:java}
> Caused by: java.lang.ArithmeticException: / by zero
> at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:117)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:980)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
> {code}
> Related code:
> {code:java}
> def createChunkBlockInfosFromMetaResponse(
>     shuffleId: Int,
>     shuffleMergeId: Int,
>     reduceId: Int,
>     blockSize: Long,
>     bitmaps: Array[RoaringBitmap]): ArrayBuffer[(BlockId, Long, Int)] = {
>   val approxChunkSize = blockSize / bitmaps.length
>   val blocksToFetch = new ArrayBuffer[(BlockId, Long, Int)]()
>   for (i <- bitmaps.indices) {
>     val blockChunkId = ShuffleBlockChunkId(shuffleId, shuffleMergeId, reduceId, i)
>     chunksMetaMap.put(blockChunkId, bitmaps(i))
>     logDebug(s"adding block chunk $blockChunkId of size $approxChunkSize")
>     blocksToFetch += ((blockChunkId, approxChunkSize, SHUFFLE_PUSH_MAP_ID))
>   }
>   blocksToFetch
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38973) When push-based shuffle is enabled, a stage may not complete when retried

2023-03-20 Thread Li Ying (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702556#comment-17702556
 ] 

Li Ying commented on SPARK-38973:
-

[~csingh] Should this bugfix be merged into 3.2.x branches?

> When push-based shuffle is enabled, a stage may not complete when retried
> -
>
> Key: SPARK-38973
> URL: https://issues.apache.org/jira/browse/SPARK-38973
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.3.0
>
>
> With push-based shuffle enabled and adaptive merge finalization, there are 
> scenarios where a re-attempt of a ShuffleMapStage may not complete.
> With adaptive merge finalization, a stage may be triggered for finalization 
> when it is in the below state:
>  # The stage is *not* running (*not* in the _running_ set of the 
> DAGScheduler), i.e. it has failed, been canceled, or is waiting, and
>  # The stage has no pending partitions (all the tasks completed at least 
> once).
> For such a stage, when the finalization completes, the stage will still not 
> be marked as _mergeFinalized_.
> The state of the stage will be:
>  * _stage.shuffleDependency.mergeFinalized = false_
>  * _stage.shuffleDependency.getFinalizeTask = finalizeTask_
>  * Merged statuses of the stage are unregistered
>  
> When the stage is resubmitted, the newer attempt of the stage will never 
> complete even though its tasks may be completed. This is because the newer 
> attempt of the stage will have _shuffleMergeEnabled = true_, since with 
> the previous attempt the stage was never marked as _mergeFinalized_, and 
> the _finalizeTask_ is present (from the finalization attempt of the previous 
> stage attempt).
>  
> So, when all the tasks of the newer attempt complete, these conditions 
> will be true:
>  * The stage will be running
>  * There will be no pending partitions, since all the tasks completed
>  * _stage.shuffleDependency.shuffleMergeEnabled = true_
>  * _stage.shuffleDependency.shuffleMergeFinalized = false_
>  * _stage.shuffleDependency.getFinalizeTask_ is not empty
> This leads the DAGScheduler to try scheduling finalization instead of 
> triggering the completion of the stage. However, because of the last 
> condition, it never even schedules the finalization, and the stage never 
> completes.
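
For readers trying to follow the scenario above, here is a tiny, self-contained Scala model of the described decision logic. It is not the DAGScheduler code; every name in it (StageState, canCompleteStage, shouldScheduleFinalization) is invented for illustration, and the two rules are assumptions that only encode the conditions listed in the description, to show why the retried attempt neither completes nor schedules a new finalization.
{code:java}
// Toy model of the conditions described above; all names are invented for illustration.
case class StageState(
    running: Boolean,
    pendingPartitions: Int,
    shuffleMergeEnabled: Boolean,
    shuffleMergeFinalized: Boolean,
    finalizeTaskPresent: Boolean)

object StageRetryHangSketch {
  // Assumed rule: a push-based-shuffle stage only completes once merge finalization is done.
  def canCompleteStage(s: StageState): Boolean =
    s.pendingPartitions == 0 && (!s.shuffleMergeEnabled || s.shuffleMergeFinalized)

  // Assumed rule: finalization is only scheduled if no finalize task exists yet.
  def shouldScheduleFinalization(s: StageState): Boolean =
    s.pendingPartitions == 0 && s.shuffleMergeEnabled &&
      !s.shuffleMergeFinalized && !s.finalizeTaskPresent

  def main(args: Array[String]): Unit = {
    // The retried attempt after all of its tasks finish, in the state listed above:
    val retriedAttempt = StageState(
      running = true,
      pendingPartitions = 0,
      shuffleMergeEnabled = true,
      shuffleMergeFinalized = false,
      finalizeTaskPresent = true) // leftover finalize task from the previous attempt

    // Both checks come back false, so the stage is stuck: it is neither marked
    // complete nor is a new finalization scheduled.
    println(s"can complete stage? ${canCompleteStage(retriedAttempt)}")              // false
    println(s"schedule finalization? ${shouldScheduleFinalization(retriedAttempt)}") // false
  }
}
{code}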



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org