[ https://issues.apache.org/jira/browse/SPARK-37821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
muhong updated SPARK-37821:
---------------------------
Description: 
This problem occurs in long-running Spark applications such as the Thrift Server.

Because there is only one SparkContext instance on the Thrift Server driver side, if the concurrency of SQL requests is high, or the SQL statements are complicated (each creating many RDDs), RDDs are generated very quickly and RDD ids (SparkContext.scala#nextRddId: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala]) are consumed just as fast. After a few months nextRddId overflows Int.MaxValue (for example, at roughly 300 new RDDs per second, the 2^31 non-negative ids are exhausted in about 83 days) and newRddId starts returning negative numbers. But an RDD's block id must be non-negative, because the block-id pattern is val RDD = "rdd_([0-9]+)_([0-9]+)".r ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala]), so parsing fails with an exception such as "Failed to parse rdd_-2123452330_2 into block ID". Data can then no longer be exchanged during SQL execution, and the SQL statement fails.

Once rddId has overflowed, the error occurs when rdd.mapPartitions executes, and it surfaces on the driver side while the driver deserializes the block id from the "block message" input stream: when an executor runs rdd.mapPartitions it calls the block manager to report block status, and because the block id is negative, when the message is sent back to the driver the driver's regex fails to match and the exception is thrown.

How could this be fixed? One possible change in SparkContext.scala:

{code:java}
...
private val nextShuffleId = new AtomicInteger(0)

private[spark] def newShuffleId(): Int = nextShuffleId.getAndIncrement()

// changed: val -> var so the counter can be replaced on overflow;
// @volatile makes the replacement visible to the unsynchronized fast path
@volatile private var nextRddId = new AtomicInteger(0)

/** Register a new RDD, returning its RDD ID */
private[spark] def newRddId(): Int = {
  var id = nextRddId.getAndIncrement()
  if (id >= 0) { // fast path: 0 is also a valid id
    return id
  }
  this.synchronized {
    id = nextRddId.getAndIncrement()
    if (id < 0) { // still overflowed: reset the counter and retry
      nextRddId = new AtomicInteger(0)
      id = nextRddId.getAndIncrement()
    }
  }
  id
}
...
{code}
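To see the failure chain end to end, here is a minimal, self-contained sketch (BlockIdOverflowDemo is illustrative, not Spark code; the RDD pattern mirrors the regex in BlockId.scala): the counter wraps to Int.MinValue once it passes Int.MaxValue, and the block-id regex then rejects the resulting name.

{code:java}
import java.util.concurrent.atomic.AtomicInteger

// Illustrative only, not Spark code. The RDD pattern below mirrors
// org.apache.spark.storage.BlockId.
object BlockIdOverflowDemo {
  val RDD = "rdd_([0-9]+)_([0-9]+)".r

  def main(args: Array[String]): Unit = {
    // Start the id counter just below the overflow point.
    val nextRddId = new AtomicInteger(Int.MaxValue)
    println(nextRddId.getAndIncrement()) // 2147483647, the last valid id
    val id = nextRddId.getAndIncrement() // wraps to -2147483648
    println(id)

    val blockName = s"rdd_${id}_2" // "rdd_-2147483648_2"
    blockName match {
      case RDD(rddId, split) => println(s"parsed rdd $rddId, split $split")
      case _ =>
        // This is the branch Spark hits, raising
        // "Failed to parse rdd_-2147483648_2 into block ID".
        println(s"Failed to parse $blockName into block ID")
    }
  }
}
{code}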
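And a quick exercise of the reset-on-overflow idea, pulled out into a standalone class so it can run outside Spark (ResettingIdGenerator and its demo are hypothetical names, not Spark code; the start parameter exists only so the demo reaches the overflow point without waiting months):

{code:java}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical standalone version of the patched newRddId() above.
class ResettingIdGenerator(start: Int = 0) {
  @volatile private var counter = new AtomicInteger(start)

  def nextId(): Int = {
    val id = counter.getAndIncrement()
    if (id >= 0) id
    else this.synchronized {
      val retry = counter.getAndIncrement()
      if (retry >= 0) retry
      else {
        counter = new AtomicInteger(0) // still overflowed: reset
        counter.getAndIncrement()
      }
    }
  }
}

object ResettingIdGeneratorDemo {
  def main(args: Array[String]): Unit = {
    val gen = new ResettingIdGenerator(Int.MaxValue - 1)
    val ids = (1 to 4).map(_ => gen.nextId())
    println(ids)               // Vector(2147483646, 2147483647, 0, 1)
    assert(ids.forall(_ >= 0)) // never hands out a negative id
  }
}
{code}

Note that after a reset the generator hands out ids starting from 0 again, which could collide with block names of long-lived cached RDDs registered under the same ids, so a complete fix would also need to handle id reuse (or widen the id type).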
> Spark Thrift Server RDD ID overflow leads to SQL execution failure
> -------------------------------------------------------------------
>
>                 Key: SPARK-37821
>                 URL: https://issues.apache.org/jira/browse/SPARK-37821
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: muhong
>            Priority: Major