Hi,

While running the following Spark code on a cluster with the configuration
below, it is split into 3 job IDs.

CLUSTER CONFIGURATION

3-node cluster

Node 1 - 64 GB, 16 cores
Node 2 - 64 GB, 16 cores
Node 3 - 64 GB, 16 cores


At job ID 2, the job is stuck at stage 51 of 254 and then starts spilling to
disk. I am not sure why this is happening, and my work is completely
blocked. Could someone help me with this?

I have attached a screenshot of the stuck Spark stages for reference.

Please let me know if you need more details about the setup or the code.
Thanks



code:

  import org.apache.log4j.{Level, Logger}
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.ERROR)

    // Note: master("local[*]") runs everything on a single machine;
    // a cluster master URL is needed to actually use all 3 nodes.
    val ss = SparkSession
      .builder
      .appName("join_association")
      .master("local[*]")
      .getOrCreate()

    import ss.implicits._

    val dframe = ss.read
      .option("inferSchema", value = true)
      .option("delimiter", ",")
      .csv("in/matrimony.txt")

    dframe.show()
    dframe.printSchema()

    // Self-join: pair rows that share _c0 but have different _c1 values
    val dfLeft  = dframe.withColumnRenamed("_c1", "left_data")
    val dfRight = dframe.withColumnRenamed("_c1", "right_data")

    val joined = dfLeft
      .join(dfRight, dfLeft.col("_c0") === dfRight.col("_c0"))
      .filter(col("left_data") =!= col("right_data")) // =!= replaces the deprecated !==

    joined.show()

    val result = joined.select(col("left_data"), col("right_data") as "similar_ids")

    result.write.csv("/output")

    ss.stop()
  }
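To make the join's behaviour concrete, here is a plain-Scala sketch (no Spark; the object and function names are hypothetical, not from the code above) of what the self-join plus inequality filter computes per key. It shows why the output can grow quadratically when many rows share one `_c0` value -- a common reason for a join stage stalling and spilling to disk.

```scala
// Plain-Scala model of the self-join: for every key, emit all ordered
// pairs of distinct values. Output size per key is n * (n - 1), so a
// heavily repeated key (skew) inflates the result quadratically.
object SelfJoinSketch {
  // rows: (key, value) pairs, like (_c0, _c1) in the DataFrame
  def pairsPerKey(rows: Seq[(String, String)]): Seq[(String, String)] =
    rows.groupBy(_._1).values.toSeq.flatMap { group =>
      for {
        (_, left)  <- group
        (_, right) <- group
        if left != right
      } yield (left, right)
    }
}

// 3 rows share key "k1" -> 3 * 2 = 6 ordered pairs; "k2" alone yields none
val demo = SelfJoinSketch.pairsPerKey(
  Seq(("k1", "a"), ("k1", "b"), ("k1", "c"), ("k2", "x")))
println(demo.length) // 6
```

Counting rows per `_c0` (e.g. with `dframe.groupBy("_c0").count().orderBy(col("count").desc).show()`) would reveal whether a few hot keys are responsible for the stall.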



-- 
REGARDS
BALAKUMAR SEETHARAMAN
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
