Perhaps you can use Global Temp Views?

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createGlobalTempView
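For example, a rough sketch (assuming an active SparkSession named `spark`;
names here are placeholders, not from your code):

    df = spark.sql("SELECT ... FROM source_table WHERE dt < '20200801'")

    # Register once; the view lives in the reserved `global_temp` database
    # and is visible to every SparkSession in the same application.
    df.createGlobalTempView('temp_table')

    # Other sessions (e.g. one per thread) can then query it:
    spark.newSession().sql("SELECT * FROM global_temp.temp_table")

The global temp view stays alive until the Spark application terminates.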


On 24/09/2020 14:52, Gang Li wrote:
Hi all,

There are three jobs that all share the same first RDD. Can that first
RDD be computed once, with the subsequent operations then running in
parallel?

<http://apache-spark-user-list.1001560.n3.nabble.com/file/t11001/1600954813075-image.png>

My code is as follows:

sqls = ["""
             INSERT OVERWRITE TABLE `spark_input_test3` PARTITION
(dt='20200917')
             SELECT id, plan_id, status, retry_times, start_time,
schedule_datekey
             FROM temp_table where status=3""",
             """
             INSERT OVERWRITE TABLE `spark_input_test4` PARTITION
(dt='20200917')
             SELECT id, cur_inst_id, status, update_time, schedule_time,
task_name
             FROM temp_table where schedule_time > '2020-09-01 00:00:00'
             """]

import threading

from pyspark import StorageLevel

# `spark` is assumed to be an existing SparkSession.

def multi_thread():
    sql = """SELECT id, plan_id, status, retry_times, start_time,
                    schedule_datekey, task_name, update_time, schedule_time,
                    cur_inst_id, scheduler_id
             FROM table
             WHERE dt < '20200801'"""
    # persist() is lazy: nothing is cached until an action runs
    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    threads = []
    for i in range(len(sqls)):
        try:
            t = threading.Thread(target=insert_overwrite_thread,
                                 args=(sqls[i], data))
            t.start()
            threads.append(t)
        except Exception as x:
            print(x)
    for t in threads:
        t.join()

def insert_overwrite_thread(sql, data):
    # Each thread registers the same cached DataFrame and runs its own
    # INSERT OVERWRITE job against it.
    data.createOrReplaceTempView('temp_table')
    spark.sql(sql)



Since Spark evaluates lazily, persist() by itself does not materialize
anything, so the first query is still computed multiple times when the
jobs are submitted in parallel.
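(One workaround I can think of, sketched below: force materialization with
an action before starting the threads, so both INSERT jobs read from the
cache; count() here is just an example action.)

    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    data.count()  # action: computes the query once and fills the cache
    # ...then start the insert_overwrite_thread threads as above...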
I would like to ask you if there are other ways, thanks!

Cheers,
Gang Li




---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

