Missing / Duplicate Data when Spark retries

2020-09-09 Thread Ruijing Li
Hi all, I am on Spark 2.4.4 using Mesos as the task resource scheduler. The context: my job maps over multiple datasets; for each dataset it takes one dataframe from a parquet file at one HDFS path and another dataframe from a second HDFS path, unions them by name, then deduplicates by most …
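A minimal sketch of that read / unionByName / deduplicate pattern in Spark (Scala), for reference. The original message is cut off, so the HDFS paths, the key column `id`, the recency column `updated_at`, and the "keep the most recent row" interpretation of the dedup step are all assumptions, not the poster's actual code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object UnionDedupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("union-dedup-sketch")
      .getOrCreate()

    // Hypothetical HDFS paths; the original post does not name them.
    val dfA = spark.read.parquet("hdfs:///data/source_a/dataset1")
    val dfB = spark.read.parquet("hdfs:///data/source_b/dataset1")

    // Union by column name rather than by position, as described in the post.
    val unioned = dfA.unionByName(dfB)

    // Assumption: "deduplicates by most ..." is read as keeping the most recent
    // row per key, using hypothetical `id` and `updated_at` columns.
    val w = Window.partitionBy(col("id")).orderBy(col("updated_at").desc)
    val deduped = unioned
      .withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")

    deduped.write.mode("overwrite").parquet("hdfs:///data/output/dataset1")

    spark.stop()
  }
}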

subscribe user@spark.apache.org

2020-09-09 Thread Joan
I want to subscribe to user@spark.apache.org, thanks a lot!