Missing / Duplicate Data when Spark retries
Hi all, I'm on Spark 2.4.4 using Mesos as the resource scheduler. My job maps over multiple datasets: for each dataset it reads one DataFrame from a Parquet file at one HDFS path and another DataFrame from a second HDFS path, unions them by name, then deduplicates by most recent.