Re: Missing / Duplicate Data when Spark retries

2020-09-10 Thread Ruijing Li
I agree, Sean, although it's strange since we aren’t using any UDFs but sticking to Spark-provided functions. If anyone in the community has seen such an issue before, I would be happy to learn more! On Thu, Sep 10, 2020 at 6:01 AM Sean Owen wrote: > It's more likely a subtle issue with your code

Re: Missing / Duplicate Data when Spark retries

2020-09-10 Thread Sean Owen
It's more likely a subtle issue with your code or data, but hard to say without knowing more. The lineage is fine and deterministic, but your data or operations might not be. On Thu, Sep 10, 2020 at 12:03 AM Ruijing Li wrote: > > Hi all, > > I am on Spark 2.4.4 using Mesos as the task resource

RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-09-10 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi All, We tried to regenerate the TPC DS data on S3, and after regeneration we see that the queries are running faster and the execution time is now comparable with the execution time on HDFS with Spark 3.0.0. So maybe there was some issue in generating the TPC DS data the first time due to which we