Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-15 Thread Purushotham Pushpavanthar
Hi, below is the create statement for my Hudi dataset. *CREATE EXTERNAL TABLE `inventory`.`customer`(`_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `id` bigint, `sales` bigint, `merc

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-15 Thread Vinoth Chandar
Hi, are you setting the path filters when you query the Hudi Hive table via Spark http://hudi.apache.org/querying_data.html#spark-ro-view (or http://hudi.apache.org/querying_data.html#spark-rt-view alternatively)? - Vinoth On Fri, Nov 15, 2019 at 5:03 AM Purushotham Pushpavanthar < pushpavant...

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-15 Thread Kabeer Ahmed
Adding to Vinoth's response, in spark-shell you just need to copy and paste the below line. Let us know if it still doesn't work. spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter])

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-16 Thread Purushotham Pushpavanthar
Thanks Vinoth and Kabeer. It resolved my problem. Regards, Purushotham Pushpavanth On Fri, 15 Nov 2019 at 20:16, Kabeer Ahmed wrote: > Adding to Vinoth's response, in spark-shell you just need to copy and > paste the below line. Let us know if it still doesn't work.

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-16 Thread Vinoth Chandar
Sweet! On Sat, Nov 16, 2019 at 10:16 AM Purushotham Pushpavanthar < pushpavant...@gmail.com> wrote: > Thanks Vinoth and Kabeer. It resolved my problem.

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-18 Thread Pratyaksh Sharma
Hi Vinoth/Kabeer, I have one small doubt regarding what you proposed to fix the issue. Why is the HoodieParquetInputFormat class not able to handle deduplication of records in the case of Spark, while it is able to do so in the case of Presto and Hive?

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-18 Thread Purushotham Pushpavanthar
Kabeer, can you please share the *PySpark* command to register the path filter class? Regards, Purushotham Pushpavanth

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-18 Thread Purushotham Pushpavanthar
Figured it out. The below command worked for me in PySpark. *spark._jsc.hadoopConfiguration().set('mapreduce.input.pathFilter.class','org.apache.hudi.hadoop.HoodieROTablePathFilter')* Regards, Purushotham Pushpavanth
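For anyone verifying the fix, the symptom can also be checked at the row level: duplicate rows share the same `_hoodie_record_key` but carry different `_hoodie_commit_time` values, and only the latest commit should survive. A minimal pure-Python sketch of that check (no Spark involved; the column names are the standard Hudi meta columns mentioned in the thread, the sample data is made up):

```python
# Toy rows as dicts: duplicates share _hoodie_record_key but differ in
# _hoodie_commit_time (an older and a newer file version of the record).
rows = [
    {"_hoodie_record_key": "id:1", "_hoodie_commit_time": "20191115093000", "sales": 10},
    {"_hoodie_record_key": "id:1", "_hoodie_commit_time": "20191116101500", "sales": 12},
    {"_hoodie_record_key": "id:2", "_hoodie_commit_time": "20191115093000", "sales": 7},
]

def latest_per_key(rows):
    """Keep only the row with the highest commit time for each record key."""
    latest = {}
    for row in rows:
        key = row["_hoodie_record_key"]
        if key not in latest or row["_hoodie_commit_time"] > latest[key]["_hoodie_commit_time"]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: r["_hoodie_record_key"])

deduped = latest_per_key(rows)
# id:1 resolves to the newer commit's value; id:2 is untouched.
```

With the path filter registered, a Spark query should return rows equivalent to `deduped`; seeing both versions of `id:1` indicates the filter is not being applied.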

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-18 Thread Bhavani Sudha
Hi Pratyaksh, Let me try to answer this. I believe Spark does not natively invoke HoodieParquetInputFormat.getSplits() like Hive and Presto do. So when queried, Spark just loads all the data files in that partition without applying the Hoodie filtering logic. That's why we need to instruct Spark to register the path filter class.
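Sudha's explanation can be illustrated with a toy model of the file layout: each Hudi file group accumulates one parquet file per commit, and the job of the read-optimized path filter is to let through only the latest file slice per group. A pure-Python sketch of that selection (the file-group naming here is simplified and hypothetical, not Hudi's real naming scheme):

```python
# Each (file_group, commit_time) pair stands for one parquet file on disk.
# Without the path filter Spark reads every file, so a record updated in a
# later commit appears once per retained file version -- the duplicates
# reported in this thread.
files = [
    ("fg-0001", "20191115093000"),
    ("fg-0001", "20191116101500"),  # newer slice of the same file group
    ("fg-0002", "20191115093000"),
]

def accept_latest_slices(files):
    """Mimic what HoodieROTablePathFilter achieves: keep only the newest
    commit per file group, discarding superseded file versions."""
    newest = {}
    for group, commit in files:
        if group not in newest or commit > newest[group]:
            newest[group] = commit
    return sorted(newest.items())

visible = accept_latest_slices(files)
```

Hive and Presto get this pruning for free via HoodieParquetInputFormat.getSplits(); Spark's native parquet reader bypasses it, which is why the filter has to be registered on the Hadoop configuration.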

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-19 Thread Kabeer Ahmed
…It may take some time from you to go through these paths and digest the flow. But should you still have any questions, please do not hesitate to revert back. Hope this helps. Kabeer.

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-19 Thread Pratyaksh Sharma

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-20 Thread Kabeer Ahmed

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-20 Thread Bhavani Sudha

Re: Spark v2.3.2 : Duplicate entries found for each primary Key

2019-11-25 Thread Bhavani Sudha