Hi Vinoth/Kabeer,

I have one small doubt regarding what you proposed to fix the issue. Why is
the HoodieParquetInputFormat class not able to handle deduplication of
records in the case of Spark, while it is able to do so for Presto and Hive?
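
For reference, this is the setting from the thread below that we apply in
spark-shell before querying the table (a minimal sketch, assuming a session
started with the Hudi Spark bundle on the classpath):

// Route Spark's parquet file listing through Hudi's RO-view path filter,
// so that only the latest file version per file group is read.
spark.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
  classOf[org.apache.hadoop.fs.PathFilter]);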

On Sun, Nov 17, 2019 at 4:08 AM Vinoth Chandar <vin...@apache.org> wrote:

> Sweet!
>
> On Sat, Nov 16, 2019 at 10:16 AM Purushotham Pushpavanthar <
> pushpavant...@gmail.com> wrote:
>
> > Thanks Vinoth and Kabeer. It resolved my problem.
> >
> > Regards,
> > Purushotham Pushpavanth
> >
> >
> >
> > On Fri, 15 Nov 2019 at 20:16, Kabeer Ahmed <kab...@linuxmail.org> wrote:
> >
> > > Adding to Vinoth's response: in spark-shell you just need to copy and
> > > paste the line below. Let us know if it still doesn't work.
> > >
> > > spark.sparkContext.hadoopConfiguration.setClass(
> > >   "mapreduce.input.pathFilter.class",
> > >   classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
> > >   classOf[org.apache.hadoop.fs.PathFilter]);
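> > >
> > > As a quick check (a sketch; the table and column names are from the DDL
> > > further down in this thread, and the key value is just an example), a key
> > > that has received multiple updates should now come back as a single row:
> > >
> > > // With the path filter set, Spark reads only the latest parquet file
> > > // per file group, so each record key appears once.
> > > spark.sql("SELECT _hoodie_commit_time, id FROM inventory.customer WHERE id = 1").show()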
> > > On Nov 15 2019, at 1:37 pm, Vinoth Chandar <vin...@apache.org> wrote:
> > > > Hi,
> > > >
> > > > are you setting the path filters when you query the Hudi Hive table via
> > > > Spark http://hudi.apache.org/querying_data.html#spark-ro-view (or
> > > > http://hudi.apache.org/querying_data.html#spark-rt-view alternatively)?
> > > >
> > > > - Vinoth
> > > > On Fri, Nov 15, 2019 at 5:03 AM Purushotham Pushpavanthar <
> > > > pushpavant...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > > Below is the CREATE statement for my Hudi dataset.
> > > > >
> > > > > CREATE EXTERNAL TABLE `inventory`.`customer`(
> > > > >   `_hoodie_commit_time` string,
> > > > >   `_hoodie_commit_seqno` string,
> > > > >   `_hoodie_record_key` string,
> > > > >   `_hoodie_partition_path` string,
> > > > >   `_hoodie_file_name` string,
> > > > >   `id` bigint,
> > > > >   `sales` bigint,
> > > > >   `merchant` bigint,
> > > > >   `item_status` bigint,
> > > > >   `tem_shipment` bigint)
> > > > > PARTITIONED BY (`dt` string)
> > > > > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> > > > > WITH SERDEPROPERTIES ('serialization.format' = '1')
> > > > > STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> > > > > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> > > > > LOCATION 's3://<warehouse-bucket>/<path>/inventory/customer'
> > > > > TBLPROPERTIES (
> > > > >   'bucketing_version' = '2',
> > > > >   'transient_lastDdlTime' = '1572952974',
> > > > >   'last_commit_time_sync' = '20191114192136')
> > > > >
> > > > > I've taken care of adding hudi-hive-bundle-0.5.1-SNAPSHOT.jar in Hive,
> > > > > hudi-presto-bundle-0.5.1-SNAPSHOT.jar in Presto and
> > > > > hudi-spark-bundle-0.5.1-SNAPSHOT.jar in Spark (all three share a common
> > > > > Metastore).
> > > > > We are running Hudi in COW mode and we noticed that there are multiple
> > > > > versions of the .parquet files written per partition, depending on the
> > > > > number of updates coming to them over each batch execution. When queried
> > > > > from Hive or Presto for any primary key having multiple updates, it
> > > > > returns a single record with the latest state (I assume
> > > > > HoodieParquetInputFormat does the magic of taking care of duplicates).
> > > > > Whereas when I tried to execute the same query in Spark SQL, I get
> > > > > duplicated records for any primary key having multiple updates.
> > > > >
> > > > > Can someone help me understand why Spark is not able to handle
> > > > > deduplication of records across multiple commits which Presto and Hive
> > > > > are able to do?
> > > > > I've taken care of providing hudi-spark-bundle-0.5.1-SNAPSHOT.jar while
> > > > > starting spark-shell. Is there something that I'm missing?
> > > > >
> > > > > Thanks in advance.
> > > > > Regards,
> > > > > Purushotham Pushpavanth
> > > >
> > > >
> > >
> > >
> >
>
