Thanks Kabeer. I think this would be a great FAQ entry for new users.

I think you should already be able to contribute to the FAQ as mentioned
here - https://hudi.incubator.apache.org/community.html#accounts and
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-ContributingtoFAQ.
Please let me know if you run into any trouble.

Thanks,
Sudha

On Wed, Nov 20, 2019 at 9:20 AM Kabeer Ahmed <kab...@linuxmail.org> wrote:

> Sudha
>
> Do you think this is a good addition to the FAQ? It might be a common
> question when a new user first goes through the Hudi documentation. You
> could add it, or I am happy to do it if you give me access. My Apache JIRA
> id is: smdahmed.
> Thanks,
> Kabeer.
>
> On Nov 20 2019, at 7:23 am, Pratyaksh Sharma <pratyaks...@gmail.com>
> wrote:
> > Thank you for the explanation Kabeer/Sudha.
> >
> > Let me go through the flow and get back to you in case of any further
> > queries.
> > On Wed, Nov 20, 2019 at 6:21 AM Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > > Pratyaksh,
> > > +1 to what Sudha has written. Let's zoom in a bit closer.
> > > For Hive, as you said, we explicitly set the input format to
> > > HoodieParquetInputFormat.
> > > - HoodieParquetInputFormat extends MapredParquetInputFormat, which is
> > > nothing but an input format for Hive. Hive and Presto depend on this
> > > input format to retrieve the dataset from Hudi.
> > >
> > > For Spark, there is no such option to set this explicitly. Rather, Spark
> > > starts reading the paths directly from the file system (HDFS or S3). From
> > > Spark the calls would be as below:
> > > - org.apache.spark.rdd.NewHadoopRDD.getPartitions
> > > - org.apache.parquet.hadoop.ParquetInputFormat.getSplits
> > > - org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits
> > >
> > > Now it is evident that we can't plug in HoodieParquetInputFormat here.
> > > Rather, we rely on the PathFilter class, which allows us to filter out
> > > the paths (and files). So we explicitly set this in the Spark Hadoop
> > > configuration (note that Spark uses the Hadoop FS S3 implementation to
> > > read from S3).
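> > >
> > > As a minimal spark-shell sketch of that (the S3 base path and the
> > > partition glob depth below are illustrative placeholders, not taken
> > > from this thread):
> > >
> > > import org.apache.hadoop.fs.PathFilter
> > > import org.apache.hudi.hadoop.HoodieROTablePathFilter
> > >
> > > // Make Spark's file listing apply Hudi's path filter so that only the
> > > // latest file slice per file group is picked up.
> > > spark.sparkContext.hadoopConfiguration.setClass(
> > >   "mapreduce.input.pathFilter.class",
> > >   classOf[HoodieROTablePathFilter],
> > >   classOf[PathFilter])
> > >
> > > // Placeholder base path; one "*" per partition level plus one for files.
> > > val df = spark.read.parquet("s3://bucket/warehouse/inventory/customer/*/*")
> > > df.createOrReplaceTempView("customer_ro")
> > > spark.sql("select count(*) from customer_ro").show()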
> > >
> > > If you look into HoodieROTablePathFilter, you will see that there is
> > > logic to ensure that, for Hudi-related folders (paths) and files, only
> > > the latest path/file is selected. Thus you do not see duplicate entries
> > > once you set the filter. Without it, Spark just plainly reads all the
> > > parquet files and displays the data within them.
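> > >
> > > Purely as an illustration of that selection (the partition path below
> > > is a made-up example, and this assumes the filter's no-argument
> > > constructor as used via the Hadoop configuration), you can apply the
> > > filter by hand to a listing and see which parquet files it keeps:
> > >
> > > import org.apache.hadoop.conf.Configuration
> > > import org.apache.hadoop.fs.Path
> > > import org.apache.hudi.hadoop.HoodieROTablePathFilter
> > >
> > > // List one partition and print whether the filter accepts each parquet
> > > // file; only the latest file slice should come back as "true".
> > > val partition = new Path("s3://bucket/warehouse/inventory/customer/dt=2019-11-14")
> > > val fs = partition.getFileSystem(new Configuration())
> > > val filter = new HoodieROTablePathFilter()
> > > fs.listStatus(partition).map(_.getPath)
> > >   .filter(_.getName.endsWith(".parquet"))
> > >   .foreach(p => println(s"${filter.accept(p)} -> $p"))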
> > >
> > > It may take some time to go through these paths and digest the flow.
> > > But should you still have any questions, please do not hesitate to
> > > reach out.
> > >
> > > Hope this helps
> > > Kabeer.
> > >
> > >
> > >
> > > Sent: Monday, November 18, 2019 at 7:47 PM
> > > From: "Bhavani Sudha" <bhavanisud...@gmail.com>
> > > To: dev@hudi.apache.org
> > > Subject: Re: Spark v2.3.2 : Duplicate entries found for each primary Key
> > > Hi Pratyaksh,
> > >
> > > Let me try to answer this. I believe Spark does not natively invoke
> > > HoodieParquetInputFormat.getSplits() the way Hive and Presto do. So when
> > > queried, Spark just loads all the data files in that partition without
> > > applying the Hoodie filtering logic. That's why we need to instruct Spark
> > > to read in the appropriate way, using one of the two approaches suggested
> > > by Vinoth/Kabeer earlier.
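> > >
> > > As a rough sketch of the datasource route (assuming hudi-spark-bundle is
> > > on the classpath; the base path, glob depth and the "org.apache.hudi"
> > > format name may need adjusting for your version and partitioning):
> > >
> > > // Read the read-optimized view through the Hudi datasource instead of
> > > // raw parquet paths, so Hudi picks the latest file slice per file group.
> > > val df = spark.read
> > >   .format("org.apache.hudi")
> > >   .load("s3://bucket/warehouse/inventory/customer/*/*")
> > > df.select("_hoodie_record_key", "id", "sales").show(10)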
> > >
> > > Thanks,
> > > Sudha
> > >
> > > On Mon, Nov 18, 2019 at 12:16 AM Pratyaksh Sharma <pratyaks...@gmail.com>
> > > wrote:
> > >
> > > > Hi Vinoth/Kabeer,
> > > > I have one small doubt regarding what you proposed to fix the issue.
> > > > Why is the HoodieParquetInputFormat class not able to handle
> > > > deduplication of records in the case of Spark, while it is able to do
> > > > so in the case of Presto and Hive?
> > > >
> > > > On Sun, Nov 17, 2019 at 4:08 AM Vinoth Chandar <vin...@apache.org>
> > > > wrote:
> > > >
> > > > > Sweet!
> > > > > On Sat, Nov 16, 2019 at 10:16 AM Purushotham Pushpavanthar <
> > > > > pushpavant...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Vinoth and Kabeer. It resolved my problem.
> > > > > > Regards,
> > > > > > Purushotham Pushpavanth
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, 15 Nov 2019 at 20:16, Kabeer Ahmed <kab...@linuxmail.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Adding to Vinoth's response, in spark-shell you just need to copy
> > > > > > > and paste the snippet below. Let us know if it still doesn't work.
> > > > > > >
> > > > > > > spark.sparkContext.hadoopConfiguration.setClass(
> > > > > > >   "mapreduce.input.pathFilter.class",
> > > > > > >   classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
> > > > > > >   classOf[org.apache.hadoop.fs.PathFilter]);
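> > > > > > >
> > > > > > > Once that is set, a quick sanity check from Spark SQL (using the
> > > > > > > table and columns from the DDL further down this thread) should
> > > > > > > come back empty:
> > > > > > >
> > > > > > > // Every _hoodie_record_key should now map to a single row, so the
> > > > > > > // duplicate check below should return no results.
> > > > > > > spark.sql(
> > > > > > >   """select _hoodie_record_key, count(*) as cnt
> > > > > > >     |from inventory.customer
> > > > > > >     |group by _hoodie_record_key
> > > > > > >     |having count(*) > 1""".stripMargin).show()
> > > > > > >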
> > > > > > > On Nov 15 2019, at 1:37 pm, Vinoth Chandar <vin...@apache.org>
> > > > > > > wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Are you setting the path filters when you query the Hudi Hive
> > > > > > > > table via Spark
> > > > > > > > (http://hudi.apache.org/querying_data.html#spark-ro-view, or
> > > > > > > > http://hudi.apache.org/querying_data.html#spark-rt-view
> > > > > > > > alternatively)?
> > > > > > > >
> > > > > > > > - Vinoth
> > > > > > > > On Fri, Nov 15, 2019 at 5:03 AM Purushotham Pushpavanthar <
> > > > > > > > pushpavant...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > > Below is a create statement on my Hudi dataset.
> > > > > > > > >
> > > > > > > > > CREATE EXTERNAL TABLE `inventory`.`customer`(
> > > > > > > > >   `_hoodie_commit_time` string, `_hoodie_commit_seqno` string,
> > > > > > > > >   `_hoodie_record_key` string, `_hoodie_partition_path` string,
> > > > > > > > >   `_hoodie_file_name` string, `id` bigint, `sales` bigint,
> > > > > > > > >   `merchant` bigint, `item_status` bigint, `tem_shipment` bigint)
> > > > > > > > > PARTITIONED BY (`dt` string)
> > > > > > > > > ROW FORMAT SERDE
> > > > > > > > >   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> > > > > > > > > WITH SERDEPROPERTIES ('serialization.format' = '1')
> > > > > > > > > STORED AS INPUTFORMAT
> > > > > > > > >   'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> > > > > > > > > OUTPUTFORMAT
> > > > > > > > >   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> > > > > > > > > LOCATION 's3://<warehouse-bucket>/<path>/inventory/customer'
> > > > > > > > > TBLPROPERTIES ('bucketing_version' = '2',
> > > > > > > > >   'transient_lastDdlTime' = '1572952974',
> > > > > > > > >   'last_commit_time_sync' = '20191114192136')
> > > > > > > > >
> > > > > > > > > I've taken care of adding hudi-hive-bundle-0.5.1-SNAPSHOT.jar in
> > > > > > > > > Hive, hudi-presto-bundle-0.5.1-SNAPSHOT.jar in Presto and
> > > > > > > > > hudi-spark-bundle-0.5.1-SNAPSHOT.jar in Spark (all three share a
> > > > > > > > > common Metastore).
> > > > > > > > > We are running Hudi in COW mode and we noticed that there are
> > > > > > > > > multiple versions of the .parquet files written per partition,
> > > > > > > > > depending on the number of updates coming to them over each
> > > > > > > > > batch execution. When queried from Hive and Presto, any primary
> > > > > > > > > key having multiple updates returns a single record with the
> > > > > > > > > latest state (I assume HoodieParquetInputFormat does the magic
> > > > > > > > > of taking care of duplicates). Whereas, when I tried to execute
> > > > > > > > > the same query in Spark SQL, I get duplicate records for any
> > > > > > > > > primary key having multiple updates.
> > > > > > > > >
> > > > > > > > > Can someone help me understand why Spark is not able to handle
> > > > > > > > > deduplication of records across multiple commits, which Presto
> > > > > > > > > and Hive are able to do? I've taken care of providing
> > > > > > > > > hudi-spark-bundle-0.5.1-SNAPSHOT.jar while starting spark-shell.
> > > > > > > > > Is there something that I'm missing?
> > > > > > > > >
> > > > > > > > > Thanks in advance.
> > > > > > > > > Regards,
> > > > > > > > > Purushotham Pushpavanth