Hi Vinoth,

Some follow-up questions on this thread. Here is what I found after running for a few days: in the .hoodie folder there is an obvious dividing line (listed at the end of this email), probably due to the retention policy. Before that line, the clean/commit files were archived, and I find duplicates when querying (via spark-shell) the partitions corresponding to the inflight commits. After the line, everything behaves normally and global dedup works. The duplicates were found in the parquet files written by the inflight commits. Wondering if this is expected?

Q:
1. An inflight commit should be rolled back by the next write. Is it normal that so many inflight commits never were? Or can I configure a retention policy to roll back inflight commits some other way?
2. Does the commit retention policy not archive inflight commits?
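For reference, the duplicate check I ran in spark-shell looks roughly like the comment below; it uses Hudi's `_hoodie_record_key` meta column and a placeholder path. The runnable part mirrors the same grouping logic over a plain Scala collection with made-up keys, just to show the check itself:

```scala
// In spark-shell one can check a partition for duplicate record keys with
// (path is a placeholder, _hoodie_record_key is Hudi's record-key meta column):
//   spark.read.parquet("<basePath>/<partition>")
//     .groupBy("_hoodie_record_key").count().filter("count > 1").show()
//
// The same duplicate-detection logic over a plain collection:
object DupCheck extends App {
  val recordKeys = Seq("key-1", "key-2", "key-2", "key-3") // simulated keys
  val duplicates = recordKeys
    .groupBy(identity)
    .collect { case (key, occurrences) if occurrences.size > 1 => key -> occurrences.size }
  println(duplicates) // key-2 occurs twice
}
```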
2019-04-23 20:23:47    378 20190423122339.deltacommit.inflight
2019-04-23 20:43:53    378 20190423124343.deltacommit.inflight
2019-04-23 22:14:04    378 20190423141354.deltacommit.inflight
2019-04-23 22:44:09    378 20190423144400.deltacommit.inflight
2019-04-23 22:54:18    378 20190423145408.deltacommit.inflight
2019-04-23 23:04:09    378 20190423150400.deltacommit.inflight
2019-04-23 23:24:30    378 20190423152421.deltacommit.inflight
*2019-04-23 23:44:34   378 20190423154424.deltacommit.inflight*
*2019-04-24 00:15:46   2991 20190423161431.clean*
2019-04-24 00:15:21 870536 20190423161431.deltacommit
2019-04-24 00:25:19   2991 20190423162424.clean
2019-04-24 00:25:09 875825 20190423162424.deltacommit
2019-04-24 00:35:26   2991 20190423163429.clean
2019-04-24 00:35:18 881925 20190423163429.deltacommit
2019-04-24 00:46:14   2991 20190423164428.clean
2019-04-24 00:45:44 888025 20190423164428.deltacommit

Thanks,
Jun

On 2019/04/18 14:29:23, Vinoth Chandar <[email protected]> wrote:
> Hi Jun,
>
> Responses below.
>
> >> 1. Some file inflight may never reach commit?
> Yes. The next attempt at writing will first issue a rollback to clean up
> such partial/leftover files, before it begins the new commit.
>
> >> 2. In the case where an inflight commit and the parquet files it
> >> generated still exist, global dedup will not dedup based on such files?
> Even if not rolled back, we check the inflight parquet files against the
> committed timeline, which they won't be a part of. So it should be safe.
>
> >> 3. In the case where an inflight commit and its parquet files still
> >> exist, the correct query result will be decided by the read config (I
> >> mean mapreduce.input.pathFilter.class in sparksql)?
> Yes. The filtering should work as well. It's the same technique used by
> the writer.
>
> >> 4.
> >> Is there any way we can use
> >>   spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> >>     classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
> >>     classOf[org.apache.hadoop.fs.PathFilter]);
> >> in the Spark Thrift Server when starting it?
> I am not familiar with the Spark Thrift Server myself. Any pointers where
> I can learn more? Two suggestions:
> - You can check whether you can add this to the Hadoop configuration xml
>   files and see if it gets picked up by Spark.
> - Alternatively, you can set the spark config mentioned here:
>   http://hudi.apache.org/querying_data.html#spark-rt-view (works for the
>   ro view also), which I am assuming should be doable at this thrift
>   server.
>
> Thanks
> Vinoth
>
> On Wed, Apr 17, 2019 at 12:08 AM Jun Zhu <[email protected]> wrote:
>
> > Hi,
> > Link: https://github.com/apache/incubator-hudi/issues/639
> > Sorry, I failed to open
> > https://lists.apache.org/[email protected].
> > I have some follow-up questions for issue 639:
> >
> > > So, the sequence of events is: we write parquet files and then, upon
> > > successful writing of all attempted parquet files, we actually mark
> > > the commit as completed (i.e. not inflight anymore). So this is
> > > normal. This is done to prevent queries from reading partially
> > > written parquet files.
> >
> > Does that mean:
> > 1. Some inflight files may never reach commit?
> > 2. In the case where an inflight commit and the parquet files it
> >    generated still exist, global dedup will not dedup based on such
> >    files?
> > 3. In the case where an inflight commit and its parquet files still
> >    exist, the correct query result will be decided by the read config
> >    (I mean mapreduce.input.pathFilter.class in sparksql)?
> > 4.
> > Is there any way we can use
> >   spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> >     classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
> >     classOf[org.apache.hadoop.fs.PathFilter]);
> > in the Spark Thrift Server when starting it?
> >
> > Best,
> > --
> > Jun Zhu
> > Sr. Engineer I, Data
> > +86 18565739171
> > Units 3801, 3804, 38F, C Block, Beijing Yintai Center, Beijing, China
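On the Thrift Server question above, one possible way to pass the path filter at startup (a sketch, not verified against this Spark/Hudi version; the jar path is a placeholder) relies on Spark copying `spark.hadoop.*` settings into the Hadoop Configuration it hands to input formats:

```shell
# Sketch: start the Spark Thrift Server with the Hudi path filter set via
# the spark.hadoop.* prefix, which Spark forwards into the Hadoop conf.
# The bundle jar path below is a placeholder, not a real location.
./sbin/start-thriftserver.sh \
  --jars /path/to/hoodie-hadoop-mr-bundle.jar \
  --conf spark.hadoop.mapreduce.input.pathFilter.class=com.uber.hoodie.hadoop.HoodieROTablePathFilter
```

Whether this takes effect for queries served by the Thrift Server would still need to be confirmed, since it depends on the server reusing that Hadoop configuration for its table scans.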
