Hi Vinoth,

Some follow-up questions on this thread. Here is what I found after running for a few days: in the .hoodie folder there is an obvious dividing line (listed at the end of this email), probably due to the retention policy. Before that line, the clean/commit files were archived, and I find duplicates when querying (via spark-shell) the partitions corresponding to the inflight commits. After the line, everything behaves normally and global dedup works. The duplicates were found in the parquet files written by the inflight commits. Wondering if this is expected?

Q:
1. An inflight commit should be rolled back by the next write. Is it normal that so many inflight commits never were? Or can I configure a retention policy to roll back inflight commits some other way?
2. Does the commit retention policy not archive inflight commits?
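For reference, the duplicate check I ran in spark-shell looks roughly like the comment below; it uses Hudi's `_hoodie_record_key` meta column and a placeholder path. The runnable part mirrors the same grouping logic over a plain Scala collection with made-up keys, just to show the check itself:

```scala
// In spark-shell one can check a partition for duplicate record keys with
// (path is a placeholder, _hoodie_record_key is Hudi's record-key meta column):
//   spark.read.parquet("<basePath>/<partition>")
//     .groupBy("_hoodie_record_key").count().filter("count > 1").show()
//
// The same duplicate-detection logic over a plain collection:
object DupCheck extends App {
  val recordKeys = Seq("key-1", "key-2", "key-2", "key-3") // simulated keys
  val duplicates = recordKeys
    .groupBy(identity)
    .collect { case (key, occurrences) if occurrences.size > 1 => key -> occurrences.size }
  println(duplicates) // key-2 occurs twice
}
```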
2019-04-23 20:23:47    378 20190423122339.deltacommit.inflight
2019-04-23 20:43:53    378 20190423124343.deltacommit.inflight
2019-04-23 22:14:04    378 20190423141354.deltacommit.inflight
2019-04-23 22:44:09    378 20190423144400.deltacommit.inflight
2019-04-23 22:54:18    378 20190423145408.deltacommit.inflight
2019-04-23 23:04:09    378 20190423150400.deltacommit.inflight
2019-04-23 23:24:30    378 20190423152421.deltacommit.inflight
*2019-04-23 23:44:34   378 20190423154424.deltacommit.inflight*
*2019-04-24 00:15:46   2991 20190423161431.clean*
2019-04-24 00:15:21 870536 20190423161431.deltacommit
2019-04-24 00:25:19   2991 20190423162424.clean
2019-04-24 00:25:09 875825 20190423162424.deltacommit
2019-04-24 00:35:26   2991 20190423163429.clean
2019-04-24 00:35:18 881925 20190423163429.deltacommit
2019-04-24 00:46:14   2991 20190423164428.clean
2019-04-24 00:45:44 888025 20190423164428.deltacommit

Thanks,
Jun

On 2019/04/18 14:29:23, Vinoth Chandar <[email protected]> wrote:
> Hi Jun,
>
> Responses below.
>
> >> 1. Some file inflight may never reach commit?
> Yes. The next attempt at writing will first issue a rollback to clean up
> such partial/leftover files, before it begins the new commit.
>
> >> 2. In the case where an inflight commit and the parquet files it
> >> generated still exist, global dedup will not dedup based on such files?
> Even if not rolled back, we check the inflight parquet files against the
> committed timeline, which they won't be a part of. So it should be safe.
>
> >> 3. In the case where an inflight commit and its parquet files still
> >> exist, the correct query result will be decided by the read config (I
> >> mean mapreduce.input.pathFilter.class in sparksql)?
> Yes. The filtering should work as well. It's the same technique used by
> the writer.
>
> >> 4.
> >> Is there any way we can use
> >>   spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> >>     classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
> >>     classOf[org.apache.hadoop.fs.PathFilter]);
> >> in the Spark Thrift Server when starting it?
> I am not familiar with the Spark Thrift Server myself. Any pointers where
> I can learn more? Two suggestions:
> - You can check whether you can add this to the Hadoop configuration xml
>   files and see if it gets picked up by Spark.
> - Alternatively, you can set the spark config mentioned here:
>   http://hudi.apache.org/querying_data.html#spark-rt-view (works for the
>   ro view also), which I am assuming should be doable at this thrift
>   server.
>
> Thanks
> Vinoth
>
> On Wed, Apr 17, 2019 at 12:08 AM Jun Zhu <[email protected]> wrote:
>
> > Hi,
> > Link: https://github.com/apache/incubator-hudi/issues/639
> > Sorry, I failed to open
> > https://lists.apache.org/[email protected].
> > I have some follow-up questions for issue 639:
> >
> > > So, the sequence of events is: we write parquet files and then, upon
> > > successful writing of all attempted parquet files, we actually mark
> > > the commit as completed (i.e. not inflight anymore). So this is
> > > normal. This is done to prevent queries from reading partially
> > > written parquet files.
> >
> > Does that mean:
> > 1. Some inflight files may never reach commit?
> > 2. In the case where an inflight commit and the parquet files it
> >    generated still exist, global dedup will not dedup based on such
> >    files?
> > 3. In the case where an inflight commit and its parquet files still
> >    exist, the correct query result will be decided by the read config
> >    (I mean mapreduce.input.pathFilter.class in sparksql)?
> > 4.
> > Is there any way we can use
> >   spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> >     classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
> >     classOf[org.apache.hadoop.fs.PathFilter]);
> > in the Spark Thrift Server when starting it?
> >
> > Best,
> > --
> > Jun Zhu
> > Sr. Engineer I, Data
> > +86 18565739171
> > Units 3801, 3804, 38F, C Block, Beijing Yintai Center, Beijing, China
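On the Thrift Server question above, one possible way to pass the path filter at startup (a sketch, not verified against this Spark/Hudi version; the jar path is a placeholder) relies on Spark copying `spark.hadoop.*` settings into the Hadoop Configuration it hands to input formats:

```shell
# Sketch: start the Spark Thrift Server with the Hudi path filter set via
# the spark.hadoop.* prefix, which Spark forwards into the Hadoop conf.
# The bundle jar path below is a placeholder, not a real location.
./sbin/start-thriftserver.sh \
  --jars /path/to/hoodie-hadoop-mr-bundle.jar \
  --conf spark.hadoop.mapreduce.input.pathFilter.class=com.uber.hoodie.hadoop.HoodieROTablePathFilter
```

Whether this takes effect for queries served by the Thrift Server would still need to be confirmed, since it depends on the server reusing that Hadoop configuration for its table scans.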
