Yes, fixed.

On Thu, May 9, 2019 at 6:13 AM Vinoth Chandar <[email protected]> wrote:
Images don't render on the mailing list. :( Seems like the issue is fixed now?

On Tue, May 7, 2019 at 10:15 PM Jun Zhu <[email protected]> wrote:

Hi,
I ran the new code pulled from the master branch and compared it with another streaming job running Hudi 0.4.5 from Maven, both running on a 10-minute interval. The rollback worked. Top is 0.4.5, bottom is 0.4.6:
[image: Screen Shot 2019-05-08 at 1.06.17 PM.png]
[image: Screen Shot 2019-05-08 at 1.06.54 PM.png]
About the log: I rewrote the log.trace calls to log.error to keep trace-level logging from exploding the log volume, and there was nothing in the errors variable:

    19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: insert failed with 1 errors :
    19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: Printing out the top 100 errors
    ....spark log....
    19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: Global error :
    .....spark log....
    19/05/07 07:10:40 ERROR HoodieSparkSQLWriter: insert failed with 1 errors :
    19/05/07 07:10:40 ERROR HoodieSparkSQLWriter: Printing out the top 100 errors

Thanks,
Jun

On Sat, May 4, 2019 at 11:31 AM Vinoth Chandar <[email protected]> wrote:

No worries. This just landed on master, you can give it a shot. You'll also end up picking up interval-tree-based filtering for the global index, which will speed things along a lot. FYI.

Have a good holiday!

Thanks,
Vinoth

On Fri, May 3, 2019 at 7:19 PM Jun Zhu <[email protected]> wrote:

Hi team,
I will try that, thank you so much. Sorry for the late reply, I just had a holiday in China 😅.

Thanks,
Jun

On Wed, May 1, 2019 at 7:08 PM Vinoth Chandar <[email protected]> wrote:

Hi Jun,

I was able to track down that HoodieSparkSQLWriter (the common path for the streaming sink and the batch datasource) ends up calling DataSourceUtils.createHoodieClient, which creates the client as follows:

    return new HoodieWriteClient<>(jssc, writeConfig);

There is a third parameter that denotes whether the writer needs to roll back inflights. For example, DeltaStreamer invokes:

    HoodieWriteClient client = new HoodieWriteClient<>(jssc, hoodieCfg, true);

While I trace down why we had this difference, could you try changing this one line here, adding "true" as the third argument, and give it a shot?

https://github.com/apache/incubator-hudi/blob/b34a204a527a156406908686e54484a0c3d8a3d7/hoodie-spark/src/main/java/com/uber/hoodie/DataSourceUtils.java#L148
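Concretely, assuming the constructor matches the DeltaStreamer call shown above, the edited return statement would look like:

    // third argument = rollback inflights: clean up leftover inflight commits before writing
    return new HoodieWriteClient<>(jssc, writeConfig, true);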
Thanks,
Vinoth

On Tue, Apr 30, 2019 at 11:16 PM [email protected] <[email protected]> wrote:

Hi Jun,
You had mentioned that you are seeing the log message "insert failed with 1 errors". Did you see any exception stack traces before this message? You can also take a look at the Spark UI for the stdout/stderr of failed tasks (if present).
Also, it looks like if you enable "trace" level logging, you would see the exceptions getting logged at the end. So, enabling "trace" level logging is another way to debug what is happening:

'''
log.error(s"$operation failed with ${errorCount} errors :");
if (log.isTraceEnabled) {
  log.trace("Printing out the top 100 errors")
  .......
}
'''
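If you want trace output for just the Hudi classes rather than everything, a log4j.properties entry along these lines should work (a sketch, assuming the stock log4j 1.x setup that Spark ships with):

'''
# Raise verbosity only for the Hudi packages; leave the root logger alone
log4j.logger.com.uber.hoodie=TRACE
'''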
Balaji.V

On Tuesday, April 30, 2019, 8:17:57 AM PDT, Vinoth Chandar <[email protected]> wrote:

Hi Jun,

Basically you are saying the streaming path leaves some inflights behind.. let me see if I can reproduce it. If you have a simple test case, please share.

Thanks,
Vinoth

On Tue, Apr 30, 2019 at 1:04 AM Jun Zhu <[email protected]> wrote:

Hi Vinoth,
In the Spark Streaming log I found "2019-04-30 03:26:11 ERROR HoodieSparkSQLWriter:182 - insert failed with 1 errors :" (no further error logs), during which the commit ended up inflight and was not cleaned.
Just for feedback: we can dedup data correctly the batch way. I think more logic should be added for exception handling when using Spark Streaming.

Regards,
Jun

On Tue, Apr 30, 2019 at 2:46 AM Vinoth Chandar <[email protected]> wrote:

Another option to try would be setting spark.sql.hive.convertMetastoreParquet=false, if you are querying via the Hive table registered by Hudi.
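For instance, when launching the shell this can be passed on the command line (a sketch; the same --conf flag works for spark-submit):

    spark-shell --conf spark.sql.hive.convertMetastoreParquet=false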
On Sat, Apr 27, 2019 at 7:02 PM Jun Zhu <[email protected]> wrote:

Thanks for the explanation Vinoth. The code was the same as listed in https://github.com/apache/incubator-hudi/issues/639, with the table format set to `.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)`, and the resulting data stored on AWS S3.
I will try more with

    spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
      classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
      classOf[org.apache.hadoop.fs.PathFilter]);

From the phenomenon, maybe the config did not take effect.

On Sat, Apr 27, 2019 at 12:09 AM Vinoth Chandar <[email protected]> wrote:

Hi,

>> The duplicates were found in inflight commit parquet files. Wondering if this was expected?

Spark shell should not even be reading in-flight parquet files. Can you double check if the Spark access is properly configured?
http://hudi.apache.org/querying_data.html#spark
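One quick way to see whether the path filter is taking effect is to compare a raw parquet read (which sees every file in the partition, including inflight ones) before and after registering the filter, grouping by the Hudi record key (a sketch for spark-shell; the paths are placeholders, _hoodie_record_key is the Hudi metadata column):

    // raw read of the partition path: without the filter, this sees inflight files too
    val raw = spark.read.parquet("s3://<bucket>/<table_base_path>/<partition>")
    // duplicate record keys; with HoodieROTablePathFilter applied these should disappear
    raw.groupBy("_hoodie_record_key").count().filter("count > 1").show()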
Inflight commits should be rolled back at the start of the next commit/delta commit.. not sure why there are so many inflight delta commits. If you can give a reproducible case, happy to debug it more.

Only complete instants are archived.. so yes, inflight is not archived.

Hope that helps.

Thanks,
Vinoth

On Fri, Apr 26, 2019 at 2:09 AM Jun Zhu <[email protected]> wrote:

Hi Vinoth,
Some follow-up questions on this thread. Here is what I found after running for a few days:
In the .hoodie folder, perhaps due to the retain policy, there is an obvious dividing line (listed at the end of this email). Before it, the cleaned commits were archived, and I find duplication when querying, via spark-shell, the partition corresponding to an inflight commit. After the line, everything behaves normally and global dedup works.
The duplicates were found in inflight commit parquet files. Wondering if this was expected?
Questions:
1. An inflight commit should be rolled back by the next write. Is it normal that so many inflight commits did not make it? Or can I configure a retain policy that rolls back inflights another way?
2. Does the commit retain policy not archive inflight commits?

2019-04-23 20:23:47     378     20190423122339.deltacommit.inflight
2019-04-23 20:43:53     378     20190423124343.deltacommit.inflight
2019-04-23 22:14:04     378     20190423141354.deltacommit.inflight
2019-04-23 22:44:09     378     20190423144400.deltacommit.inflight
2019-04-23 22:54:18     378     20190423145408.deltacommit.inflight
2019-04-23 23:04:09     378     20190423150400.deltacommit.inflight
2019-04-23 23:24:30     378     20190423152421.deltacommit.inflight
*2019-04-23 23:44:34    378     20190423154424.deltacommit.inflight*
*2019-04-24 00:15:46    2991    20190423161431.clean*
2019-04-24 00:15:21     870536  20190423161431.deltacommit
2019-04-24 00:25:19     2991    20190423162424.clean
2019-04-24 00:25:09     875825  20190423162424.deltacommit
2019-04-24 00:35:26     2991    20190423163429.clean
2019-04-24 00:35:18     881925  20190423163429.deltacommit
2019-04-24 00:46:14     2991    20190423164428.clean
2019-04-24 00:45:44     888025  20190423164428.deltacommit

Thanks,
Jun

On 2019/04/18 14:29:23, Vinoth Chandar <[email protected]> wrote:

Hi Jun,

Responses below.

>> 1. Some files inflight may never reach commit?
Yes. The next attempt at writing will first issue a rollback to clean up such partial/leftover files, before it begins the new commit.

>> 2. In the case where an inflight commit and the parquet files it generated still exist, will global dedup fail to dedup against such files?
Even if not rolled back, we check the inflight parquet files against the committed timeline, which they won't be a part of. So it should be safe.

>> 3. In the case where an inflight commit and the parquet files it generated still exist, will the correct query result depend on the read config (I mean mapreduce.input.pathFilter.class in Spark SQL)?
Yes. The filtering should work as well; it is the same technique used by the writer.

>> 4. Is there any way we can use spark.sparkContext.hadoopConfiguration.setClass(...) in the Spark thrift server when starting it?
I am not familiar with the Spark thrift server myself. Any pointers where I can learn more?
Two suggestions:
- You can check if you can add this to the Hadoop configuration xml files and see if it gets picked up by Spark.
- Alternatively, you can set the spark config mentioned at http://hudi.apache.org/querying_data.html#spark-rt-view (works for the ro view also), which I am assuming should be doable for this thrift server.
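On that second suggestion: since Spark copies any spark.hadoop.* property into the underlying Hadoop Configuration, one more thing worth trying is passing the path filter when starting the thrift server (a sketch; not verified against the thrift server specifically):

    start-thriftserver.sh --conf spark.hadoop.mapreduce.input.pathFilter.class=com.uber.hoodie.hadoop.HoodieROTablePathFilter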
Thanks,
Vinoth

On Wed, Apr 17, 2019 at 12:08 AM Jun Zhu <[email protected]> wrote:

Hi,
Link: https://github.com/apache/incubator-hudi/issues/639
Sorry, I failed to open https://lists.apache.org/[email protected].
I have some follow-up questions for issue 639:

>> So, the sequence of events is: we write parquet files, and then upon successful writing of all attempted parquet files, we actually mark the commit as completed (i.e. not inflight anymore). So this is normal. This is done to prevent queries from reading partially written parquet files.

Does that mean:
1. Some files inflight may never reach commit?
2. In the case where an inflight commit and the parquet files it generated still exist, will global dedup fail to dedup against such files?
3. In the case where an inflight commit and the parquet files it generated still exist, will the correct query result depend on the read config (I mean mapreduce.input.pathFilter.class in Spark SQL)?
4. Is there any way we can use

    spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
      classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
      classOf[org.apache.hadoop.fs.PathFilter]);

in the Spark thrift server when starting it?

Best,
Jun Zhu
Sr. Engineer I, Data
