Images don't render on the mailing list. :( Seems like the issue is fixed now?
On Tue, May 7, 2019 at 10:15 PM Jun Zhu <[email protected]> wrote:

Hi,
I ran the new code pulled from the master branch and compared it with another streaming job running Hudi 0.4.5 from Maven; both run every 10 minutes. The rollback worked.
Top is 0.4.5, bottom is 0.4.6:
[image: Screen Shot 2019-05-08 at 1.06.17 PM.png]
[image: Screen Shot 2019-05-08 at 1.06.54 PM.png]
As for the log, I rewrote the log.trace calls to log.error to avoid the log exploding at trace level, and there is nothing in the variable:

    19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: insert failed with 1 errors :
    19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: Printing out the top 100 errors
    ....spark log....
    19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: Global error :
    .....spark log....
    19/05/07 07:10:40 ERROR HoodieSparkSQLWriter: insert failed with 1 errors :
    19/05/07 07:10:40 ERROR HoodieSparkSQLWriter: Printing out the top 100 errors

Thanks,
Jun

On Sat, May 4, 2019 at 11:31 AM Vinoth Chandar <[email protected]> wrote:

No worries. This just landed on master, you can give it a shot. You'll also end up picking up interval-tree-based filtering for the global index, which will speed things along a lot. FYI.

Have a good holiday!

Thanks
Vinoth

On Fri, May 3, 2019 at 7:19 PM Jun Zhu <[email protected]> wrote:

Hi team,
I will try that, thank you so much. Sorry for the late reply; we just had a holiday in China 😅.
Thanks
Jun

On Wed, May 1, 2019 at 7:08 PM Vinoth Chandar <[email protected]> wrote:

Hi Jun,

I was able to track down that HoodieSparkSQLWriter (the common path for the streaming sink and the batch datasource) ends up calling DataSourceUtils.createHoodieClient, which creates the client as follows:

    return new HoodieWriteClient<>(jssc, writeConfig);

There is a third parameter that denotes whether the writer needs to roll back inflights. For example, DeltaStreamer invokes:

    HoodieWriteClient client = new HoodieWriteClient<>(jssc, hoodieCfg, true);

While I trace down why we had this difference, could you try changing this one line here, adding "true" as the third argument, and give it a shot?

https://github.com/apache/incubator-hudi/blob/b34a204a527a156406908686e54484a0c3d8a3d7/hoodie-spark/src/main/java/com/uber/hoodie/DataSourceUtils.java#L148

Thanks
Vinoth
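For concreteness, a minimal sketch of the suggested change as a standalone Java helper. The class name RollbackAwareClientFactory is illustrative only; the actual fix is the one-line edit in DataSourceUtils.java linked above:

    import com.uber.hoodie.HoodieWriteClient;
    import com.uber.hoodie.config.HoodieWriteConfig;
    import org.apache.spark.api.java.JavaSparkContext;

    // Illustrative only: mirrors DataSourceUtils.createHoodieClient, but passes
    // "true" as the third constructor argument so the client rolls back leftover
    // inflight commits before starting the new write, as DeltaStreamer does.
    public class RollbackAwareClientFactory {
      public static HoodieWriteClient createHoodieClient(JavaSparkContext jssc,
          HoodieWriteConfig writeConfig) {
        return new HoodieWriteClient<>(jssc, writeConfig, true);
      }
    }

The third boolean simply mirrors DeltaStreamer's invocation shown above, so the streaming path gets the same cleanup behavior as the delta streamer path.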
On Tue, Apr 30, 2019 at 11:16 PM [email protected] <[email protected]> wrote:

Hi Jun,
You had mentioned that you are seeing the log message "insert failed with 1 errors". Did you see any exception stack traces before this message? You can also take a look at the Spark UI to see the stdout/stderr of failed tasks (if present).
Also, it looks like if you enable "trace" level logging, you would see the exceptions getting logged at the end, so enabling "trace" level logging is another way to debug what is happening:

'''
log.error(s"$operation failed with ${errorCount} errors :");
if (log.isTraceEnabled) {
  log.trace("Printing out the top 100 errors")
  .......
'''

Balaji.V

On Tuesday, April 30, 2019, 8:17:57 AM PDT, Vinoth Chandar <[email protected]> wrote:

Hi Jun,

Basically you are saying the streaming path leaves some inflights behind.. let me see if I can reproduce it. If you have a simple test case, please share.

Thanks
Vinoth

On Tue, Apr 30, 2019 at 1:04 AM Jun Zhu <[email protected]> wrote:

Hi Vinoth,
In the Spark streaming log I find "2019-04-30 03:26:11 ERROR HoodieSparkSQLWriter:182 - insert failed with 1 errors :" (no further error logs), during which the commit ended up inflight and was not cleaned.
Just for feedback: we can dedup data correctly the batch way. I think more logic should be added for exception handling when using Spark streaming.
Regards,
Jun

On Tue, Apr 30, 2019 at 2:46 AM Vinoth Chandar <[email protected]> wrote:

Another option to try would be setting spark.sql.hive.convertMetastoreParquet=false, if you are querying via the Hive table registered by Hudi.

On Sat, Apr 27, 2019 at 7:02 PM Jun Zhu <[email protected]> wrote:

Thanks for the explanation Vinoth. The code was the same as listed in https://github.com/apache/incubator-hudi/issues/639, with the table format set via `.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)`, and the resulting data stored on AWS S3.
I will try more with

    spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
      classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
      classOf[org.apache.hadoop.fs.PathFilter]);

From the phenomenon, maybe the config did not take effect.

On Sat, Apr 27, 2019 at 12:09 AM Vinoth Chandar <[email protected]> wrote:

Hi,

>> The duplicates were found in inflight commit parquet files. Wondering if this was expected?
The Spark shell should not even be reading in-flight parquet files. Can you double check that the Spark access is properly configured? http://hudi.apache.org/querying_data.html#spark

Inflights should be rolled back at the start of the next commit/delta commit.. not sure why there are so many inflight delta commits. If you can give a reproducible case, happy to debug it more..

Only complete instants are archived.. so yes, inflight instants are not archived..

Hope that helps

Thanks
Vinoth
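Taken together, the two read-side settings discussed in this exchange can be applied programmatically before querying. A hedged sketch (Java here for consistency with the other snippets; the class and app names are illustrative):

    import org.apache.hadoop.fs.PathFilter;
    import org.apache.spark.sql.SparkSession;
    import com.uber.hoodie.hadoop.HoodieROTablePathFilter;

    public class HudiReadSideConfig {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hudi-read-side-config") // illustrative app name
            .enableHiveSupport()
            .getOrCreate();

        // Filter out parquet files that are not part of the latest committed
        // view, so in-flight files never reach the query:
        spark.sparkContext().hadoopConfiguration().setClass(
            "mapreduce.input.pathFilter.class",
            HoodieROTablePathFilter.class,
            PathFilter.class);

        // Route reads through the Hive InputFormat instead of Spark's built-in
        // parquet reader when querying the Hive table registered by Hudi:
        spark.sql("set spark.sql.hive.convertMetastoreParquet=false");
      }
    }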
On Fri, Apr 26, 2019 at 2:09 AM Jun Zhu <[email protected]> wrote:

Hi Vinoth,
Some follow-up questions on this thread. Here is what I found after running for a few days: in the .hoodie folder there is an obvious dividing line (listed at the end of this email), maybe due to the retention policy. Before the line, the cleaned commits were archived, and I found duplication when querying the partitions corresponding to inflight commits via spark-shell. After the line, everything behaves normally and global dedup works.
The duplicates were found in inflight commit parquet files. Wondering if this was expected?
Q:
1. An inflight commit should be rolled back by the next write. Is it normal that so many inflight commits did not make it? Or can I configure a retention policy that turns inflights into rollbacks some other way?
2. Does the commit retention policy not archive inflight commits?

2019-04-23 20:23:47     378     20190423122339.deltacommit.inflight
2019-04-23 20:43:53     378     20190423124343.deltacommit.inflight
2019-04-23 22:14:04     378     20190423141354.deltacommit.inflight
2019-04-23 22:44:09     378     20190423144400.deltacommit.inflight
2019-04-23 22:54:18     378     20190423145408.deltacommit.inflight
2019-04-23 23:04:09     378     20190423150400.deltacommit.inflight
2019-04-23 23:24:30     378     20190423152421.deltacommit.inflight
*2019-04-23 23:44:34     378     20190423154424.deltacommit.inflight*
*2019-04-24 00:15:46     2991    20190423161431.clean*
2019-04-24 00:15:21     870536  20190423161431.deltacommit
2019-04-24 00:25:19     2991    20190423162424.clean
2019-04-24 00:25:09     875825  20190423162424.deltacommit
2019-04-24 00:35:26     2991    20190423163429.clean
2019-04-24 00:35:18     881925  20190423163429.deltacommit
2019-04-24 00:46:14     2991    20190423164428.clean
2019-04-24 00:45:44     888025  20190423164428.deltacommit

Thanks,
Jun
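A small diagnostic sketch along the lines of the listing above: it scans the table's .hoodie folder for leftover inflight delta commits. The base path is a placeholder; adjust the filesystem scheme and bucket for your setup:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class InflightInstantLister {
      public static void main(String[] args) throws Exception {
        String basePath = "s3a://your-bucket/your-table"; // hypothetical path
        FileSystem fs = FileSystem.get(URI.create(basePath), new Configuration());
        for (FileStatus f : fs.listStatus(new Path(basePath, ".hoodie"))) {
          String name = f.getPath().getName();
          // Inflight delta commits that were never completed or rolled back:
          if (name.endsWith(".deltacommit.inflight")) {
            System.out.println(f.getModificationTime() + "\t" + f.getLen() + "\t" + name);
          }
        }
      }
    }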
On 2019/04/18 14:29:23, Vinoth Chandar <[email protected]> wrote:

Hi Jun,

Responses below.

>> 1. Some file inflight may never reach commit?
Yes. The next attempt at writing will first issue a rollback to clean up such partial/leftover files before it begins the new commit.

>> 2. In the case where an inflight commit and the parquet files it generated still exist, global dedup will not dedup based on such files?
Even if not rolled back, we check the inflight parquet files against the committed timeline, which they won't be a part of. So it should be safe.

>> 3. In the case where an inflight commit and the parquet files it generated still exist, the correct query result is decided by the read config (I mean mapreduce.input.pathFilter.class in Spark SQL)?
Yes, the filtering should work as well. It's the same technique used by the writer.

>> 4. Is there any way we can use spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]); in the Spark thrift server when starting it?
I am not familiar with the Spark thrift server myself. Any pointers to where I can learn more? Two suggestions:
- You can check whether adding this to the Hadoop configuration XML files gets picked up by Spark.
- Alternatively, you can set the Spark config mentioned here: http://hudi.apache.org/querying_data.html#spark-rt-view (works for the RO view also), which I am assuming should be doable for the thrift server.

Thanks
Vinoth
On Wed, Apr 17, 2019 at 12:08 AM Jun Zhu <[email protected]> wrote:

Hi,
Link: https://github.com/apache/incubator-hudi/issues/639
Sorry, I failed to open https://lists.apache.org/[email protected].
I have some follow-up questions for issue 639:

> So, the sequence of events is: we write parquet files, and then upon successful writing of all attempted parquet files, we actually mark the commit as completed (i.e. not inflight anymore). So this is normal. This is done to prevent queries from reading partially written parquet files.

Does that mean:
1. Some file inflight may never reach commit?
2. In the case where an inflight commit and the parquet files it generated still exist, global dedup will not dedup based on such files?
3. In the case where an inflight commit and the parquet files it generated still exist, the correct query result is decided by the read config (I mean mapreduce.input.pathFilter.class in Spark SQL)?
4. Is there any way we can use

    spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
      classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
      classOf[org.apache.hadoop.fs.PathFilter]);

in the Spark thrift server when starting it?

Best,
Jun Zhu
Sr. Engineer I, Data
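Finally, a hedged sketch of the kind of duplicate check discussed in this thread: reading the raw parquet files (deliberately without the path filter, so inflight files are included) and grouping by Hudi's _hoodie_record_key metadata column to spot keys that appear more than once. The input glob is a placeholder:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.count;
    import static org.apache.spark.sql.functions.lit;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DuplicateKeyCheck {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hudi-duplicate-key-check") // illustrative
            .getOrCreate();

        // Hypothetical path/glob; _hoodie_record_key is the metadata column
        // Hudi writes into every record.
        Dataset<Row> df = spark.read().parquet("s3a://your-bucket/your-table/*/*");

        df.groupBy(col("_hoodie_record_key"))
          .agg(count(lit(1)).alias("cnt"))
          .filter(col("cnt").gt(1))
          .show(100, false);
      }
    }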
