Re:Re: The mapping relationship between Checkpoint subtask id and Task subtask id

2020-02-13 Thread Jiayi Liao



Hi Zhijiang,

It did confuse us when we were trying to locate the unfinished subtask in the
Checkpoint UI last time. I've created an issue[1] for this.

@杨东晓 Do you have time to work on this?

[1] https://issues.apache.org/jira/browse/FLINK-16051

Best Regards,
Jiayi Liao

At 2020-02-14 10:14:27, "Zhijiang"  wrote:

If the id is not consistent in different parts, it may be worth creating a
JIRA ticket to improve the user experience.
If anyone wants to work on it, please ping me and I can give a hand.


Best,
Zhijiang
--
From: Yun Tang
Send Time: 2020 Feb. 14 (Fri.) 10:52
To: 杨东晓 ; user
Subject: Re: The mapping relationship between Checkpoint subtask id and Task
subtask id


Hi


Yes, you are right. Simply using checkpoint subtask_id - 1 gives you the
corresponding task subtask_id.
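
To make the off-by-one explicit, a minimal sketch (the helper name is made
up; it assumes both views describe the same parallel instances):

public class SubtaskIdMapping {

    // The checkpoint UI numbers subtasks from 1, while the task view numbers
    // them from 0, so the mapping is a plain off-by-one shift.
    static int toTaskSubtaskId(int checkpointSubtaskId) {
        return checkpointSubtaskId - 1;
    }

    public static void main(String[] args) {
        // Checkpoint subtask 1 is task subtask 0, and so on.
        System.out.println(toTaskSubtaskId(1));
    }
}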


Best
Yun Tang

From: 杨东晓 
Sent: Friday, February 14, 2020 10:11
To: user 
Subject: The mapping relationship between Checkpoint subtask id and Task 
subtask id
 
Hi, I'm trying to figure out the end-to-end duration of each subtask id in a
checkpoint.
In the Flink web UI, I noticed that the job task subtask id starts from 0,
while the checkpoint subtask id starts from 1.
How can I find out which checkpoint subtask id belongs to which job task
subtask id? Will simply using checkpoint subtask ID - 1 be OK?
Thanks



Re: [ANNOUNCE] RocksDB Version Upgrade and Performance

2021-08-08 Thread Jiayi Liao
Hi Yun,

Thanks for your detailed description of the progress in the Flink and
RocksDB communities. There are more than 1,200 jobs using RocksDB as the
state backend at Bytedance, and we have indeed met several of the problems
mentioned in the JIRA issues you referred to:

(1) Memory Management: for large-scale jobs (10TB+ state), it's hard to tune
the memory usage due to RocksDB's non-strict memory control. Currently we
have to manually estimate the memory usage based on RocksDB's wiki, which
increases our maintenance cost a lot (see the first sketch after this list).
(2) DeleteRange Support: we've run a few benchmarks on rescaling performance
and found that the time cost goes up to a few minutes when a task's state is
larger than 10GB. I'm glad to see such improvements being merged after
upgrading RocksDB's version (a deleteRange sketch also follows below).
(3) ARM Support: we added support for the ARM platform on our own last year
by some hacking on the code, and it's great to see that RocksDB now has an
official release for the ARM platform.
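
For context on (1), Flink's own knob for bounding RocksDB memory looks
roughly like this; a minimal sketch assuming the Flink 1.10+ config keys
(state.backend.rocksdb.memory.managed and the commented alternative), with
made-up values. The pain point is that RocksDB only loosely honors this
budget:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedRocksDbMemory {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setString("state.backend", "rocksdb");
        // Let RocksDB share the slot's managed memory budget instead of
        // growing unbounded; without this we estimate usage by hand.
        conf.setString("state.backend.rocksdb.memory.managed", "true");
        // Alternatively, pin a fixed budget per slot (value is made up):
        // conf.setString("state.backend.rocksdb.memory.fixed-per-slot", "512mb");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);
        // ... build the job and call env.execute() as usual.
    }
}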
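
For (2), the RocksDB API in question is deleteRange, which drops a
contiguous key range with a single tombstone instead of deleting key by key.
A minimal RocksJava sketch, not Flink's actual restore code (path and keys
are made up):

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class DeleteRangeSketch {

    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/delete-range-sketch")) {
            db.put("kg-000|key".getBytes(), "value".getBytes());
            db.put("kg-042|key".getBytes(), "value".getBytes());
            // Delete everything in [begin, end) in one call instead of
            // scanning and deleting key by key, which is what makes
            // rescaling restores slow without this API.
            db.deleteRange("kg-000".getBytes(), "kg-064".getBytes());
        }
    }
}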

I think the new features (and bugfixes) are more important for us, and I'm +1
for this.


Best,
Jiayi Liao

On Thu, Aug 5, 2021 at 1:50 AM Yun Tang  wrote:

> Hi Yuval,
>
> Upgrading the RocksDB version has been a long story since Flink-1.10.
> When we first planned to introduce the write buffer manager to help control
> the memory usage of RocksDB, we actually wanted to bump up to RocksDB-5.18
> from the then-current RocksDB-5.17. However, we found a performance
> regression in our micro benchmark on state operations [1] when bumping to
> RocksDB-5.18. We did not figure out the root cause at that time and decided
> to cherry-pick the write buffer manager commits into our own FRocksDB [2].
> And we finally released our own frocksdbjni-5.17.2-artisans-2.0 at that
> time.
>
> As time goes on, more and more bugs or missing features have been reported
> in the old RocksDB version, such as:
>
>    1. Cannot support the ARM platform [3]
>    2. Does not have a stable deleteRange API, which is useful for Flink
>    scale-out [4]
>    3. Cannot support strict block cache [5]
>    4. Checkpoints might get stuck if using the UNIVERSAL compaction
>    strategy [6]
>    5. Uncontrolled log size made us disable the RocksDB internal LOG [7]
>    6. RocksDB's optimizeForPointLookup option might cause data loss [8]
>    7. The dummy entry used for memory control in RocksDB-5.17 is too
>    large, leading to performance problems [9]
>    8. Cannot support alpine-based images.
>    9. ...
>
> Some of the bugs have been worked around, and some are still open.
>
> And we decided to make some changes starting from Flink-1.12. First of all,
> we reported the performance regression between RocksDB-5.18 and
> RocksDB-5.17 to the RocksDB community [10]. However, as the RocksDB-5.x
> versions are a bit old for the community, and RocksJava usage might not be
> the core concern for the Facebook folks, we did not get useful replies.
> Thus, we decided to figure out the root cause of the performance regression
> by ourselves. Fortunately, we found the cause via a binary search over the
> commits between RocksDB-5.17 and RocksDB-5.18, and updated the original
> thread [10]. In short, the performance regression is due to the different
> implementations of `__thread` and `thread_local` in gcc, which has more
> impact under dynamic loading [11], and dynamic loading is how the current
> RocksJava jar package loads the native library. With my patch [12], the
> performance regression disappears when comparing RocksDB-5.18 with
> RocksDB-5.17.
>
> Unfortunately, RocksDB-5.18 still has many bugs and we wanted to bump to
> RocksDB-6.x. However, another performance regression appeared even with my
> patch [12]. With the previous experience, we knew that we must verify the
> built .so files with our Java-based benchmark instead of using RocksDB's
> built-in db_bench. I started to search the 1340+ commits from RocksDB-5.18
> to RocksDB-6.11 to find the performance problem. However, I did not figure
> out the root cause after spending several weeks this time. The performance
> goes up and down across those commits, and I could not pin down *the
> commit* that leads to the regression. Take the commit that integrates the
> block cache tracer into the block-based table reader [13] as an example: I
> noticed that this commit caused a slight performance regression, possibly
> from the useless usage accounting in operations, but the problematic code
> was changed again in later commits. Thus, after several weeks of digging, I
> had to give up the endless search through the thousands of commits, at
> least temporarily. As the RocksDB community does not seem to make its
> project management system public, unlike Apache's open JIRA system, we do
> not know what benchmarks they actually run before releasing each version to
> guarantee performance.
>
> With my patch [10] on latest RocksDB-6.20.3, we could get the results

Re:Apache Flink - Operator name and uuid best practices

2019-11-16 Thread Jiayi Liao
Hi Mans!

Firstly, let's look at how an operator's name and uid are used. AFAIK, the
operator's name is used in the WebUI and in metrics reporting, and the uid is
used to mark the uniqueness of an operator, which is useful when you're using
state[1].

> Are there any restrictions on the length of the name and uuid attributes?

It's pretty much the same as defining any string value, so there are no
special restrictions on this.

> Are there any restrictions on the characters used for name and uuid (blank
> spaces, etc)?

I'm not a hundred percent sure about this, but I ran a test program and it
works fine.
> Can the name and uuid be the same?

Yes. But uids across different operators cannot be the same.

For me, I usually set a name and uid for almost every operator, which gives
me a better experience in monitoring and scaling.
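
A minimal sketch of what that looks like on a DataStream job (operator names
and uids are made up):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NamedOperators {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
                .map(s -> s.toUpperCase())
                // name() shows up in the web UI and in metrics reporting.
                .name("uppercase-mapper")
                // uid() pins the operator's state when restoring from a
                // savepoint; it must be unique across operators.
                .uid("uppercase-mapper-uid")
                .print();

        env.execute("named-operators");
    }
}
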
Hope this helps.

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/upgrading.html#matching-operator-state

Best,
Jiayi Liao

At 2019-11-16 18:35:38, "M Singh"  wrote:

Hi:


I am working on a project and wanted to find out the best practices for
setting the name and uuid for operators:


1. Are there any restrictions on the length of the name and uuid attributes?
2. Are there any restrictions on the characters used for name and uuid (blank
spaces, etc)?
3. Can the name and uuid be the same?


Please let me know if there is any other advice.


Thanks


Mans

Re:Cron style for checkpoint

2019-11-20 Thread Jiayi Liao
Hi Shuwen,

As far as I know, Flink only supports checkpointing at a fixed interval.

However, I think a more flexible mechanism for triggering checkpoints is
worth working on, at least from my perspective. And it may not only be a cron
style: in our business scenario, the data traffic usually reaches the peak of
the day after 20:00, during which we want to increase the checkpoint
interval, since otherwise it introduces more disk and network IO.
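
For reference, the only built-in knob today is the fixed interval; a minimal
sketch with made-up values:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FixedIntervalCheckpoints {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // One interval for the whole lifetime of the job; there is no
        // built-in way to widen it during the evening traffic peak.
        env.enableCheckpointing(10 * 60 * 1000L);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60 * 1000L);
        // ... build the job and call env.execute() as usual.
    }
}
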
Just want to share something about this :)

Best,
Jiayi Liao

At 2019-11-21 10:20:47, "shuwen zhou"  wrote:
>Hi Community,
>I would like to know if there is an existing function to support cron-style
>checkpoints?
>The case is, our data traffic is huge at HH:30 every hour. We don't want a
>checkpoint to fall in that range of time. A cron expression like
>15,45 * * * * for checkpoints would be nice. If a checkpoint is already in
>progress when the minute is 15 or 45, there could be a config value to
>decide whether to trigger a new checkpoint or skip it.
>
>-- 
>Best Wishes,
>Shuwen Zhou <http://www.linkedin.com/pub/shuwen-zhou/57/55b/599/>