(wangzhijiang999)
Cc: jpreisner ; user
Subject: Re: Need help to understand memory consumption
Hi Zhijiang,
Does the memory management apply to streaming jobs as well? A previous post [1]
said that it can only be used with the batch API, but I might have missed some
updates on that. Thank you!
[1] https
Hi Julien,
Flink manages by default a 70% fraction of the free memory in the TaskManager
for caching data efficiently, just as mentioned in the article
"https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html".
This managed memory is permanently resident and referenced b
The channels are mapped to subpartition indexes, and each subpartition is
consumed by one specific parallel downstream task.
For example, if there are three parallel reduce tasks, every map task generates
three subpartitions. If one record is hashed to the first channel, that means
this record wi
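As a hedged illustration of that mapping (a sketch, not Flink's actual ChannelSelector code), a hash partitioner could pick the subpartition index like this:

// Sketch: map a record's key to one of the downstream subpartitions,
// e.g. 3 parallel reduce tasks => 3 subpartitions per map task.
public final class HashChannelSelector {

    // Returns the subpartition index for the given key.
    static int selectChannel(Object key, int numSubpartitions) {
        // Mask the sign bit so the index is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numSubpartitions;
    }

    public static void main(String[] args) {
        // With 3 reduce tasks, every record lands in subpartition 0, 1 or 2.
        System.out.println(selectChannel("some-record-key", 3));
    }
}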
The checkpoint duration includes both barrier alignment and the state snapshot.
Every task has to receive the barriers from all of its channels before it
triggers the state snapshot.
I guess the barrier alignment may take a long time in your case; it is
especially critical during backpressu
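If alignment is slow under backpressure, one hedged mitigation (assuming the DataStream API; the interval and timeout values are only examples) is to give checkpoints more headroom:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointHeadroom {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 s.
        env.enableCheckpointing(60_000);

        // Allow up to 10 minutes before a checkpoint is declared failed, so
        // slow barrier alignment under backpressure does not abort it.
        env.getCheckpointConfig().setCheckpointTimeout(600_000);
    }
}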
This deadlock does exist in some special scenarios.
Until the bug is fixed, we can work around the issue by not deploying the map
and sink tasks in the same task manager.
Krishna, do you share the slot for these two tasks? If so, you can disable
slot sharing for this job, as sketched below.
Or
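A hedged sketch of disabling slot sharing (DataStream API; the operators and group names are illustrative, not the actual job):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SeparateSlots {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
                // Put the map task into its own slot sharing group ...
                .map(s -> s.toUpperCase()).slotSharingGroup("map-group")
                // ... and the sink into a different one, so the two tasks
                // can never share a slot.
                .print().slotSharingGroup("sink-group");

        env.execute("separate-slots");
    }
}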
I agree with dropping it. +1
--
From: Jeff Carter
Sent: September 29, 2018 (Saturday) 10:18
To: dev
Cc: chesnay ; Till Rohrmann ; user
Subject: Re: [DISCUSS] Dropping flink-storm?
+1 to drop it.
On Fri, Sep 28, 2018, 7:25 PM Hequn Cheng wrote:
> H
nnel buffer is not enough, at least we should see this value be 1.
The main reason I care about this is that I want to find a metric to monitor
it. You just mentioned the autoread flag. Do you think monitoring this flag in
the input channel is a good choice?
Thanks,
aitozi.
Zhijiang(wangzhijiang999) wrote:
Hi,
The inputQueueLength indicates how many buffers have been received and queued.
But if the buffers in the queue are events (like barriers), they are not
counted in the inPoolUsage.
So in your case it may be normal for these two metrics to differ. If you
monitored autoread=false in downstr
Hi,
1. This RPC timeout occurs while the JobMaster is deploying a task into the
TaskExecutor: the RPC thread in the TaskExecutor does not respond to the
deployment message within 10 seconds. There are many possible causes, such as
a network problem between the TaskExecutor and the JobMaster, or other time
Hi,
I think the problem in the attached image is not the root cause of your job
failure. There must be other task or TaskManager failures; the job manager then
cancels all the related tasks, and the problem in the attached image is just a
consequence of that cancellation.
You can review the log of jo
To: wangzhijiang999
Cc: chesnay ; user
Subject: Re: Backpressure? for Batches
Thanks,
I thought the group-by might be causing the OOM, but it is mentioned that the
sort will be spilled to disk, so there is no way for that to cause the OOM. So
I was wondering whether, due to backpressure, some of the data read from hdfs is
(the group-by in your case) or decrease the parallelism of the fast node (the
source in your case).
Best,
Zhijiang
--
From: Darshan Singh
Sent: August 29, 2018 (Wednesday) 18:16
To: chesnay
Cc: wangzhijiang999 ; user
Subject: Re: Backpressure? for Batches
Thanks, Now
Backpressure arises when the downstream and upstream tasks run concurrently
and the downstream is slower than the upstream.
In a streaming job, the schedule mode schedules both sides concurrently, so
backpressure may occur.
As for batch jobs, the default schedule mode is LAZY_FROM_SOURCE
Hi,
How do you reduce the speed to avoid this issue? Do you mean reducing the
parallelism of the source or of the downstream tasks?
As I know, data buffering is managed by Flink's internal buffer pool and memory
manager, so it will not cause an OOM issue.
I just wonder whether the OOM may be caused by temporary byte b
Hi Steffen,
This exception indicates that when the downstream task requested a partition
from the upstream task, the upstream task had not yet initialized and
registered its result partition.
In this case, the downstream task inquires about the state from the job
manager and then retries requesting the partition fr
Sent: July 17, 2018 (Tuesday) 21:53
To: piotr
Cc: fhueske ; wangzhijiang999 ; user ; nico
Subject: Re: Flink job hangs/deadlocks (possibly related to out of memory)
Yes, I'm using Flink 1.5.0 and what I'm serializing is a really big record
(probably too big, we have already started working to
framework. You can also monitor the GC status to check the full-GC delay.
Best,
Zhijiang
--
From: Gerard Garcia
Sent: July 13, 2018 (Friday) 16:22
To: wangzhijiang999
Cc: user
Subject: Re: Flink job hangs/deadlocks (possibly related to out of m
trics for some help.
--
From: Vishal Santoshi
Sent: July 6, 2018 (Friday) 22:05
To: Zhijiang(wangzhijiang999)
Cc: user
Subject: Re: Limiting in flight data
Further, if there are metrics that allow us to chart delays per pipe on n/w
buffers, that would be immensely helpful
Hi Mich,
Since flink-1.5.0 the network flow control has been improved by a credit-based
mechanism, which handles backpressure better than before. The producer sends
data based on the number of available buffers (credits) on the consumer side.
If the processing time on the consumer side is slower than the producing time on
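The per-channel credits correspond to configurable buffer pools. As a hedged sketch (the keys are the Flink 1.5 network options; the values are only examples), they can be tuned on a local environment like this:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NetworkBufferTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Exclusive buffers per incoming channel (the fixed credits).
        conf.setInteger("taskmanager.network.memory.buffers-per-channel", 2);
        // Floating buffers shared by all channels of one input gate.
        conf.setInteger("taskmanager.network.memory.floating-buffers-per-gate", 8);

        // Jobs built on this environment run with the tuned buffer pools;
        // on a real cluster the same keys go into flink-conf.yaml.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);
    }
}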
Hi Vishal,
Before Flink-1.5.0, the sender tried its best to send data on the network until
the wire was filled with data. From Flink-1.5.0 the network flow control is
improved by a credit-based idea: the sender transfers data based on how many
buffers are available on the receiver side, so there wil
to
trigger restarting the job.
Zhijiang
--
From: Gerard Garcia
Sent: July 2, 2018 (Monday) 18:29
To: wangzhijiang999
Cc: user
Subject: Re: Flink job hangs/deadlocks (possibly related to out of memory)
Thanks Zhijiang,
We haven't found any
Hi Gerard,
From the stack below, it can only be seen that the task was canceled, which may
have been triggered by the job manager because of another task's failure. If
the task cannot be interrupted within the configured timeout, the task manager
process will exit. Do you see any OutOfMemory messages from the task
Hi Osh,
As I know, currently one dataset source cannot be consumed by several
different vertices, and from the API you cannot construct the topology you are
asking for.
I think your way of merging the different reduce functions into one UDF is
feasible. Maybe someone has a better solution. :)
zhijiang
Backpressure indeed delays the checkpoints because of the in-flight network
buffers that gradually accumulate before barrier alignment. As Piotr
explained, 1.5 improves this to some extent. After 1.5 we plan to further
speed up the checkpoint by controlling the channel reader to imp
Hi Pawel,
On the sender side, the data transfer goes through the following steps:
operator collects record --> serialize into flink buffer --> copy to netty
buffer --> flush to socket.
On the receiver side: socket --> netty --> flink buffer --> deserialize to
record --> operator processes it.
On the receiver side, if the
Based on Kurt's scenario, if the cumulator allocates a big ByteBuf from the
ByteBufAllocator during expansion, it can easily result in creating a new
PoolChunk (16 MB) because there is no contiguous memory in the current
PoolChunks. This causes the total used direct memory to exceed the estimate.
For further
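A hedged sketch of that effect with Netty's pooled allocator (the buffer sizes are illustrative; whether a fresh chunk is created depends on the arena state):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;

public class ChunkGrowth {
    public static void main(String[] args) {
        ByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;

        // A small cumulation buffer, as used while assembling a record.
        ByteBuf cumulation = alloc.directBuffer(1024);

        // Expanding to a much larger buffer may not fit into the free space
        // of existing PoolChunks, forcing the allocator to create a new
        // 16 MB chunk, so used direct memory can exceed the estimate.
        ByteBuf expanded = alloc.directBuffer(8 * 1024 * 1024);
        expanded.writeBytes(cumulation);

        cumulation.release();
        expanded.release();
    }
}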
Hi Ray,
Regarding your question: does each parallel task inside the TaskManager talk
to all parallel tasks inside the same TaskManager, or to all parallel tasks
across all task managers? Each task will talk to all of its parallel upstream
and downstream tasks, both inside the same TaskMana
Hi albert,
As I know, if the upstream data will be consumed by multiple consumers, it
generates multiple subpartitions, and each subpartition corresponds to one
input channel consumer. So there is a one-to-one correspondence among
subpartition -> subpartition view -> input channel.
cheers,
Hi Shannon,
Have you tried to increase the total memory size for the task manager
container? Maybe the maximum memory requirement is beyond your current setting.
You should also check that your UDF does not consume ever-increasing memory
that is never recycled.
If your UDF
nager
- Un-registering task and sending final execution state
CANCELED to JobManager for task Sink: /tmp/flink/events_invalid HDFS sink
(477db6e41932ad9b60c72e14de4488ed)
Best,
Jürgen
On 13.04.2017 07:05, Zhijiang(wangzhijiang999)
Hi Jürgen,
You can set the timeout in the configuration via the key "akka.ask.timeout";
the current default value is 10 s. I hope this helps.
cheers, zhijiang
--
From: Jürgen Thomann
Sent: April 12, 2017 (Wednesday) 19:04
To: user
Subject:
you find the reason for your case. Looking forward to your sharing!
cheers, Zhijiang
--
From: Kamil Dziublinski
Sent: April 5, 2017 (Wednesday) 16:07
To: Zhijiang(wangzhijiang999)
Cc: user
Subject: Re: PartitionNotFoundException on deploying streaming job
Ok thanks I
Hi Kamil,
When the producer receives the PartitionRequest from the downstream task, it
first checks whether the requested partition is already registered. If not, it
responds with PartitionNotFoundException. When the upstream task is submitted
and begins to run, it will register all its par
Hi lining, The records are serialized and written into the buffer. The buffer
is sent out once it is full or the timeout is exceeded. So different records
may share the same buffer if the serialization result is smaller than the
buffer size; otherwise one record may span multiple buffers
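The "full or timeout" flushing is controllable; a minimal DataStream sketch (the 100 ms value is only an example):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferFlush {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Flush a network buffer after at most 100 ms even if it is not
        // full, trading some throughput for lower latency.
        env.setBufferTimeout(100);
    }
}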
-B1, A1-IntermediateResultPartition-B2, A2-IntermediateResultPartition-B1,
A2-IntermediateResultPartition-B2 in the right graph.
Cheers,
Zhijiang
--
From: lining jing
Sent: March 15, 2017 (Wednesday) 10:54
To: user ; Zhijiang(wangzhijiang999)
Subject: Re
Hi ,
I think there is no difference between JobVertex(A) and JobVertex(B). Because
JobVertex(C) is not shown in the right graph, it may mislead you. There should
be another intermediate result partition between JobVertex(B) and JobVertex(C)
for each parallel instance, and that is the same case
Hi Dominik,
As I know, the JobManager detects the failure of a TaskManager via the akka
watch mechanism. It is similar to a heartbeat or ping in the network stack.
You can refer to this link:
"https://cwiki.apache.org/confluence/display/FLINK/Akka+and+Actors".
Furthermore, the upstream and do
The log just indicates that the SignalHandler handled the kill signal and the
JobManager process exited; the reason cannot be determined from it. You may
check the container log from the node manager to see why it was killed.
Best,
Zhijiang
--
From: lini
Yes, it is really a critical problem for large batch jobs because unexpected
failures are a common case. We are already focusing on realizing the ideas
mentioned in FLIP-1 and hope to contribute them to flink in the coming months.
Best,
Zhijiang
--
From:
Celebi
Sent: May 24, 2016 (Tuesday) 01:19
To: user ; wangzhijiang999
Subject: Re: problem of sharing TCP connection when transferring data
On Mon, May 23, 2016 at 6:55 PM, wangzhijiang999 wrote:
> In summary, if one task sets autoread to false, then when it notifies the
> available buffer, there ar
application, and I would like to contact you further for professional advice.
Thank you again!
Zhijiang Wang
--
From: Ufuk Celebi
Sent: May 23, 2016 (Monday) 19:49
To: user ; wangzhijiang999
Subject: Re: problem of sharing TCP connection when
Hi,
I am confused about sharing a tcp connection for the same connectionID. If two
tasks share the same connection and there is no available buffer in the local
buffer pool of the first task, it will set autoread to false for the channel,
but that will affect the second task if it still
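The autoread flag in question is plain Netty channel configuration; a hedged sketch (the method names are illustrative helpers, not Flink code) of how a receiver could stop reading when its buffer pool is exhausted:

import io.netty.channel.Channel;

public class AutoReadControl {
    // When the local buffer pool is exhausted, stop reading from the TCP
    // channel; since the channel is shared, this also stalls every other
    // task multiplexed onto the same connection.
    static void onBufferPoolExhausted(Channel channel) {
        channel.config().setAutoRead(false);
    }

    // Once buffers are recycled, resume reading for all sharing tasks.
    static void onBufferAvailable(Channel channel) {
        channel.config().setAutoRead(true);
    }
}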
DataSet mapped1 = data.map(new MyMapper());
DataSet mapped2 = data.map(new AnotherMapper());
mapped1.join(mapped2).where(...).equalTo(...);
--
From: Ufuk Celebi
Sent: May 10, 2016 (Tuesday) 17:22
To: user ; wangzhijiang999
Subject: Re: Blocking or pipelined mode for
Hi,
As I reviewed the flink source code, if the ExecutionMode is set to
"Pipelined", the DataExchangeMode will be set to "Batch" when the
pipeline-breaking property is true for two-input or iteration situations, in
order to avoid deadlock. When the DataExchangeMode is set to "Batch", the ResultPartiti
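A minimal sketch of setting the execution mode discussed here (batch API):

import org.apache.flink.api.common.ExecutionMode;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ExchangeMode {
    public static void main(String[] args) {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Request pipelined data exchanges between operators; Flink falls
        // back to batch exchanges e.g. in certain two-input or iteration
        // topologies to avoid deadlocks.
        env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);
    }
}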
Hi Chiwan,
Thank you for the instant reply. When will the official flink-1.0 be released;
can you give a rough estimate? I am interested in new features of flink-1.0,
like the operator uid, in order to solve my current problem.
Regards,
Zhijiang Wang
Hi, where can I get a summary of the changes between flink-1.0 and flink-0.10?
Thank you in advance!
Best Regards,
Zhijiang Wang
Hi Stephan,
Thank you for the detailed explanation. As you said, my opinion is to keep the
task running during jobmanager failover, even if sending the status update
fails.
Regarding the first reason you mentioned, if I understand correctly, the key
issue is the status getting out of sync between the taskmanager and
Hi, As I know, when the TaskManager sends UpdateTaskExecutionState to the
JobManager, if the JobManager fails over and the future response fails, the
task will be failed. Is it feasible to retry sending UpdateTaskExecutionState
when the future response fails, until it succeeds? In JobManager HA mode, the
U
--
From: Ufuk Celebi
Sent: August 10, 2015 (Monday) 17:13
To: user , wangzhijiang999
Subject: Re: Re: how to understand the flink flow control
Good to hear. Answers to your questions are inline. I think we should move the
discussion to the dev@flink.a.o list though, if there are more questions.
On Mon, Aug 10
is right? Looking forward to your docs!
Best wishes,
Zhijiang Wang
--
From: Ufuk Celebi
Sent: August 7, 2015 (Friday) 21:07
To: user , wangzhijiang999
Subject: Re: how to understand the flink flow control
Hey Zhijiang Wang, I will update the docs next
As stated on the apache page, Flink's streaming runtime has natural flow
control: slow downstream operators backpressure faster upstream operators. How
should I understand flink's natural flow control? As I know, heron has a
backpressure mechanism: if some tasks process slowly, it will stop reading
from
--
From: Stephan Ewen
Sent: August 3, 2015 (Monday) 22:17
To: user , wangzhijiang999
Subject: Re: Re: thread model issue in TaskManager
- Communication to the TaskManager, or directly to the JobManager
- Network stack for shuffles to exchange data with other processes (as
exchanges go streaming and through
suggestions? Thank you!
--
From: Stephan Ewen
Sent: August 3, 2015 (Monday) 00:36
To: user
Cc: wangzhijiang999
Subject: Re: thread model issue in TaskManager
Here are some additional things you can do:
- For isolation between parallel tasks (within a
As I know, flink uses a thread model in the TaskManager, which means one
taskmanager process may run many different operator threads, and these threads
compete for the memory of the process. I know that flink has a MemoryManager
component in each taskManager, and it controls the localBufferPool of Inp