Re: Need help to understand memory consumption

2018-10-16 Thread Zhijiang(wangzhijiang999)
(wangzhijiang999) Cc: jpreisner; user Subject: Re: Need help to understand memory consumption Hi Zhijiang, Does the memory management apply to streaming jobs as well? A previous post [1] said that it can only be used in the batch API, but I might have missed some updates on that. Thank you! [1] https…

Re: Need help to understand memory consumption

2018-10-16 Thread Zhijiang(wangzhijiang999)
Hi Julien, Flink manages by default a 70% fraction of the free memory in the TaskManager for caching data efficiently, just as mentioned in the article "https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html". This managed memory is permanently resident and referenced by …
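
For reference, that fraction is controlled in the TaskManager configuration; a minimal flink-conf.yaml sketch (keys as in Flink 1.x, values shown are the defaults):

    # fraction of free memory the MemoryManager claims for managed memory;
    # 0.7 is the default 70% mentioned above
    taskmanager.memory.fraction: 0.7
    # if true, the managed memory is allocated eagerly at TaskManager start-up
    taskmanager.memory.preallocate: false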

Re: What are channels mapped to?

2018-10-11 Thread Zhijiang(wangzhijiang999)
The channels are mapped to subpartition indexes, each of which is consumed by one parallel instance of the downstream task. For example, if there are three parallel reduce tasks, every map task generates three subpartitions. If one record is hashed to the first channel, that means this record will …
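
A self-contained Java sketch of the idea (hypothetical names, not Flink's actual partitioner classes): the output channel, and therefore the subpartition and the consuming reduce task, is chosen by hashing the record key modulo the number of channels:

    // Hypothetical sketch: with 3 parallel reduce tasks downstream, each map
    // task writes 3 subpartitions; a record's channel is its key hash mod 3.
    public class ChannelSelectionDemo {

        static int selectChannel(Object key, int numChannels) {
            return Math.abs(key.hashCode() % numChannels);
        }

        public static void main(String[] args) {
            int numReduceTasks = 3;
            // always 0, 1 or 2 -> first, second or third reduce task
            System.out.println(selectChannel("some-key", numReduceTasks));
        }
    }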

Re: Small checkpoint data takes too much time

2018-10-09 Thread Zhijiang(wangzhijiang999)
The checkpoint duration includes the processes of barrier alignment and state snapshot. Every task has to receive all the barriers from all the channels, and then triggers the state snapshot. I guess the barrier alignment may take a long time in your case, and it is especially critical during backpressure …
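
A much-simplified Java sketch of what alignment implies for latency (hypothetical class, not Flink's actual CheckpointBarrierHandler): channels whose barrier has arrived are blocked until the slowest channel catches up, so one backpressured channel stretches the whole checkpoint:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch of barrier alignment: block each channel once its
    // barrier arrives; only when ALL channels have delivered their barrier
    // can the task trigger its state snapshot.
    class BarrierAligner {
        private final int numChannels;
        private final Set<Integer> blocked = new HashSet<>();

        BarrierAligner(int numChannels) { this.numChannels = numChannels; }

        /** Returns true when the state snapshot can be triggered. */
        boolean onBarrier(int channel) {
            blocked.add(channel);              // stop reading this channel
            if (blocked.size() == numChannels) {
                blocked.clear();               // resume all channels
                return true;                   // all barriers in: snapshot now
            }
            return false;                      // still waiting on slow channels
        }
    }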

Re: Memory Allocate/Deallocate related Thread Deadlock encountered when running a large job > 10k tasks

2018-10-08 Thread Zhijiang(wangzhijiang999)
This deadlock actually exists in special scenarios. Until the bug is fixed, we can work around the issue by not deploying the map and sink tasks in the same task manager. Krishna, do you share the slot for these two tasks? If so, you can disable slot sharing for this job. Or …
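
In the DataStream API the workaround can be expressed with slot sharing groups (slotSharingGroup(String) is the real API; the tiny job itself is a made-up sketch), which keeps map and sink out of the same slot:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingDemo {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
            env.socketTextStream("localhost", 9999)
               // map tasks live in their own slot sharing group ...
               .map(String::toUpperCase).slotSharingGroup("map-group")
               // ... and sink tasks in another, so they never share a slot
               .print().slotSharingGroup("sink-group");
            env.execute("slot-sharing-demo");
        }
    }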

Re: [DISCUSS] Dropping flink-storm?

2018-09-28 Thread Zhijiang(wangzhijiang999)
Agree with dropping it. +1 -- From: Jeff Carter Sent: September 29, 2018 (Saturday) 10:18 To: dev Cc: chesnay; Till Rohrmann; user Subject: Re: [DISCUSS] Dropping flink-storm? +1 to drop it. On Fri, Sep 28, 2018, 7:25 PM Hequn Cheng wrote: > H…

Re: Re: InpoolUsage & InpoolBuffers inconsistence

2018-09-18 Thread Zhijiang(wangzhijiang999)
…channel buffer is not enough, at least we should see this value be 1. The main reason I care about this is that I want to find a metric to monitor it. You just mentioned the autoread flag. Do you think monitoring this flag in the input channel is a good choice? Thanks, aitozi. Zhijiang(wangzhijiang999) wr…

Re: InpoolUsage & InpoolBuffers inconsistence

2018-09-17 Thread Zhijiang(wangzhijiang999)
Hi, the inpoolQueueLength indicates how many buffers are received and queued. But if the buffers in the queue are events (like a barrier), they are not counted in the inpoolUsage. So in your case it may be normal for these two metrics to differ. If you observed autoread=false in the downstr…

Re: Flink application down due to RpcTimeout exception

2018-09-12 Thread Zhijiang(wangzhijiang999)
Hi, 1. This RPC timeout occurs while the JobMaster is deploying a task into the TaskExecutor: the RPC thread in the TaskExecutor does not respond to the deployment message within 10 seconds. There are many possible causes for this issue, such as a network problem between TaskExecutor and JobMaster, or other time…

Re: Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.

2018-09-07 Thread Zhijiang(wangzhijiang999)
Hi, I think the problem in the attached image is not the root cause of your job failure. There must be some other task or TaskManager failure; all the related tasks are then cancelled by the job manager, and the problem in the attached image is just a consequence of the task being cancelled. You can review the log of the jo…

Re: Backpressure? for Batches

2018-08-29 Thread Zhijiang(wangzhijiang999)
…To: wangzhijiang999 Cc: chesnay; user Subject: Re: Backpressure? for Batches Thanks, I thought the group-by was causing the OOM, but it is mentioned that the sort will spill to disk, so there is no way for that to cause the OOM. So I was wondering whether, due to backpressure, some of the data read from HDFS is…

Re: Backpressure? for Batches

2018-08-29 Thread Zhijiang(wangzhijiang999)
…(group-by in your case) or decrease the parallelism of the fast node (source in your case). Best, Zhijiang -- From: Darshan Singh Sent: August 29, 2018 (Wednesday) 18:16 To: chesnay Cc: wangzhijiang999; user Subject: Re: Backpressure? for Batches Thanks, Now…

Re: Backpressure? for Batches

2018-08-29 Thread Zhijiang(wangzhijiang999)
Backpressure arises when downstream and upstream are running concurrently and the downstream is slower than the upstream. In a streaming job, the schedule mode schedules both sides concurrently, so backpressure may exist. As for a batch job, the default schedule mode is LAZY_FROM_SOURCE…
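
The exchange mode is also selectable in the DataSet API; a minimal sketch (setExecutionMode is the real ExecutionConfig API, the tiny job is made up) that forces fully blocking exchanges, in which producer and consumer never run concurrently:

    import org.apache.flink.api.common.ExecutionMode;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class BatchExchangeDemo {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // PIPELINED (the default) lets producer and consumer overlap;
            // BATCH materializes intermediate results first, so a slow
            // consumer cannot backpressure its producer.
            env.getConfig().setExecutionMode(ExecutionMode.BATCH);
            env.fromElements(1, 2, 3).map(i -> i * 2).print();
        }
    }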

Re: Kryo Serialization Issue

2018-08-28 Thread Zhijiang(wangzhijiang999)
Hi, how do you reduce the speed to avoid this issue? Do you mean reducing the parallelism of the source or of the downstream tasks? As far as I know, data buffering is managed by Flink's internal buffer pool and memory manager, so it should not cause an OOM issue. I wonder whether the OOM may be caused by temporary byte b…

Re: Network PartitionNotFoundException when run on multi nodes

2018-07-22 Thread Zhijiang(wangzhijiang999)
Hi Steffen, this exception indicates that when the downstream task requests a partition from the upstream task, the upstream task has not yet been initialized and registered its result partition. In this case, the downstream task will inquire about the state from the job manager and then retry the partition request fr…
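
That retry loop is configurable; a flink-conf.yaml sketch (keys as introduced around Flink 1.5; the values shown are the defaults as I recall them, both in milliseconds):

    # backoff for the downstream task's repeated partition requests before it
    # finally gives up with PartitionNotFoundException
    taskmanager.network.request-backoff.initial: 100
    taskmanager.network.request-backoff.max: 10000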

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-17 Thread Zhijiang(wangzhijiang999)
…Sent: July 17, 2018 (Tuesday) 21:53 To: piotr Cc: fhueske; wangzhijiang999; user; nico Subject: Re: Flink job hangs/deadlocks (possibly related to out of memory) Yes, I'm using Flink 1.5.0 and what I'm serializing is a really big record (probably too big; we have already started working to…

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-13 Thread Zhijiang(wangzhijiang999)
…framework. Also, you can monitor the GC status to check the full-GC delay. Best, Zhijiang -- From: Gerard Garcia Sent: July 13, 2018 (Friday) 16:22 To: wangzhijiang999 Cc: user Subject: Re: Flink job hangs/deadlocks (possibly related to out of m…

Re: Limiting in flight data

2018-07-08 Thread Zhijiang(wangzhijiang999)
…metrics for some help. -- From: Vishal Santoshi Sent: July 6, 2018 (Friday) 22:05 To: Zhijiang(wangzhijiang999) Cc: user Subject: Re: Limiting in flight data Further, if there are metrics that allow us to chart delays per pipe on network buffers, that would be immensely help…

Re: Handling back pressure in Flink.

2018-07-05 Thread Zhijiang(wangzhijiang999)
Hi Mich, since Flink 1.5.0 the network flow control has been improved by a credit-based mechanism, which handles backpressure better than before. The producer sends data based on the number of available buffers (credits) on the consumer side. If the processing time on the consumer side is slower than the producing time on…
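
A simplified, self-contained Java sketch of that idea (hypothetical class, not Flink's actual network stack): the sender only puts a buffer on the wire when it holds a credit announced by the consumer:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical sketch: the receiver grants one credit per free buffer;
    // the sender transfers at most as many buffers as it holds credits, so
    // nothing is sent that the receiver cannot immediately store.
    class CreditBasedSender {
        private int credits = 0;
        private final Deque<byte[]> backlog = new ArrayDeque<>();

        synchronized void onCreditAnnounced(int granted) {
            credits += granted;
            drain();
        }

        synchronized void emit(byte[] buffer) {
            backlog.add(buffer);
            drain();
        }

        private void drain() {
            while (credits > 0 && !backlog.isEmpty()) {
                transfer(backlog.poll()); // put exactly one buffer on the wire
                credits--;
            }
        }

        private void transfer(byte[] buffer) { /* network send elided */ }
    }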

Re: Limiting in flight data

2018-07-05 Thread Zhijiang(wangzhijiang999)
Hi Vishal, before Flink 1.5.0 the sender tried its best to send data on the network until the wire was filled with data. Since Flink 1.5.0 the network flow control has been improved by the credit-based idea: the sender transfers data based on how many buffers are available on the receiver side, so there wil…
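
Under the credit-based mode the in-flight data per channel is bounded by the configured buffer counts; a flink-conf.yaml sketch (Flink 1.5 keys, values shown are the defaults):

    # exclusive buffers per input channel (the per-channel credit ceiling)
    taskmanager.network.memory.buffers-per-channel: 2
    # extra floating buffers shared by all channels of one input gate
    taskmanager.network.memory.floating-buffers-per-gate: 8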

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-02 Thread Zhijiang(wangzhijiang999)
…to trigger restarting the job. Zhijiang -- From: Gerard Garcia Sent: July 2, 2018 (Monday) 18:29 To: wangzhijiang999 Cc: user Subject: Re: Flink job hangs/deadlocks (possibly related to out of memory) Thanks Zhijiang, we haven't found any…

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-02 Thread Zhijiang(wangzhijiang999)
Hi Gerard, the stack below only indicates that the task was canceled, which may have been triggered by the job manager because of another task's failure. If the task cannot be interrupted within the configured timeout, the task manager process will exit. Do you see any OutOfMemory messages from the task…
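
The interrupt-then-exit behaviour is driven by two config keys; a flink-conf.yaml sketch (real keys; defaults as I recall them, both in milliseconds):

    # how often the cancelled task's thread is re-interrupted
    task.cancellation.interval: 30000
    # if the task still has not terminated by then, the TaskManager JVM is
    # killed so the job can be recovered elsewhere
    task.cancellation.timeout: 180000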

Re: DataSet with Multiple reduce Actions

2018-06-27 Thread Zhijiang(wangzhijiang999)
Hi Osh, as far as I know, one DataSet source currently cannot be consumed by several different vertices, and from the API you cannot construct the topology you are asking for. I think your way of merging the different reduce functions into one UDF is feasible. Maybe someone has a better solution. :) zhijiang
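
For the merged-UDF route, a small sketch with the real DataSet API (the job itself is made up): computing two aggregates, sum and max, in one reduce pass by carrying both in a tuple:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class TwoReducesInOne {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Tuple2<Long, Long>> stats = env.fromElements(3L, 1L, 7L, 5L)
                // carry the (sum, max) candidates side by side in one tuple
                .map(new MapFunction<Long, Tuple2<Long, Long>>() {
                    public Tuple2<Long, Long> map(Long v) { return Tuple2.of(v, v); }
                })
                // a single reduce computes both aggregates in one pass
                .reduce(new ReduceFunction<Tuple2<Long, Long>>() {
                    public Tuple2<Long, Long> reduce(Tuple2<Long, Long> a,
                                                     Tuple2<Long, Long> b) {
                        return Tuple2.of(a.f0 + b.f0, Math.max(a.f1, b.f1));
                    }
                });
            stats.print(); // prints (16,7)
        }
    }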

Re: Checkpoints very slow with high backpressure

2018-04-07 Thread Zhijiang(wangzhijiang999)
Backpressure indeed delays the checkpoints because of the in-flight network buffers that gradually accumulate before barrier alignment. As Piotr explained, 1.5 can improve this to some extent. After 1.5 we plan to further speed up the checkpoint by controlling the channel reader to imp…

Re: How back pressure works in Flink?

2018-03-06 Thread Zhijiang(wangzhijiang999)
Hi Pawel, the data transfer process on the sender side works in the following way: operator collects record --> serialize into flink buffer --> copy to netty buffer --> flush to socket. On the receiver side: socket --> netty --> flink buffer --> deserialize to record --> operator processes. On the receiver side, if the…

Re: An addition to Netty's memory footprint

2017-06-30 Thread Zhijiang(wangzhijiang999)
Based on Kurt's scenario, if the cumulator allocates a big ByteBuf from the ByteBufAllocator during expansion, it can easily result in creating a new PoolChunk (16 MB) because there is no contiguous memory in the current PoolChunks. And this will push the total used direct memory beyond the estimate. For further…

Re: Question regarding configuring number of network buffers

2017-06-07 Thread Zhijiang(wangzhijiang999)
Hi Ray, regarding your question "Does that say that each parallel task inside the TaskManager talks to all parallel tasks inside the same TaskManager, or to all parallel tasks across all task managers?": each task will talk to all parallel upstream and downstream tasks, which includes both those in the same TaskMana…
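
The sizing rule of thumb in the Flink documentation of that era follows from this all-to-all pattern; a flink-conf.yaml sketch (pre-1.5 key; the formula is the one from the docs, the numbers are an example):

    # rule of thumb: #buffers = (slots-per-TM)^2 * #TMs * 4
    # e.g. 8 slots per TaskManager on 4 TaskManagers -> 8 * 8 * 4 * 4 = 1024
    taskmanager.network.numberOfBuffers: 1024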

Re: Multiple consumers on a subpartition

2017-04-25 Thread Zhijiang(wangzhijiang999)
Hi Albert, as far as I know, if the upstream data is consumed by multiple consumers, multiple subpartitions are generated, and each subpartition corresponds to one input channel consumer. So there is a one-to-one correspondence along subpartition -> subpartition view -> input channel. Cheers,

Re: Yarn terminating TM for pmem limit cascades causing all jobs to fail

2017-04-19 Thread Zhijiang(wangzhijiang999)
Hi Shannon, have you tried increasing the total memory size for the task manager container? Maybe the maximum memory requirement is beyond your current setting. You should also check that your UDF does not consume ever more memory that never gets recycled. If your UDF…
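
One knob that often matters for pmem kills on YARN is the cutoff Flink leaves between the container size and the JVM heap, since direct and native allocations must fit in that gap; a flink-conf.yaml sketch (keys as in the Flink 1.2-era docs; treat the exact names and defaults as an assumption):

    # fraction of the container memory withheld from the heap for off-heap
    # use (network buffers, netty, JVM overhead)
    containerized.heap-cutoff-ratio: 0.25
    # but never withhold less than this many megabytes
    containerized.heap-cutoff-min: 600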

Re: Re: Changing timeout for cancel command

2017-04-13 Thread Zhijiang(wangzhijiang999)
…nager - Un-registering task and sending final execution state CANCELED to JobManager for task Sink: /tmp/flink/events_invalid HDFS sink (477db6e41932ad9b60c72e14de4488ed) Best, Jürgen On 13.04.2017 07:05, Zhijiang(wangzhijiang999)…

Re: Changing timeout for cancel command

2017-04-12 Thread Zhijiang(wangzhijiang999)
Hi Jürgen, you can set the timeout in the configuration with the key "akka.ask.timeout"; the current default value is 10 s. Hope it helps. Cheers, zhijiang -- From: Jürgen Thomann Sent: April 12, 2017 (Wednesday) 19:04 To: user Subject:…
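
A minimal flink-conf.yaml sketch of that suggestion (the value takes a unit, as in the default "10 s"):

    # raise this if cancel (or other RPC) round-trips legitimately take longer
    akka.ask.timeout: 60 s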

Re: PartitionNotFoundException on deploying streaming job

2017-04-05 Thread Zhijiang(wangzhijiang999)
…you find the reason in your case. Please share what you find! Cheers, Zhijiang -- From: Kamil Dziublinski Sent: April 5, 2017 (Wednesday) 16:07 To: Zhijiang(wangzhijiang999) Cc: user Subject: Re: PartitionNotFoundException on deploying streaming job Ok, thanks. I…

Re: PartitionNotFoundException on deploying streaming job

2017-04-04 Thread Zhijiang(wangzhijiang999)
Hi Kamil, when the producer receives the PartitionRequest from the downstream task, it first checks whether the requested partition is already registered. If not, it responds with a PartitionNotFoundException. Once the upstream task has been submitted and begins to run, it registers all its par…

Re: question about record

2017-03-27 Thread Zhijiang(wangzhijiang999)
Hi Lining, the records are serialized and written into the buffer, and the buffer is sent out once it is full or the timeout is exceeded. So different records may share the same buffer if the serialization result is smaller than the buffer size; otherwise one record may span multiple buffer…
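
A self-contained Java sketch of that behaviour (hypothetical code, not Flink's actual record serializer): small records share one fixed-size buffer, and a large record spans several:

    import java.nio.ByteBuffer;
    import java.util.ArrayDeque;
    import java.util.Queue;

    // Records are appended to a fixed-size buffer; a full buffer is "sent"
    // and a record larger than the remaining space spans into the next one.
    public class BufferSpanningDemo {
        static final int BUFFER_SIZE = 32; // tiny, for demonstration only
        static final Queue<ByteBuffer> sent = new ArrayDeque<>();
        static ByteBuffer current = ByteBuffer.allocate(BUFFER_SIZE);

        static void addRecord(byte[] serialized) {
            int pos = 0;
            while (pos < serialized.length) {
                int n = Math.min(current.remaining(), serialized.length - pos);
                current.put(serialized, pos, n);
                pos += n;
                if (!current.hasRemaining()) { // buffer full: send it out
                    current.flip();
                    sent.add(current);
                    current = ByteBuffer.allocate(BUFFER_SIZE);
                }
            }
        }

        public static void main(String[] args) {
            addRecord(new byte[10]);  // shares the first buffer
            addRecord(new byte[10]);  // still fits in the first buffer
            addRecord(new byte[100]); // spans several buffers
            System.out.println("buffers sent: " + sent.size()); // 3
        }
    }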

Re: multiple consumer of intermediate data set

2017-03-14 Thread Zhijiang(wangzhijiang999)
…-B1, A1-IntermediateResultPartition-B2, A2-IntermediateResultPartition-B1, A2-IntermediateResultPartition-B2 in the right graph. Cheers, Zhijiang -- From: lining jing Sent: March 15, 2017 (Wednesday) 10:54 To: user; Zhijiang(wangzhijiang999) Subject: Re…

Re: multiple consumer of intermediate data set

2017-03-13 Thread Zhijiang(wangzhijiang999)
Hi, I think there is no difference between JobVertex(A) and JobVertex(B). Because JobVertex(C) is not shown in the right graph, it may mislead you. There should be another intermediate result partition between JobVertex(B) and JobVertex(C) for each parallel instance, and that is the same case…

Re: TaskManager failure detection

2017-02-22 Thread Zhijiang(wangzhijiang999)
Hi Dominik, as far as I know, the JobManager detects the failure of a TaskManager via the Akka watch mechanism. It is similar to a heartbeat or ping in the network stack. You can refer to this link: "https://cwiki.apache.org/confluence/display/FLINK/Akka+and+Actors". Furthermore, the upstream and do…
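
How quickly that watch declares a TaskManager dead was tunable; a flink-conf.yaml sketch (keys from the Flink docs of that era; defaults from memory, so treat the values as an assumption):

    # interval of the Akka DeathWatch heartbeats
    akka.watch.heartbeat.interval: 10 s
    # acceptable heartbeat pause before the TaskManager is declared failed
    akka.watch.heartbeat.pause: 60 s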

Re: Jobmanager was killed when disk less 10% in yarn

2017-02-19 Thread wangzhijiang999
The log just indicates that the SignalHandler handled the kill signal and the JobManager process exited; you cannot get the reason from it. You may check the container log from the node manager to see why it was killed. Best, Zhijiang -- From: lini…

Re: Flink batch processing fault tolerance

2017-02-16 Thread wangzhijiang999
Yes, it is really a critical problem for large batch jobs because unexpected failures are a common case. We are already focusing on realizing the ideas mentioned in FLIP-1, and we hope to contribute them to Flink within months. Best, Zhijiang -- …

Re: problem of sharing TCP connection when transferring data

2016-05-24 Thread wangzhijiang999
…From: Ufuk Celebi Sent: May 24, 2016 (Tuesday) 01:19 To: user; wangzhijiang999 Subject: Re: problem of sharing TCP connection when transferring data On Mon, May 23, 2016 at 6:55 PM, wangzhijiang999 wrote: > In summary, if one task sets autoread to false, then when it notifies the > available buffer, there ar…

Re: problem of sharing TCP connection when transferring data

2016-05-23 Thread wangzhijiang999
…application, and I wish to stay in contact with you for professional advice. Thank you again! Zhijiang Wang -- From: Ufuk Celebi Sent: May 23, 2016 (Monday) 19:49 To: user; wangzhijiang999 Subject: Re: problem of sharing TCP connection when…

problem of sharing TCP connection when transferring data

2016-05-22 Thread wangzhijiang999
Hi, I am confused about sharing a TCP connection for the same connectionID. If two tasks share the same connection and there is no available buffer in the local buffer pool of the first task, then autoread is set to false for the channel; but this affects the second task if it still…
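
In Netty terms the concern looks roughly like this; a minimal sketch (the handler and BufferPool are hypothetical; Channel#config().setAutoRead() is the real Netty API):

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    // Hypothetical inbound handler: when the reading task's local pool has no
    // buffer left, stop reading from the socket. Because the TCP connection
    // is shared, this also stalls every other task multiplexed on the same
    // channel -- exactly the problem raised in this thread.
    class SharedConnectionHandler extends ChannelInboundHandlerAdapter {

        interface BufferPool { boolean hasAvailable(); } // hypothetical pool

        private final BufferPool pool;

        SharedConnectionHandler(BufferPool pool) { this.pool = pool; }

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            if (!pool.hasAvailable()) {
                ctx.channel().config().setAutoRead(false); // pause ALL reads
            }
            ctx.fireChannelRead(msg);
        }

        void onBufferRecycled(ChannelHandlerContext ctx) {
            ctx.channel().config().setAutoRead(true);      // resume reading
        }
    }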

Re: Blocking or pipelined mode for batch job

2016-05-10 Thread wangzhijiang999
…DataSet mapped1 = data.map(new MyMapper()); DataSet mapped2 = data.map(new AnotherMapper()); mapped1.join(mapped2).where(...).equalTo(...); -- From: Ufuk Celebi Sent: May 10, 2016 (Tuesday) 17:22 To: user; wangzhijiang999 Subject: Re: Blocking or pipelined mode for…

Blocking or pipelined mode for batch job

2016-05-10 Thread wangzhijiang999
Hi, as I reviewed the Flink source code: even if the ExecutionMode is set to "Pipelined", the DataExchangeMode will be set to "Batch" when the breaking-pipeline property is true, for two-input or iteration situations, in order to avoid deadlock. When the DataExchangeMode is set to "Batch", the ResultPartiti…

Re: where can get the summary changes between flink-1.0 and flink-0.10

2016-02-17 Thread wangzhijiang999
Hi Chiwan, thank you for the instant reply. When will the official flink-1.0 be released? Can you give a rough estimate? I am interested in the new features of flink-1.0, like the operator uid, in order to solve my current problem. Regards, Zhijiang Wang

where can get the summary changes between flink-1.0 and flink-0.10

2016-02-16 Thread wangzhijiang999
Hi, where can I get the summary of changes between flink-1.0 and flink-0.10? Thank you in advance! Best Regards, Zhijiang Wang

Re: UpdateTaskExecutionState during JobManager failover

2016-01-14 Thread wangzhijiang999
Hi Stephan, thank you for the detailed explanation. As you said, my opinion is to keep the task running during the jobmanager failover, even though sending the status update failed. For the first reason you mentioned, if I understand correctly, the key issue is the status getting out of sync between the taskmanager and…

UpdateTaskExecutionState during JobManager failover

2016-01-13 Thread wangzhijiang999
Hi, as I know, when the TaskManager sends UpdateTaskExecutionState to the JobManager, if the JobManager fails over and the future response fails, the task will be failed. Is it feasible to retry sending UpdateTaskExecutionState when the future response fails, until it succeeds? In JobManager HA mode, the U…

Re: Re: how to understand the flink flow control

2015-08-10 Thread wangzhijiang999
-- From: Ufuk Celebi Sent: August 10, 2015 (Monday) 17:13 To: user, wangzhijiang999 Subject: Re: Re: how to understand the flink flow control Good to hear. Answers to your questions are inline. I think we should move the discussion to the dev@flink.a.o list though if there are more questions. On Mon, Aug 10…

Re: how to understand the flink flow control

2015-08-10 Thread wangzhijiang999
…is right? Looking forward to your docs! Best wishes, Zhijiang Wang -- From: Ufuk Celebi Sent: August 7, 2015 (Friday) 21:07 To: user, wangzhijiang999 Subject: Re: how to understand the flink flow control Hey Zhijiang Wang, I will update the docs next…

how to understand the flink flow control

2015-08-06 Thread wangzhijiang999
As stated on the Apache page, Flink's streaming runtime has natural flow control: slow downstream operators backpressure faster upstream operators. How should I understand Flink's natural flow control? As I know, Heron has a backpressure mechanism: if some tasks process slowly, it will stop reading from…
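
The "natural" part can be shown with plain Java (a self-contained sketch, nothing Flink-specific): with a bounded buffer between producer and consumer, a full buffer blocks the producer, so it is throttled to the consumer's rate without any extra protocol:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class NaturalBackpressure {
        public static void main(String[] args) {
            // a bounded pool of "buffers" between the two operators
            BlockingQueue<Integer> buffers = new ArrayBlockingQueue<>(4);

            Thread producer = new Thread(() -> {
                for (int i = 0; i < 100; i++) {
                    try {
                        buffers.put(i); // blocks while the queue is full
                    } catch (InterruptedException e) { return; }
                }
            });

            Thread consumer = new Thread(() -> {
                try {
                    while (true) {
                        int v = buffers.take();
                        Thread.sleep(10); // deliberately slow consumer
                        if (v == 99) return;
                    }
                } catch (InterruptedException e) { /* done */ }
            });

            producer.start();
            consumer.start();
        }
    }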

Re: Re: thread model issue in TaskManager

2015-08-05 Thread wangzhijiang999
… -- From: Stephan Ewen Sent: August 3, 2015 (Monday) 22:17 To: user, wangzhijiang999 Subject: Re: Re: thread model issue in TaskManager - Communication to the TaskManager, or directly to the JobManager - Network stack for shuffles to exchange data with other processes (as exchanges go streaming and through…

Re: thread model issue in TaskManager

2015-08-02 Thread wangzhijiang999
…suggestions? Thank you! -- From: Stephan Ewen Sent: August 3, 2015 (Monday) 00:36 To: user Cc: wangzhijiang999 Subject: Re: thread model issue in TaskManager Here are some additional things you can do: - For isolation between parallel tasks (within a…

thread model issue in TaskManager

2015-07-29 Thread wangzhijiang999
As I know, Flink uses a thread model in the TaskManager; that means one taskmanager process may run many different operator threads, and these threads compete for the memory of the process. I know that Flink has a MemoryManager component in each taskManager, and it controls the LocalBufferPool of Inp…
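
The main isolation knob for that competition is the slot count, since managed memory is divided evenly among a TaskManager's slots; a flink-conf.yaml sketch (real key):

    # each TaskManager offers this many slots; managed memory is split evenly
    # across them, bounding how much the tasks of one slot can claim
    taskmanager.numberOfTaskSlots: 4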