[jira] [Commented] (FLINK-8625) Move OutputFlusher thread to Netty scheduled executor
[ https://issues.apache.org/jira/browse/FLINK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323581#comment-17323581 ]

Flink Jira Bot commented on FLINK-8625:
---------------------------------------

This issue is assigned but has not received an update in 7 days so it has been labeled "stale-assigned". If you are still working on the issue, please give an update and remove the label. If you are no longer working on the issue, please unassign so someone else may work on it. In 7 days the issue will be automatically unassigned.

> Move OutputFlusher thread to Netty scheduled executor
> -----------------------------------------------------
>
> Key: FLINK-8625
> URL: https://issues.apache.org/jira/browse/FLINK-8625
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Reporter: Piotr Nowojski
> Assignee: Piotr Nowojski
> Priority: Major
> Labels: stale-assigned
>
> This will allow us to trigger/schedule next flush only if we are not
> currently busy.
> PR: https://github.com/apache/flink/pull/6698

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-8625) Move OutputFlusher thread to Netty scheduled executor
[ https://issues.apache.org/jira/browse/FLINK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366763#comment-16366763 ]

Piotr Nowojski commented on FLINK-8625:
---------------------------------------

Yes. Instead of this OutputFlusher thread, we want to move this job to {{PartitionRequestQueue}} inside Netty's thread pool. {{PartitionRequestQueue}} could just constantly "busy loop" over all of the subpartitions, and if it encounters an "empty" subpartition, it could back off and try again later with some timeout. This will have the following advantages:

# No GC issues from the constant and frequent scheduling of data notifications caused by the current OutputFlusher thread. If we have 1000 output channels and a 1ms flush timeout, this adds up to 1 000 000 notifications per second.
# No context switching and no synchronisation for the flushing.
# Instead of notifying 1000 output channels one by one, as happens right now, one thread running one instance of {{PartitionRequestQueue}} will handle all of the output channels assigned to it. With 10 Netty threads, each thread will handle ~100 output channels in one flush.
# In a full-throughput scenario, when all partitions are constantly full, we can skip the flush.

> Move OutputFlusher thread to Netty scheduled executor
> -----------------------------------------------------
>
> Key: FLINK-8625
> URL: https://issues.apache.org/jira/browse/FLINK-8625
> Project: Flink
> Issue Type: Sub-task
> Components: Network
> Reporter: Piotr Nowojski
> Priority: Major
>
> This will allow us to trigger/schedule next flush only if we are not
> currently busy.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
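The scheme described in the comment above (one flush task per Netty thread that walks all of its channels and only schedules the next pass once the current one is done) could look roughly like the sketch below. This is not Flink code: {{EventLoopFlusher}}, {{Channel}}, {{flushPass}} and {{tick}} are hypothetical names, and a plain {{java.util.concurrent.ScheduledExecutorService}} stands in for the Netty event loop so the example has no Netty dependency.

```java
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch, not Flink code: one self-rescheduling flush task per event-loop thread. */
final class EventLoopFlusher {

    /** Hypothetical stand-in for one output channel (subpartition). */
    static final class Channel {
        private final AtomicInteger pending = new AtomicInteger();
        private final AtomicInteger flushed = new AtomicInteger();

        /** Buffers some records until the next flush. */
        void write(int records) {
            pending.addAndGet(records);
        }

        /** Flushes buffered records; returns false if the channel was empty. */
        boolean flush() {
            int n = pending.getAndSet(0);
            flushed.addAndGet(n);
            return n > 0;
        }

        int flushed() {
            return flushed.get();
        }
    }

    private final ScheduledExecutorService loop; // assumption: stands in for a Netty event loop
    private final List<Channel> channels;        // channels assigned to this loop thread
    private final long timeoutMillis;            // flush timeout

    EventLoopFlusher(ScheduledExecutorService loop, List<Channel> channels, long timeoutMillis) {
        this.loop = loop;
        this.channels = channels;
        this.timeoutMillis = timeoutMillis;
    }

    /** One pass over all assigned channels; returns how many of them had data to flush. */
    int flushPass() {
        int flushedChannels = 0;
        for (Channel channel : channels) {
            if (channel.flush()) {
                flushedChannels++;
            }
        }
        return flushedChannels;
    }

    /** Schedules the first pass; every later pass is scheduled by tick() itself. */
    void start() {
        loop.schedule(this::tick, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    private void tick() {
        flushPass();
        // The next pass is scheduled only after this one has finished, so at most
        // one flush task per loop is ever queued, regardless of the channel count.
        if (!loop.isShutdown()) {
            loop.schedule(this::tick, timeoutMillis, TimeUnit.MILLISECONDS);
        }
    }
}
```

Under these assumptions, each loop thread enqueues at most one task per timeout interval. With the numbers from the comment (1000 channels, 1ms timeout, 10 threads) that is at most about 10 x 1000 = 10 000 scheduled tasks per second, compared with the 1 000 000 per-channel notifications per second of the current approach.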
[jira] [Commented] (FLINK-8625) Move OutputFlusher thread to Netty scheduled executor
[ https://issues.apache.org/jira/browse/FLINK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366531#comment-16366531 ]

mingleizhang commented on FLINK-8625:
-------------------------------------

Hi [~pnowojski], one thing I would like to confirm: currently, every instance of {{StreamRecordWriter}} creates its own thread to do the flush work. Do we want to replace that by scheduling the flush task with {{EventLoop}}?

> Move OutputFlusher thread to Netty scheduled executor
> -----------------------------------------------------
>
> Key: FLINK-8625
> URL: https://issues.apache.org/jira/browse/FLINK-8625
> Project: Flink
> Issue Type: Sub-task
> Components: Network
> Reporter: Piotr Nowojski
> Priority: Major
>
> This will allow us to trigger/schedule next flush only if we are not
> currently busy.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (FLINK-8625) Move OutputFlusher thread to Netty scheduled executor
[ https://issues.apache.org/jira/browse/FLINK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360611#comment-16360611 ]

Piotr Nowojski commented on FLINK-8625:
---------------------------------------

I have found one more thing. After fixing the performance bottlenecks in https://issues.apache.org/jira/browse/FLINK-8581, the GC pressure caused by OutputFlusher is currently our biggest performance bottleneck/issue. OutputFlusher, executed once per 1ms for 1000 output channels, enqueues 1000 elements every 1ms on Netty's internal executor. I presume those objects are piling up and ending up in the old GC generation. This GC pressure is causing huge throughput fluctuations (because of long GC pauses), from 20,000 records/ms down to 160 records/ms. Those long GC pauses are quite dangerous, since they can cause job failures.

> Move OutputFlusher thread to Netty scheduled executor
> -----------------------------------------------------
>
> Key: FLINK-8625
> URL: https://issues.apache.org/jira/browse/FLINK-8625
> Project: Flink
> Issue Type: Sub-task
> Components: Network
> Reporter: Piotr Nowojski
> Priority: Major
>
> This will allow us to trigger/schedule next flush only if we are not
> currently busy.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (FLINK-8625) Move OutputFlusher thread to Netty scheduled executor
[ https://issues.apache.org/jira/browse/FLINK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360533#comment-16360533 ]

mingleizhang commented on FLINK-8625:
-------------------------------------

Thanks, [~pnowojski]. We can improve the performance and reduce the resource usage with the executor.

> Move OutputFlusher thread to Netty scheduled executor
> -----------------------------------------------------
>
> Key: FLINK-8625
> URL: https://issues.apache.org/jira/browse/FLINK-8625
> Project: Flink
> Issue Type: Sub-task
> Components: Network
> Reporter: Piotr Nowojski
> Priority: Major
>
> This will allow us to trigger/schedule next flush only if we are not
> currently busy.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)