Thanks for checking. Your analysis sounds correct. The JM is busy processing job submissions, resulting in other submissions not being accepted.
Increasing rest.connection-timeout should resolve your problem.

On Fri, Aug 7, 2020 at 1:59 AM Hailu, Andreas <andreas.ha...@gs.com> wrote:

> Thanks for pointing this out. We had a look - the nodes in our cluster
> have a cap of 65K open files and we aren’t breaching 50% per metrics, so I
> don’t believe this is the problem.
>
> The connection refused error makes us think it’s some process using a
> thread pool for the JobManager hitting capacity on a port somewhere. Does
> this sound correct? Is there a config for us to increase the pool size?
> ------------------------------
> *From:* Robert Metzger <rmetz...@apache.org>
> *Sent:* Wednesday, July 29, 2020 1:52:53 AM
> *To:* Hailu, Andreas [Engineering]
> *Cc:* user@flink.apache.org; Shah, Siddharth [Engineering]
> *Subject:* Re: JobManager refusing connections when running many jobs in
> parallel?
>
> Hi Andreas,
>
> Thanks for reaching out .. this should not happen ...
> Maybe your operating system has configured low limits for the number of
> concurrent connections / sockets. Maybe this thread is helpful:
> https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections
> (there might be better SO threads, I didn't put much effort into searching :) )
>
> On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <andreas.ha...@gs.com>
> wrote:
>
>> Hi team,
>>
>> We’ve observed that when we submit a decent number of jobs in parallel
>> from a single Job Master, we encounter job failures due to Connection
>> Refused exceptions. We’ve seen this behavior start at 30 jobs running in
>> parallel.
>> It’s seemingly transient, however, as upon several retries the
>> job succeeds. The surface level error varies, but digging deeper in stack
>> traces it looks to stem from the Job Manager no longer accepting
>> connections.
>>
>> I’ve included a couple of examples below from failed jobs’ driver logs,
>> with different errors stemming from a connection refused error:
>>
>> First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288
>> Task Manager memory - 30 jobs submitted in parallel, each with parallelism
>> of 1
>>
>> *Job Manager is running @ d43723-563.dc.gs.com*: Using job manager web tracking url <a href="http://d43723-563.dc.gs.com:41268"> Job Manager Web Interface (http://d43723-563.dc.gs.com:41268) </a>
>>
>> org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)
>>     at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
>>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>>     at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>>     ...
>> Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
>>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)
>>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>     at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>     ... 1 more
>> Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: *Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268*
>>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>>     at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
>>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>>     ... 16 more
>> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268
>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
>>     ... 6 more
>> Caused by: java.net.ConnectException: Connection refused
>>
>> Second example: 30 Task Managers/2 cores/4096 Job Manager memory/12288
>> Task Manager memory - 60 jobs submitted in parallel, each with parallelism
>> of 1
>>
>> *Job Manager is running @ d43723-484.dc.gs.com*: Using job manager web tracking url <a href="http://d43723-484.dc.gs.com:36757"> Job Manager Web Interface (http://d43723-484.dc.gs.com:36757) </a>
>>
>> org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 9c4a797df26b510a92a843c756dc4b3d)
>>     at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
>>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>>     at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>>     ...
>> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
>>     at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:382)
>>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>     at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)
>>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>     at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
>>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
>>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>>     ... 3 more
>> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]
>>     at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
>>     at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
>>     at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>>     ... 4 more
>>
>> ... (this pattern repeats for number of unique JobIDs)
>>
>> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]
>>     at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
>>     at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
>>     at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>>     ...
>>
>> 26 05:46:39,734 [CASHFLOW-18394] WARN FlinkClusterStateMonitor - Error while attempting to fetch job details for job 4d20537a676df2855e29b31b1de1ead5
>> com.gs.ep.data.lake.refinerlib.restful.RestfulException: *failed connecting to http://d43723-484.dc.gs.com:36757/jobs/4d20537a676df2855e29b31b1de1ead5 after 1 time(s)*
>> *Caused by: java.net.ConnectException: Connection refused*
>>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>     at java.net.Socket.connect(Socket.java:589)
>>     at java.net.Socket.connect(Socket.java:538)
>>     at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
>>
>> These connection refusal exceptions and their transient nature make me
>> think that it might be a network-related issue. It’s not uncommon for us to
>> need to run 100+ jobs in parallel.
>> How can we investigate what’s causing
>> the Job Manager to periodically refuse connections? I can see a Netty
>> package in the first example’s stack trace – is there any option we can
>> tune?
>>
>> ____________
>>
>> *Andreas Hailu*
>> *Data Lake Engineering* | Goldman Sachs & Co.
>>
>> ------------------------------
>>
>> Your Personal Data: We may collect and process information about you that
>> may be subject to data protection laws. For more information about how we
>> use and disclose your personal data, how we protect your information, our
>> legal basis to use your information, your rights and who you can contact,
>> please refer to: www.gs.com/privacy-notices
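
For reference, here is a rough sketch of how the rest.connection-timeout setting suggested at the top of this reply might look in the flink-conf.yaml used by the submitting client. The values are illustrative assumptions, not tuned recommendations. The two rest.retry.* options are included only because the first stack trace reports that the number of retries was exhausted; whether raising them helps in this setup is untested, so please check the option names and defaults against the configuration docs for your Flink version.

```yaml
# flink-conf.yaml (client side) - illustrative values only, not recommendations
rest.connection-timeout: 60000   # ms the client waits to establish a TCP connection
rest.retry.max-attempts: 40      # assumption: allow more retries before the client gives up
rest.retry.delay: 5000           # assumption: ms to wait between consecutive retries
```

Since the failures were reported as transient, it may be worth raising one value at a time and re-running the 30-job case to see which setting actually makes a difference.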