Thanks for checking.

Your analysis sounds correct. The JM is busy processing job submissions,
resulting in other submissions not being accepted.

Increasing rest.connection-timeout should resolve your problem.


On Fri, Aug 7, 2020 at 1:59 AM Hailu, Andreas <andreas.ha...@gs.com> wrote:

> Thanks for pointing this out. We had a look - the nodes in our cluster
> have a cap of 65K open files and we aren’t breaching 50% per metrics, so I
> don’t believe this is the problem.
>
> The connection refused error makes us think it’s some process using a
> thread pool for the JobManager hitting capacity on a port somewhere. This
> sound correct? Is there a config for us to increase the pool size?
> ------------------------------
> *From:* Robert Metzger <rmetz...@apache.org>
> *Sent:* Wednesday, July 29, 2020 1:52:53 AM
> *To:* Hailu, Andreas [Engineering]
> *Cc:* user@flink.apache.org; Shah, Siddharth [Engineering]
> *Subject:* Re: JobManager refusing connections when running many jobs in
> parallel?
>
> Hi Andreas,
>
> Thanks for reaching out .. this should not happen ...
> Maybe your operating system has configured low limits for the number of
> concurrent connections / sockets. Maybe this thread is helpful:
> https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_questions_923990_why-2Ddo-2Di-2Dget-2Dconnection-2Drefused-2Dafter-2D1024-2Dconnections&d=DwMFaQ&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=BIYf5yz3admb-1ZnilWzxDnW3JB8d8VkHSSBMPTHaQI&s=bwNr69eekHCEpop2wur_LkAkIxqza-OjwNmG7cv8atc&e=>
>  (there
> might better SO threads, I didn't put much effort into searching :) )
>
> On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <andreas.ha...@gs.com>
> wrote:
>
>> Hi team,
>>
>>
>>
>> We’ve observed that when we submit a decent number of jobs in parallel
>> from a single Job Master, we encounter job failures due with Connection
>> Refused exceptions. We’ve seen this behavior start at 30 jobs running in
>> parallel. It’s seemingly transient, however, as upon several retries the
>> job succeeds. The surface level error varies, but digging deeper in stack
>> traces it looks to stem from the Job Manager no longer accepting
>> connections.
>>
>>
>>
>> I’ve included a couple of examples below from failed jobs’ driver logs,
>> with different errors stemming from a connection refused error:
>>
>>
>>
>> First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288
>> Task Manager memory - 30 jobs submitted in parallel, each with parallelism
>> of 1
>>
>> *Job Manager is running @ d43723-563.dc.gs.com
>> <http://d43723-563.dc.gs.com>*: Using job manager web tracking url <a
>> href="http://d43723-563.dc.gs.com:41268";> Job Manager Web Interface  (
>> http://d43723-563.dc.gs.com:41268) </a>
>>
>> org.apache.flink.client.program.ProgramInvocationException: Could not
>> retrieve the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)
>>
>> at
>> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
>>
>> at
>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>>
>> at
>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>>
>> at
>> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>>
>> ...
>>
>> Caused by:
>> org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not
>> complete the operation. Number of retries has been exhausted.
>>
>> at
>> org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)
>>
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>
>> at
>> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>
>> ... 1 more
>>
>> Caused by: java.util.concurrent.CompletionException:
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>> *Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268
>> <http://d43723-563.dc.gs.com/10.47.126.221:41268>*
>>
>> at
>> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>>
>> at
>> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>>
>> at
>> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
>>
>> at
>> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>>
>> ... 16 more
>>
>> Caused by:
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>> Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268
>>
>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>
>> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
>>
>> ... 6 more
>>
>> Caused by: java.net.ConnectException: Connection refused
>>
>>
>>
>> Second example: 30 Task Managers/2 cores/4096 Job Manager memory/12288
>> Task Manager memory - 60 jobs submitted in parallel, each with parallelism
>> of 1
>>
>> *Job Manager is running @ d43723-484.dc.gs.com
>> <http://d43723-484.dc.gs.com>*: Using job manager web tracking url <a
>> href="http://d43723-484.dc.gs.com:36757";> Job Manager Web Interface  (
>> http://d43723-484.dc.gs.com:36757) </a>
>>
>> org.apache.flink.client.program.ProgramInvocationException: Could not
>> retrieve the execution result. (JobID: 9c4a797df26b510a92a843c756dc4b3d)
>>
>> at
>> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
>>
>> at
>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>>
>> at
>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>>
>> at
>> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>>
>> ...
>>
>> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed
>> to submit JobGraph.
>>
>> at
>> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:382)
>>
>> at
>> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>
>> at
>> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>>
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>
>> at
>> org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)
>>
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>
>> at
>> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
>>
>> at
>> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
>>
>> at
>> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>>
>> ... 3 more
>>
>> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could
>> not upload job files.]
>>
>> at
>> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
>>
>> at
>> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
>>
>> at
>> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>>
>> at
>> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>>
>> ... 4 more
>>
>> ... (this pattern repeats for number of unique JobIDs)
>>
>> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could
>> not upload job files.]
>>
>> at
>> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
>>
>> at
>> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
>>
>> at
>> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>>
>> at
>> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>>
>> at
>> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>> ...
>>
>> 26 05:46:39,734 [CASHFLOW-18394] WARN  FlinkClusterStateMonitor - Error
>> while attempting to fetch job details for job
>> 4d20537a676df2855e29b31b1de1ead5
>>
>> com.gs.ep.data.lake.refinerlib.restful.RestfulException: * failed
>> connecting to
>> http://d43723-484.dc.gs.com:36757/jobs/4d20537a676df2855e29b31b1de1ead5
>> <http://d43723-484.dc.gs.com:36757/jobs/4d20537a676df2855e29b31b1de1ead5>
>> after 1 time(s)*
>>
>> *Caused by: java.net.ConnectException: Connection refused*
>>
>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>>
>> at
>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>>
>> at
>> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>>
>> at
>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>>
>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>
>> at java.net.Socket.connect(Socket.java:589)
>>
>> at java.net.Socket.connect(Socket.java:538)
>>
>> at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
>>
>>
>>
>> These connection refusal exceptions and their transient nature makes me
>> think that it might be a network-related issue. It’s not uncommon for us to
>> need to run 100+ jobs in parallel. How can we investigate what’s causing
>> the Job Manager to periodically refuse connections? I can see a Netty
>> package in the first example’s stack trace – is there any option we can
>> tune?
>>
>>
>>
>> ____________
>>
>>
>>
>> *Andreas Hailu*
>>
>> *Data Lake Engineering *| Goldman Sachs & Co.
>>
>>
>>
>> ------------------------------
>>
>> Your Personal Data: We may collect and process information about you that
>> may be subject to data protection laws. For more information about how we
>> use and disclose your personal data, how we protect your information, our
>> legal basis to use your information, your rights and who you can contact,
>> please refer to: www.gs.com/privacy-notices
>>
>
> ------------------------------
>
> Your Personal Data: We may collect and process information about you that
> may be subject to data protection laws. For more information about how we
> use and disclose your personal data, how we protect your information, our
> legal basis to use your information, your rights and who you can contact,
> please refer to: www.gs.com/privacy-notices
>

Reply via email to