[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227405#comment-17227405 ]
Jason Kania commented on FLINK-16468:
-------------------------------------

Until now, I have been pulled away from this work. The problem still exists in that rapid reconnections occur, but I have not had a chance to investigate, and it looks like I won't for a while. If you wish to close this issue, feel free to do so; I will reference it if I am able to reopen and reinvestigate.

> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------
>
>                 Key: FLINK-16468
>                 URL: https://issues.apache.org/jira/browse/FLINK-16468
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.3, 1.9.2, 1.10.0
>         Environment: Linux ubuntu servers running, patch current latest
> Ubuntu patch current release java 8 JRE
>            Reporter: Jason Kania
>            Priority: Major
>             Fix For: 1.12.0
>
>
> In situations where the BlobClient retrieval fails as in the following log,
> rapid retries will exhaust the open sockets. All the retries happen within a
> few milliseconds.
> {noformat}
> 2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient -
> Failed to fetch BLOB
> cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7
> from aaa-1/10.0.1.1:45145 and store it under
> /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-00000004
> Retrying...
> {noformat}
> The above is output repeatedly until the following error occurs:
> {noformat}
> java.io.IOException: Could not connect to BlobServer at address aaa-1/10.0.1.1:45145
>       at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
>       at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
>       at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>       at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
>       at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>       at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
>       at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketException: Too many open files
>       at java.net.Socket.createImpl(Socket.java:478)
>       at java.net.Socket.connect(Socket.java:605)
>       at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)
>       ... 8 more
> {noformat}
> The retries should have some form of backoff in this situation to avoid
> flooding the logs and exhausting other resources on the server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
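
For illustration only, the kind of backoff the issue asks for could look roughly like the sketch below. It is not Flink code and not the fix that was eventually applied: the class `BackoffRetry`, its methods, and all parameter names are hypothetical, and it assumes an exponential delay with random jitter is acceptable for this retry path.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not part of Flink): retry an I/O operation with capped exponential backoff. */
public final class BackoffRetry {

    private BackoffRetry() {}

    /** Upper bound on the wait before retry number {@code attempt} (0-based): base * 2^attempt, capped. */
    public static long delayMillis(int attempt, long baseDelayMs, long maxDelayMs) {
        // Clamp the shift so the doubling cannot overflow for reasonable base delays.
        int shift = Math.min(attempt, 20);
        return Math.min(maxDelayMs, baseDelayMs << shift);
    }

    /**
     * Calls {@code operation} up to {@code maxAttempts} times. After each IOException it
     * sleeps for a random time in [0, delayMillis(attempt)] before trying again, so that
     * repeated failures do not open sockets in a tight loop.
     */
    public static <T> T callWithBackoff(Callable<T> operation, int maxAttempts,
                                        long baseDelayMs, long maxDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    long cap = delayMillis(attempt, baseDelayMs, maxDelayMs);
                    // Jitter spreads out retries from many tasks failing at the same moment.
                    Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
                }
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }
}
```

With a base delay of 100 ms and a cap of 5000 ms, the upper bounds on successive waits would be 100, 200, 400, 800 ms and so on up to the cap, instead of all retries landing within a few milliseconds.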