[ https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antony Mayi updated SPARK-12617:
--------------------------------
Description:
There is a socket descriptor leak in a PySpark streaming app when it is configured with a batch interval slightly longer than 30 seconds. This is due to the default timeout in the py4j JavaGateway, which (half-)closes the CallbackConnection after 30 seconds of inactivity and creates a new one the next time. These connections never get closed on the Python CallbackServer side and keep piling up until they eventually block new connections.

h2. Steps to reproduce:
* Submit the attached [^bug.py] to Spark.
* Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback connections, of which none will ever be closed:
{code}
[BUG] py4j callback server port: 51282
[BUG] py4j CB 0/0 closed
...
[BUG] py4j CB 0/123 closed
{code}
* You can confirm this with lsof on the PySpark driver process:
{code}
$ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
python2.6 39770 das 94u IPv4 138824906 0t0 TCP localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
python2.6 39770 das 95u IPv4 138867747 0t0 TCP localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
python2.6 39770 das 96u IPv4 138831829 0t0 TCP localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
python2.6 39770 das 97u IPv4 138890524 0t0 TCP localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
python2.6 39770 das 98u IPv4 138860190 0t0 TCP localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
python2.6 39770 das 99u IPv4 138860439 0t0 TCP localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
...
{code}
* If you leave it running long enough, the CallbackServer eventually becomes unable to accept new connections from the gateway and the app crashes:
{code}
16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for time 1451711400000 ms
py4j.Py4JException: Error while obtaining a new communication channel
...
Caused by: java.net.ConnectException: Connection timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at java.net.Socket.connect(Socket.java:538)
	at java.net.Socket.<init>(Socket.java:434)
	at java.net.Socket.<init>(Socket.java:244)
	at py4j.CallbackConnection.start(CallbackConnection.java:104)
{code}
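The CLOSE_WAIT pile-up described above can be reproduced outside Spark and py4j with a few lines of plain Python. The sketch below is my own illustration (not taken from [^bug.py]) and assumes a Linux host, where per-socket TCP states can be read from {{/proc/net/tcp}}: a client half-closes its end while the accepting side never calls close(), leaving each accepted descriptor stuck in CLOSE_WAIT, the same state that lsof reports above.

```python
import socket
import time

def count_close_wait(port):
    """Count CLOSE_WAIT sockets bound to the given local port.

    Parses /proc/net/tcp (Linux only); state code 08 is CLOSE_WAIT.
    """
    hexport = format(port, '04X')
    count = 0
    with open('/proc/net/tcp') as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local_addr, state = fields[1], fields[3]
            if local_addr.endswith(':' + hexport) and state == '08':
                count += 1
    return count

# A listening server on an ephemeral loopback port.
srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(5)
port = srv.getsockname()[1]

leaked = []
for _ in range(3):
    client = socket.socket()
    client.connect(('127.0.0.1', port))
    conn, _ = srv.accept()
    leaked.append(conn)  # server side is never closed -- the "leak"
    client.close()       # peer's FIN moves the server socket to CLOSE_WAIT

time.sleep(0.2)  # give the kernel a moment to process the FINs
print(count_close_wait(port))  # typically prints 3 on Linux
```

This is the same failure shape as the py4j CallbackServer: the Java side times out and closes its end of each CallbackConnection, and because the Python side never closes the accepted socket, every connection cycle leaves one more descriptor behind.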
> socket descriptor leak killing streaming app
> --------------------------------------------
>
>                 Key: SPARK-12617
>                 URL: https://issues.apache.org/jira/browse/SPARK-12617
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Streaming
>    Affects Versions: 1.5.2
>         Environment: pyspark (python 2.6)
>            Reporter: Antony Mayi
>            Priority: Critical
>         Attachments: bug.py
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)