[jira] [Updated] (SPARK-12617) socket descriptor leak killing streaming app

2016-01-06 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12617:
-
Affects Version/s: 1.6.0

> socket descriptor leak killing streaming app
> 
>
> Key: SPARK-12617
> URL: https://issues.apache.org/jira/browse/SPARK-12617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.5.2, 1.6.0
> Environment: pyspark (python 2.6)
>Reporter: Antony Mayi
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
> Attachments: bug.py
>
>
> There is a socket descriptor leak in a pyspark streaming app when 
> configured with a batch interval of more than 30 seconds. This is due to the 
> default timeout in the py4j JavaGateway, which (half-)closes the 
> CallbackConnection after 30 seconds of inactivity and creates a new one next 
> time. Those connections don't get closed on the python CallbackServer side 
> and keep piling up until they eventually block new connections.
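> For illustration, a minimal app of the kind described (an illustrative 
> sketch, not the attached [^bug.py]; it assumes a socket text source 
> listening on localhost:9999):
> {code}
> from pyspark import SparkContext
> from pyspark.streaming import StreamingContext
>
> sc = SparkContext(appName="py4j-callback-leak-demo")
> # Any batch interval above the 30 second py4j idle timeout lets the gateway
> # half-close its callback connection between batches.
> ssc = StreamingContext(sc, batchDuration=60)
>
> lines = ssc.socketTextStream("localhost", 9999)
> # Each batch invokes the Python transform functions through the py4j
> # callback server on the driver (the mechanism described above).
> lines.count().pprint()
>
> ssc.start()
> ssc.awaitTermination()
> {code}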
> h2. Steps to reproduce:
> * Submit the attached [^bug.py] to Spark
> * Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback 
> connections, none of which are ever closed
> {code}
> [BUG] py4j callback server port: 51282
> [BUG] py4j CB 0/0 closed
> ...
> [BUG] py4j CB 0/123 closed
> {code}
> * You can confirm this by running lsof on the pyspark driver process:
> {code}
> $ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
> python2.6 39770  das   94u  IPv4 138824906  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
> python2.6 39770  das   95u  IPv4 138867747  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
> python2.6 39770  das   96u  IPv4 138831829  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
> python2.6 39770  das   97u  IPv4 138890524  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
> python2.6 39770  das   98u  IPv4 138860190  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
> python2.6 39770  das   99u  IPv4 138860439  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
> ...
> {code}
> * If you leave it running long enough, the CallbackServer eventually becomes 
> unable to accept new connections from the gateway and the app crashes (a 
> quick way to inspect the driver's descriptor usage is sketched after this 
> list):
> {code}
> 16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 145171140 ms
> py4j.Py4JException: Error while obtaining a new communication channel
> ...
> Caused by: java.net.ConnectException: Connection timed out
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at java.net.Socket.<init>(Socket.java:434)
> at java.net.Socket.<init>(Socket.java:244)
> at py4j.CallbackConnection.start(CallbackConnection.java:104)
> {code}
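> The point at which new callback connections start failing presumably reflects 
> the driver's descriptor budget being consumed by the CLOSE_WAIT sockets (an 
> assumption about the failure mode, not something the log above states). A 
> quick, Linux-specific way to watch this from the driver process:
> {code}
> import os
> import resource
>
> # Soft limit on open file descriptors for this process.
> soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
> # Number of descriptors currently open (relies on Linux /proc).
> open_fds = len(os.listdir("/proc/self/fd"))
> print("open fds: %d of soft limit %d" % (open_fds, soft))
> {code}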



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12617) socket descriptor leak killing streaming app

2016-01-04 Thread Antony Mayi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antony Mayi updated SPARK-12617:

Description: 
There is a socket descriptor leak in a pyspark streaming app when configured 
with a batch interval of more than 30 seconds. This is due to the default 
timeout in the py4j JavaGateway, which (half-)closes the CallbackConnection 
after 30 seconds of inactivity and creates a new one next time. Those 
connections don't get closed on the python CallbackServer side and keep piling 
up until they eventually block new connections.

h2. Steps to reproduce:
* Submit the attached [^bug.py] to Spark
* Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback 
connections, none of which are ever closed
{code}
[BUG] py4j callback server port: 51282
[BUG] py4j CB 0/0 closed
...
[BUG] py4j CB 0/123 closed
{code}
* You can confirm this by running lsof on the pyspark driver process (a 
scriptable variant is sketched after these steps):
{code}
$ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
python2.6 39770  das   94u  IPv4 138824906  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
python2.6 39770  das   95u  IPv4 138867747  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
python2.6 39770  das   96u  IPv4 138831829  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
python2.6 39770  das   97u  IPv4 138890524  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
python2.6 39770  das   98u  IPv4 138860190  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
python2.6 39770  das   99u  IPv4 138860439  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
...
{code}
* If you leave it running long enough, the CallbackServer eventually becomes 
unable to accept new connections from the gateway and the app crashes:
{code}
16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for time 
145171140 ms
py4j.Py4JException: Error while obtaining a new communication channel
...
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:244)
at py4j.CallbackConnection.start(CallbackConnection.java:104)
{code}
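
The scriptable variant of the lsof check mentioned above (an illustrative 
sketch only; it assumes lsof is installed and simply wraps the command shown 
in the steps):
{code}
import subprocess

def close_wait_count(pid, port):
    """Count CLOSE_WAIT sockets the given process holds on the given local port."""
    out = subprocess.Popen(["lsof", "-p", str(pid)],
                           stdout=subprocess.PIPE,
                           universal_newlines=True).communicate()[0]
    needle = ":%d->" % port
    return sum(1 for line in out.splitlines()
               if "CLOSE_WAIT" in line and needle in line)

# Example: the pid and callback server port from the output above.
print(close_wait_count(39770, 51282))
{code}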

  was:
There is a socket descriptor leak in a pyspark streaming app when configured 
with a batch interval slightly more than 30 seconds. This is due to the default 
timeout in the py4j JavaGateway, which (half-)closes the CallbackConnection 
after 30 seconds of inactivity and creates a new one next time. Those 
connections don't get closed on the python CallbackServer side and keep piling 
up until they eventually block new connections.

h2. Steps to reproduce:
* Submit the attached [^bug.py] to Spark
* Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback 
connections, none of which are ever closed
{code}
[BUG] py4j callback server port: 51282
[BUG] py4j CB 0/0 closed
...
[BUG] py4j CB 0/123 closed
{code}
* You can confirm this by running lsof on the pyspark driver process:
{code}
$ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
python2.6 39770  das   94u  IPv4 138824906  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
python2.6 39770  das   95u  IPv4 138867747  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
python2.6 39770  das   96u  IPv4 138831829  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
python2.6 39770  das   97u  IPv4 138890524  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
python2.6 39770  das   98u  IPv4 138860190  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
python2.6 39770  das   99u  IPv4 138860439  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
...
{code}
* If you leave it running long enough, the CallbackServer eventually becomes 
unable to accept new connections from the gateway and the app crashes:
{code}
16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for time 
145171140 ms
py4j.Py4JException: Error while obtaining a new communication channel
...
Caused by: java.net.ConnectException: Connection timed out
 

[jira] [Updated] (SPARK-12617) socket descriptor leak killing streaming app

2016-01-04 Thread Antony Mayi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antony Mayi updated SPARK-12617:

Description: 
There is a socket descriptor leak in a pyspark streaming app when configured 
with a batch interval slightly more than 30 seconds. This is due to the default 
timeout in the py4j JavaGateway, which (half-)closes the CallbackConnection 
after 30 seconds of inactivity and creates a new one next time. Those 
connections don't get closed on the python CallbackServer side and keep piling 
up until they eventually block new connections.

h2. Steps to reproduce:
* Submit the attached [^bug.py] to Spark
* Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback 
connections, none of which are ever closed
{code}
[BUG] py4j callback server port: 51282
[BUG] py4j CB 0/0 closed
...
[BUG] py4j CB 0/123 closed
{code}
* You can confirm this by running lsof on the pyspark driver process:
{code}
$ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
python2.6 39770  das   94u  IPv4 138824906  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
python2.6 39770  das   95u  IPv4 138867747  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
python2.6 39770  das   96u  IPv4 138831829  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
python2.6 39770  das   97u  IPv4 138890524  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
python2.6 39770  das   98u  IPv4 138860190  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
python2.6 39770  das   99u  IPv4 138860439  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
...
{code}
* If you leave it running long enough, the CallbackServer eventually becomes 
unable to accept new connections from the gateway and the app crashes:
{code}
16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for time 
145171140 ms
py4j.Py4JException: Error while obtaining a new communication channel
...
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:244)
at py4j.CallbackConnection.start(CallbackConnection.java:104)
{code}

  was:
There is a socket descriptor leak in a pyspark streaming app when configured 
with a batch interval slightly more than 30 seconds. This is due to the default 
timeout in the py4j JavaGateway, which (half-)closes the CallbackConnection 
after 30 seconds of inactivity and creates a new one next time. Those 
connections don't get closed on the python CallbackServer side and keep piling 
up until they eventually block new connections.

h2. Steps to reproduce:
* Submit the attached [^bug.py] to Spark
* Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback 
connections, none of which are ever closed
{code}
[BUG] py4j callback server port: 51282
[BUG] py4j CB 0/0 closed
...
[BUG] py4j CB 0/123 closed
{code}
* You can confirm this by running lsof on the pyspark driver process:
{code}
$ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
python2.6 39770  das   94u  IPv4 138824906  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
python2.6 39770  das   95u  IPv4 138867747  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
python2.6 39770  das   96u  IPv4 138831829  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
python2.6 39770  das   97u  IPv4 138890524  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
python2.6 39770  das   98u  IPv4 138860190  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
python2.6 39770  das   99u  IPv4 138860439  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
...
{code}
* If you leave it running long enough, the CallbackServer eventually becomes 
unable to accept new connections from the gateway and the app crashes:
{code}
16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for time 
145171140 ms
py4j.Py4JException: Error while obtaining a new communication channel
...
Caused by: java.net.ConnectException: Connection timed out

[jira] [Updated] (SPARK-12617) socket descriptor leak killing streaming app

2016-01-04 Thread Antony Mayi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antony Mayi updated SPARK-12617:

Description: 
There is a socket descriptor leak in a pyspark streaming app when configured 
with a batch interval slightly more than 30 seconds. This is due to the default 
timeout in the py4j JavaGateway, which (half-)closes the CallbackConnection 
after 30 seconds of inactivity and creates a new one next time. Those 
connections don't get closed on the python CallbackServer side and keep piling 
up until they eventually block new connections.

h2. Steps to reproduce:
* Submit the attached [^bug.py] to Spark
* Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback 
connections, none of which are ever closed
{code}
[BUG] py4j callback server port: 51282
[BUG] py4j CB 0/0 closed
...
[BUG] py4j CB 0/123 closed
{code}
* You can confirm this by running lsof on the pyspark driver process:
{code}
$ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
python2.6 39770  das   94u  IPv4 138824906  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
python2.6 39770  das   95u  IPv4 138867747  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
python2.6 39770  das   96u  IPv4 138831829  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
python2.6 39770  das   97u  IPv4 138890524  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
python2.6 39770  das   98u  IPv4 138860190  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
python2.6 39770  das   99u  IPv4 138860439  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
...
{code}
* If you leave it running long enough, the CallbackServer eventually becomes 
unable to accept new connections from the gateway and the app crashes:
{code}
16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for time 
145171140 ms
py4j.Py4JException: Error while obtaining a new communication channel
...
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:244)
at py4j.CallbackConnection.start(CallbackConnection.java:104)
{code}

  was:
There is a socket descriptor leak in a pyspark streaming app when configured 
with a batch interval slightly more than 30 seconds. This is due to the default 
timeout in the py4j JavaGateway, which (half-)closes the CallbackConnection 
after 30 seconds of inactivity and creates a new one next time. Those 
connections don't get closed on the python CallbackServer side and keep piling 
up until they eventually block new connections.

h2. Steps to reproduce:
* Submit the attached [^bug.py] to Spark
* Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback 
connections, none of which are ever closed
{code}
[BUG] py4j callback server port: 51282
[BUG] py4j CB 0/0 closed
...
[BUG] py4j CB 0/123 closed
{code}
* You can confirm this by running lsof on the pyspark driver process:
{code}
$ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
python2.6 39770  das   94u  IPv4 138824906  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
python2.6 39770  das   95u  IPv4 138867747  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
python2.6 39770  das   96u  IPv4 138831829  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
python2.6 39770  das   97u  IPv4 138890524  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
python2.6 39770  das   98u  IPv4 138860190  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
python2.6 39770  das   99u  IPv4 138860439  0t0   TCP 
localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
...
{code}
* If you leave it running long enough, the CallbackServer eventually becomes 
unable to accept new connections from the gateway and the app crashes:
{code}
16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for time 
145171140 ms
py4j.Py4JException: Error while obtaining a new communication channel
...
Caused by: java.net.ConnectException: Connection timed out
   

[jira] [Updated] (SPARK-12617) socket descriptor leak killing streaming app

2016-01-04 Thread Antony Mayi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antony Mayi updated SPARK-12617:

Attachment: bug.py

> socket descriptor leak killing streaming app
> 
>
> Key: SPARK-12617
> URL: https://issues.apache.org/jira/browse/SPARK-12617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.5.2
> Environment: pyspark (python 2.6)
>Reporter: Antony Mayi
>Priority: Critical
> Attachments: bug.py
>
>
> There is a socket descriptor leak in a pyspark streaming app when configured 
> with a batch interval slightly more than 30 seconds. This is due to the 
> default timeout in the py4j JavaGateway, which (half-)closes the 
> CallbackConnection after 30 seconds of inactivity and creates a new one next 
> time. Those connections don't get closed on the python CallbackServer side 
> and keep piling up until they eventually block new connections.
> h2. Steps to reproduce:
> * Submit the attached [^bug.py] to Spark
> * Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback 
> connections, none of which are ever closed
> {code}
> [BUG] py4j callback server port: 51282
> [BUG] py4j CB 0/0 closed
> ...
> [BUG] py4j CB 0/123 closed
> {code}
> * You can confirm this by running lsof on the pyspark driver process:
> {code}
> $ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
> python2.6 39770  das   94u  IPv4 138824906  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
> python2.6 39770  das   95u  IPv4 138867747  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
> python2.6 39770  das   96u  IPv4 138831829  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
> python2.6 39770  das   97u  IPv4 138890524  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
> python2.6 39770  das   98u  IPv4 138860190  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
> python2.6 39770  das   99u  IPv4 138860439  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
> ...
> {code}
> * If you leave it running long enough, the CallbackServer eventually becomes 
> unable to accept new connections from the gateway and the app crashes:
> {code}
> 16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 145171140 ms
> py4j.Py4JException: Error while obtaining a new communication channel
> ...
> Caused by: java.net.ConnectException: Connection timed out
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at java.net.Socket.<init>(Socket.java:434)
> at java.net.Socket.<init>(Socket.java:244)
> at py4j.CallbackConnection.start(CallbackConnection.java:104)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org