[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2017-07-11 Thread Andrian Jardan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082790#comment-16082790
 ] 

Andrian Jardan commented on SPARK-6962:
---

We're also facing this issue on 1.6. Are there any plans to solve it?

Can we help?

> Netty BlockTransferService hangs in the middle of SQL query
> ---
>
> Key: SPARK-6962
> URL: https://issues.apache.org/jira/browse/SPARK-6962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
>Reporter: Jon Chase
> Attachments: jstacks.txt
>
>
> Spark SQL queries (though this seems to be a Spark Core issue - I'm just 
> using queries in the REPL to surface this, so I mention Spark SQL) hang 
> indefinitely under certain (not totally understood) circumstances.  
> This is resolved by setting spark.shuffle.blockTransferService=nio, which 
> seems to point to netty as the issue.  Netty was set as the default for the 
> block transport layer in 1.2.0, which is when this issue started.  Setting 
> the service to nio allows queries to complete normally.
> I do not see this problem when running queries over smaller (~20 5MB files)
> datasets. When I increase the scope to include more data (several hundred
> ~5MB files), the queries will get through several steps but eventually hang
> indefinitely.
> Here's the email chain regarding this issue, including stack traces:
> http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/
> For context, here's the announcement regarding the block transfer service 
> change: 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/
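For anyone who wants to try the workaround described in the report above in their own job, a minimal sketch might look like the following. It assumes a plain Scala application on Spark 1.x (the nio transfer service was removed in later Spark releases); only the spark.shuffle.blockTransferService property and its value come from this report, the surrounding application code is illustrative.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the workaround from the report above: switch the shuffle block
// transfer service from the default "netty" back to "nio" (Spark 1.2-1.5 only).
object NioWorkaroundApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("nio-block-transfer-workaround")
      .set("spark.shuffle.blockTransferService", "nio") // default has been "netty" since 1.2.0
    val sc = new SparkContext(conf)
    // ... run the shuffle-heavy SQL job here ...
    sc.stop()
  }
}
{code}

The same setting can also be passed at submit time, e.g. --conf spark.shuffle.blockTransferService=nio.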






[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-11-15 Thread Romi Kuntsman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005851#comment-15005851
 ] 

Romi Kuntsman commented on SPARK-6962:
--

What's the status of this? Something similar happens to me in 1.4.0 and also in 1.5.1: the job hangs forever on the largest shuffle.

When I increase the number of partitions (as a function of the data size), the issue goes away.
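For reference, a minimal sketch of that partition-count workaround on Spark 1.x is below. The idea of scaling partitions with data size is from the comment above; the 128 MB-per-partition heuristic, the variable names, and the floor of 200 partitions are illustrative assumptions, not values from this ticket.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch: choose the number of shuffle partitions as a function of the input size,
// so that individual shuffle blocks stay small. Heuristic values are assumptions.
object MoreShufflePartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("more-shuffle-partitions"))
    val sqlContext = new SQLContext(sc)

    val inputBytes = 200L * 1024 * 1024 * 1024                       // e.g. ~200 GB of input
    val targetPartitions = (inputBytes / (128L * 1024 * 1024)).toInt.max(200)

    // Applies to Spark SQL shuffles (joins, aggregations, ...).
    sqlContext.setConf("spark.sql.shuffle.partitions", targetPartitions.toString)
    // For a specific DataFrame, df.repartition(targetPartitions) works as well.

    sc.stop()
  }
}
{code}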




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-07-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633005#comment-14633005
 ] 

Reynold Xin commented on SPARK-6962:


[~jonchase] Do you still see the problem on 1.4 or on the master branch?





[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-07-19 Thread Jon Chase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633016#comment-14633016
 ] 

Jon Chase commented on SPARK-6962:
--

I'll check tomorrow on 1.4.0.




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-19 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502270#comment-14502270
 ] 

Aaron Davidson commented on SPARK-6962:
---

I created SPARK-7003 to track a fix to the potential problem I noted, and a PR 
to follow: https://github.com/apache/spark/pull/5584

If you could pull in that patch, it may either fix the issue you're seeing (by retrying in the event of network faults) or at least fail after a few minutes rather than hanging indefinitely -- either result would be interesting.




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-17 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499308#comment-14499308
 ] 

Aaron Davidson commented on SPARK-6962:
---

Executor logs in particular. Are all the remaining tasks hanging, and are they spread across different machines? Similar to what Patrick said, an asymmetry across the machines could suggest that one has stopped responding and everyone else is waiting on it. It's possible that only one executor is behaving erratically, though it's also abnormal that the connection didn't simply time out after a while and the task get retried.




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-17 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500224#comment-14500224
 ] 

Michael Allman commented on SPARK-6962:
---

[~ilikerps] Okay. We're still in the process of our Spark 1.3 migration. Once 
that's complete I will run some test queries and check the executor logs. 
Should I set the log level to debug or is that too noisy?

Also, I forgot to mention here that we seem to have found an effective workaround: setting spark.shuffle.blockTransferService to nio rather than the default netty. Two other members of the mailing list have confirmed that this works for them as well.
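For anyone applying that workaround cluster-wide rather than per job, the equivalent spark-defaults.conf entry would presumably be the one below (the property name and value come from this thread; the file location assumes a standard Spark 1.x deployment):

{code}
# conf/spark-defaults.conf
spark.shuffle.blockTransferService   nio
{code}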




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-17 Thread Jon Chase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500406#comment-14500406
 ] 

Jon Chase commented on SPARK-6962:
--

I'm tailing the executor logs before and while this is happening and I don't see anything out of the ordinary (errors, etc.). Here's what the logs look like when the lockup occurs (again, nothing out of the ordinary). I tailed all of the executors' logs, and they all look similar to this.

== /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-XX-XX-XX-XXX.eu-west-1.compute.internal.log ==
2015-04-17 18:27:58,206 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used
2015-04-17 18:28:01,214 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used
2015-04-17 18:28:04,221 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used
2015-04-17 18:28:07,229 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-17 Thread Jon Chase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500419#comment-14500419
 ] 

Jon Chase commented on SPARK-6962:
--

Looking at the UI when the lockup occurs, I see that every executor has 4 active tasks. It's not the case that, say, only a single executor has a task running - they all appear to be busy while locked up.




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-17 Thread Jon Chase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500429#comment-14500429
 ] 

Jon Chase commented on SPARK-6962:
--

Here's the stderr from the executors at the time of the lockup (there are 3 executors).

18:26:00 is when the lockup happened, and after 20+ minutes, these are still the most recent logs in executor 1:

15/04/17 18:26:00 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1132
15/04/17 18:26:00 INFO executor.Executor: Running task 110.0 in stage 15.0 (TID 1132)
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 3 ms
15/04/17 18:26:00 INFO executor.Executor: Finished task 107.0 in stage 15.0 (TID 1129). 8325 bytes result sent to driver
15/04/17 18:26:00 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1133
15/04/17 18:26:00 INFO executor.Executor: Running task 111.0 in stage 15.0 (TID 1133)
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 2 ms




Here's executor 2; it doesn't show any activity for about 20 minutes (again, the lockup happened at ~18:26:00):

15/04/17 18:25:48 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:25:48 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 11 ms
15/04/17 18:25:49 INFO executor.Executor: Finished task 13.0 in stage 15.0 (TID 1035). 12013 bytes result sent to driver
15/04/17 18:25:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1068
15/04/17 18:25:49 INFO executor.Executor: Running task 46.0 in stage 15.0 (TID 1068)
15/04/17 18:25:49 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:25:49 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 16 ms
15/04/17 18:41:19 WARN server.TransportChannelHandler: Exception in connection from /10.106.144.109:49697
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)
15/04/17 18:41:27 WARN server.TransportChannelHandler: Exception in connection from /10.106.145.10:38473
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)



Same with executor 3:

15/04/17 18:25:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1092
15/04/17 18:25:52 INFO 

[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-17 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14501054#comment-14501054
 ] 

Aaron Davidson commented on SPARK-6962:
---

Thanks for those log excerpts. It is likely significant that each IP appeared exactly once in a connection exception among the executors. Given this warning, but no corresponding "Still have X requests outstanding when connection from 10.106.143.39 is closed" error, I would also be inclined to deduce that only the TransportServer side of the socket is timing out, and that for some reason the connection exception is not reaching the client side of the socket (which would have caused the outstanding fetch requests to fail promptly).

If this situation can arise, then each client could be waiting indefinitely for some other server to respond, which it never will. Is your cluster in any sort of unusual network configuration?

Even so, this could only explain why the hang is indefinite, not why all communication is paused for the 20 minutes leading up to it.

To further diagnose this, it would actually be very useful if you could turn on 
TRACE level debugging for org.apache.spark.storage.ShuffleBlockFetcherIterator 
and org.apache.spark.network (this should look like 
{{log4j.logger.org.apache.spark.network=TRACE}} in the log4j.properties).
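For anyone following along, the corresponding log4j.properties entries would look something like the following (assuming the stock conf/log4j.properties that ships with Spark; the first logger name is simply spelled out from the class mentioned above):

{code}
# Enable TRACE logging for the shuffle fetch path, as requested above
log4j.logger.org.apache.spark.storage.ShuffleBlockFetcherIterator=TRACE
log4j.logger.org.apache.spark.network=TRACE
{code}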




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-17 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14501063#comment-14501063
 ] 

Aaron Davidson commented on SPARK-6962:
---

I have a hypothesis that the above is caused by assumptions we make about the 
eventuality/symmetry of socket timeouts that are not guaranteed in arbitrary 
network topologies. If this is the case, though, then I would also expect nio 
to have intermittent failures, though it could at least recover from them.

A potential fix would be a timer thread in 
[TransportResponseHandler|https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java]
 which would require that there is some message received every N seconds (the 
default network timeout in Spark is 120 seconds) as long as there is some 
outstanding request. This should be fairly robust due to our use of retries on 
IOExceptions in 
[RetryingBlockFetcher|https://github.com/apache/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockFetcher.java].
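To make the idea concrete, here is a rough sketch of such a watchdog built on Netty's stock IdleStateHandler. This is not the actual patch (see SPARK-7003 and the linked PR for that); the class name, the install helper, and the hasOutstandingRequests hook are hypothetical, and only the 120-second default timeout comes from the comment above.

{code:scala}
import io.netty.channel.{ChannelDuplexHandler, ChannelHandlerContext, ChannelPipeline}
import io.netty.handler.timeout.{IdleState, IdleStateEvent, IdleStateHandler}

// Hypothetical sketch: if the channel has been read-idle for the full network timeout
// while fetch requests are still outstanding, close it so the pending fetches fail
// fast (and can be retried) instead of hanging indefinitely.
class FetchWatchdog(hasOutstandingRequests: () => Boolean) extends ChannelDuplexHandler {
  override def userEventTriggered(ctx: ChannelHandlerContext, evt: AnyRef): Unit = evt match {
    case e: IdleStateEvent if e.state() == IdleState.READER_IDLE && hasOutstandingRequests() =>
      ctx.close() // surfaces an IOException to the callers waiting on outstanding requests
    case _ =>
      super.userEventTriggered(ctx, evt)
  }
}

object FetchWatchdog {
  val NetworkTimeoutSecs = 120 // Spark's default network timeout, per the comment above

  def install(pipeline: ChannelPipeline, hasOutstandingRequests: () => Boolean): Unit = {
    // IdleStateHandler fires a READER_IDLE event after NetworkTimeoutSecs without reads.
    pipeline.addLast("idleStateHandler", new IdleStateHandler(NetworkTimeoutSecs, 0, 0))
    pipeline.addLast("fetchWatchdog", new FetchWatchdog(hasOutstandingRequests))
  }
}
{code}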




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-16 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498991#comment-14498991
 ] 

Michael Allman commented on SPARK-6962:
---

[~adav] Which logs would be helpful?

[~pwend...@gmail.com] I've seen this problem occur where a stage is hung waiting for multiple tasks from more than one executor to complete. Also, the GC time reported for the blocked tasks is insignificant, or at least nothing odd compared to the other tasks.

Additionally, I see no unusual CPU usage or load level. The tasks seem to be 
simply idle, waiting for some never-to-be-received input. Also, I see the same 
thread stack trace as the OP (the thread whose stack includes the line 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:278)).
 I think that signal can be used to distinguish this hang from others.

I've also just confirmed with [~rxin] on the mailing list that I'm still seeing 
this problem on branch-1.3 as of 
https://github.com/apache/spark/commit/6d3c4d8b04b2738a821dfcc3df55a5635b89e506.
