Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
FYI, finally, I figured out the root cause:
https://github.com/netty/netty/issues/5833
As far as I understand, `System.setProperty("io.netty.maxDirectMemory",
"0");` should be a correct workaround.
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
Agreed. I'm going to merge to master. Thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
OK, I think this is a good change. Maybe to be conservative we'll only put
this in master.
---
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
> @zsxwing you seem to understand this better, but is it that the default
behavior changes and is probably a bad default now, or just that it's
inappropriate for Spark?
I don't have a theory.
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
> For future reference here is the context of how that option is used:
https://github.com/netty/netty/blob/e7449b1ef361c55457ed21d44d6ed8387ec1fa45/common/src/main/java/io/netty/util/internal/PlatformDependent.java
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65361/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65361 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65361/consoleFull)**
for PR 14961 at commit
[`8f6783b`](https://github.com/apache/spark/commit/8f6783b)
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Nice research here. So that's probably the only real way to set this
property? It has to be a system property, I guess, and it should fire before
the classes in question init, as far as I can see.
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Thanks @zsxwing, I've removed our older experiments in favour of this one.
For future reference, here is the context of how that option is used:
https://github.com/netty/netty/blob/e7449b1ef361c55457ed21d44d6ed8387ec1fa45/common/src/main/java/io/netty/util/internal/PlatformDependent.java
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65361 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65361/consoleFull)**
for PR 14961 at commit
[`8f6783b`](https://github.com/apache/spark/commit/8f6783b)
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
Confirmed the issue was introduced by
https://github.com/netty/netty/commit/d58dec8862e02fc2a98f8dcdb166db4b788be50a#diff-8d83d75ebf8a18cc48bf0a0b1183c188
Add `System.setProperty("io.netty.maxDirectMemory", "0")` as a workaround.
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
Hm, I can reproduce the same error using this command `build/sbt "project
core" "test-only *Shuffle*"` locally. The first broken version is 4.0.37.Final.
---
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
Oh, the allocator is set here:
https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java#L95
---
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
@a-roberts could you binary search the first broken netty version? Since
this cannot be reproduced locally, you have to push new commits.
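The bisection itself is simple; a sketch is below. `test_version` is a hypothetical stand-in for "set the Netty version in the pom, push a commit, and check the PR builder result", stubbed here so the example is self-contained; the stub makes the first failing release come out as 4.0.37.Final, which is what the thread eventually found.

```shell
# Candidate releases between the last-good 4.0.29.Final and the target
# 4.0.41.Final (an illustrative subset, not the full list).
versions=(4.0.30.Final 4.0.31.Final 4.0.32.Final 4.0.33.Final 4.0.34.Final \
          4.0.36.Final 4.0.37.Final 4.0.39.Final 4.0.41.Final)

# Stub: real check = set the pom's Netty version, push, wait for Jenkins.
# Returns success (0) when the version "passes", i.e. predates the breakage.
test_version() {
  [[ "$1" < "4.0.37.Final" ]]
}

# Standard bisection: find the first index whose version fails.
lo=0; hi=$(( ${#versions[@]} - 1 ))
while (( lo < hi )); do
  mid=$(( (lo + hi) / 2 ))
  if test_version "${versions[$mid]}"; then lo=$((mid + 1)); else hi=$mid; fi
done
echo "first broken: ${versions[$lo]}"
```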
---
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
Still saw the following errors in the unit-test log:
```
16/09/13 07:41:18.817 shuffle-server-466-7 WARN TransportChannelHandler:
Exception in connection from /127.0.0.1:36871
io.netty.util.internal.OutOfDirectMemoryError
```
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65322/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65322 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65322/consoleFull)**
for PR 14961 at commit
[`faefd9c`](https://github.com/apache/spark/commit/faefd9c)
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65322 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65322/consoleFull)**
for PR 14961 at commit
[`faefd9c`](https://github.com/apache/spark/commit/faefd9c)
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Had a look to see how to do this:
https://github.com/netty/netty/blob/a01519e4f86690323647b5db45d9ffcb184b1a84/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java
so I'll add `io.netty.allocator.type`.
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Yep, that makes more sense; UnpooledByteBufAllocator usage coming up.
---
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
OK, so same failure with this change. Hm. I don't think it's that something
is just slow but that the error in
https://github.com/apache/spark/pull/14961#issuecomment-245090209 causes netty
to never
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
[info] - using external shuffle service *** FAILED *** (1 minute)
[info] java.util.concurrent.TimeoutException: Can't find 2 executors
before 60000 milliseconds elapsed
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3256 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3256/consoleFull)**
for PR 14961 at commit
[`502ebf4`](https://github.com/apache/spark/commit/502ebf4)
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3256 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3256/consoleFull)**
for PR 14961 at commit
[`502ebf4`](https://github.com/apache/spark/commit/502ebf4)
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65264/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65264 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65264/consoleFull)**
for PR 14961 at commit
[`502ebf4`](https://github.com/apache/spark/commit/502ebf4)
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65264 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65264/consoleFull)**
for PR 14961 at commit
[`502ebf4`](https://github.com/apache/spark/commit/502ebf4)
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
No new test failures with my runs ranging from Hadoop 2.3 to Hadoop 2.7
today, so I pushed the commit above.
---
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Sean, yep, I've had trouble reproducing it too. I kicked off a bunch of
builds over the weekend, including one using Hadoop 2.3, which was my initial
theory (the only difference between our testing environments).
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
@a-roberts are you in a position to add this change to this PR as an
experiment? I can try it on the side too. I can't seem to reproduce the failure
locally, even when fully rebuilding the project with the change.
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
In addition, I think we should figure out why upgrading the netty version
causes the failure. The issue about Recycler seems to also be in
`4.0.29.Final`. Is it because netty starts to track the memory footprint since
some version?
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
> I suppose one hacky way to test the theory above is to push a commit here
that sets this in NettyUtils:

Let's add it in `TransportConf` so that it's easy to find, since it's the
place for this kind of configuration.
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
I'm not familiar with netty's Recycler. But the default value of
`io.netty.recycler.maxCapacity` is 262144. This seems too big for Spark anyway.
I don't think we need to cache 260k objects.
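As a sketch only: the cap, like `io.netty.maxDirectMemory`, is a system property that has to be set before Netty's `Recycler` class initializes. The value 4096 below is an arbitrary illustration, not a tuned recommendation.

```java
// Illustrative only: cap Netty's per-thread Recycler cache far below the
// 4.0.x default of 262144 entries. 4096 is a made-up example value.
public class RecyclerCapSketch {
    public static void main(String[] args) {
        // Must happen before io.netty.util.Recycler is loaded, because the
        // property is read once in a static initializer and then cached.
        System.setProperty("io.netty.recycler.maxCapacity", "4096");
        System.out.println(System.getProperty("io.netty.recycler.maxCapacity"));
    }
}
```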
---
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Hm, I can't get this test to fail with Netty 4.0.41 when I `mvn install`
and run the test suite locally. I'm having a hard time seeing what could
alleviate the failure.
I suspect that this c
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
I think we can binary search the first broken netty version. That would
make it easy to find out the real issue.
---
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
> Is the lesson here to not bother with pooling and use the
UnpooledByteBufAllocator?
Not sure. Pooling improves performance because allocating direct buffers
is pretty slow.
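A crude way to see the cost that pooling amortizes, using only the JDK (no Netty). JIT and GC effects will skew the numbers, so treat it as an illustration rather than a benchmark.

```java
import java.nio.ByteBuffer;

// Crude illustration: direct ByteBuffers are typically more expensive to
// allocate than heap ones, which is the cost Netty's pooled allocator is
// designed to amortize.
public class DirectVsHeapAllocation {
    static long timeNs(Runnable r) {
        long start = System.nanoTime();
        r.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int iterations = 1_000;
        final int bufferSize = 16 * 1024; // 16 KiB per buffer

        long heapNs = timeNs(() -> {
            for (int i = 0; i < iterations; i++) ByteBuffer.allocate(bufferSize);
        });
        long directNs = timeNs(() -> {
            for (int i = 0; i < iterations; i++) ByteBuffer.allocateDirect(bufferSize);
        });

        System.out.println("heap:   " + heapNs + " ns");
        System.out.println("direct: " + directNs + " ns");
    }
}
```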
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Aha, possibly this:
https://groups.google.com/forum/#!topic/netty/3BoF7q34Z4I
Is the lesson here to not bother with pooling and use the
UnpooledByteBufAllocator?
---
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
I saw the error in the log:
```
16/09/05 08:21:56.758 shuffle-server-593-8 WARN TransportChannelHandler:
Exception in connection from /127.0.0.1:44788
io.netty.util.internal.OutOfDirectMemoryError
```
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Hm, no, I take it back; it's a consistent failure that doesn't show up in
the main test builds (for any Hadoop version):
```
[info] - using external shuffle service *** FAILED *** (1 minute)
```
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3249 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3249/consoleFull)**
for PR 14961 at commit
[`38ca07b`](https://github.com/apache/spark/commit/38ca07b)
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Thanks, I did a ctrl-F for "** fail". You'd have a better idea of what the
known flakies are in this farm, though. My quick checking:
- using external shuffle service -> looks to be a timeout
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Hm, I see just one in the PR builder here, really. And it's different from
run to run so this could well be spurious. Re-running tests one more time here.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3249 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3249/consoleFull)**
for PR 14961 at commit
[`38ca07b`](https://github.com/apache/spark/commit/38ca07b)
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
In the description I mentioned that for testing I used "Existing unit tests
against branch-1.6 and branch-2.0 using IBM Java 8 on Intel, Power and Z
architectures", so clarifying that I only used
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Are you saying thousands of tests fail with certain Hadoop versions and
this version change? That's hard to believe. I'd be very surprised if this
caused a test failure. However I do see this PR fail
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Thanks, so are we saying netty 4.0.29 can't be upgraded to 4.0.41 without
breaking changes? That's not even a minor version change...
On branch 1.6 with the netty change for myself I see 8
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
@jerryshao that's a good point, though in theory a maintenance release
contains no API or behavior changes (that aren't bugs). Let's perhaps not touch
1.6, then, to be conservative. Hadoop uses a different version.
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/14961
Also, many other downstream and upstream applications may use a different
version of the Netty jar; it would be better to keep these fundamental
dependencies stable.
---
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/14961
Upgrading the Netty version on branch 1.6 may cause an API incompatibility
issue for the YARN shuffle service; please see
[SPARK-16018](https://issues.apache.org/jira/browse/SPARK-16018) and
[SPARK-151
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3247 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3247/consoleFull)**
for PR 14961 at commit
[`38ca07b`](https://github.com/apache/spark/commit/38ca07b)
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3247 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3247/consoleFull)**
for PR 14961 at commit
[`38ca07b`](https://github.com/apache/spark/commit/38ca07b)
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3246 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3246/consoleFull)**
for PR 14961 at commit
[`38ca07b`](https://github.com/apache/spark/commit/38ca07b)
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3246 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3246/consoleFull)**
for PR 14961 at commit
[`38ca07b`](https://github.com/apache/spark/commit/38ca07b)
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64938/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #64938 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64938/consoleFull)**
for PR 14961 at commit
[`38ca07b`](https://github.com/apache/spark/commit/38ca07b)
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Looks good for master to 1.6
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #64938 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64938/consoleFull)**
for PR 14961 at commit
[`38ca07b`](https://github.com/apache/spark/commit/38ca07b)