Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
FYI, finally, I figured out the root cause:
https://github.com/netty/netty/issues/5833
As far as I understand, `System.setProperty("io.netty.maxDirectMemory",
"0");` should be a correct
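The workaround above can be sketched as follows. This is a minimal illustration, not code from this PR: the class and method names are made up, and the described effect of `io.netty.maxDirectMemory=0` (Netty falling back to ordinary JDK direct `ByteBuffer`s rather than its own accounting) is my reading of the linked netty#5833 discussion.

```java
// Minimal sketch: the property must be set before ANY Netty class loads,
// because Netty reads it once during class initialization. In practice
// that means the first lines of main(), or -Dio.netty.maxDirectMemory=0
// on the JVM command line.
class NettyMemoryConfig {

    /** Hypothetical helper; with the property set to "0", Netty (in the
     *  versions discussed here) stops doing its own direct-memory
     *  accounting and allocates plain JDK direct ByteBuffers instead. */
    static void disableNettyDirectMemoryTracking() {
        System.setProperty("io.netty.maxDirectMemory", "0");
    }

    public static void main(String[] args) {
        disableNettyDirectMemoryTracking();
        // Only after this point is it safe to touch Netty classes.
        System.out.println(System.getProperty("io.netty.maxDirectMemory"));
    }
}
```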
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
Agreed. I'm going to merge to master. Thanks!

Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
OK, I think this is a good change. Maybe to be conservative we'll only put
this in master.
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
> @zsxwing you seem to understand this better, but is it that the default
behavior changes and is probably a bad default now, or just that it's
inappropriate for Spark?
I don't have a
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
> For future reference here is the context of how that option is used:
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65361/
Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65361 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65361/consoleFull)**
for PR 14961 at commit
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Nice research here. So that's probably the only real way to set this
property? It has to be a system property, I guess, and it should fire before
the classes in question init, as far as I can see.
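The "must fire before the classes init" point can be demonstrated with a plain-JDK analogue (hypothetical class names; this is not Netty code, it just mimics how a `static final` field captures a system property exactly once):

```java
// Demonstrates why System.setProperty must run before class init:
// a value captured in a static final field never sees later updates.
class StaticInitPitfall {

    // Initialized exactly once, the first time Config is touched --
    // analogous to how Netty captures io.netty.maxDirectMemory.
    static class Config {
        static final String CAPTURED = System.getProperty("demo.flag", "default");
    }

    static String observedValue() {
        return Config.CAPTURED;
    }

    public static void main(String[] args) {
        System.setProperty("demo.flag", "early");    // before init: takes effect
        System.out.println(observedValue());         // prints "early"
        System.setProperty("demo.flag", "too-late"); // after init: silently ignored
        System.out.println(observedValue());         // still prints "early"
    }
}
```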
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Thanks @zsxwing, I've removed our older experiments in favour of this one.
For future reference here is the context of how that option is used:
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65361 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65361/consoleFull)**
for PR 14961 at commit
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
Confirmed the issue was introduced by
https://github.com/netty/netty/commit/d58dec8862e02fc2a98f8dcdb166db4b788be50a#diff-8d83d75ebf8a18cc48bf0a0b1183c188
Add
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
Hm, I can reproduce the same error using this command `build/sbt "project
core" "test-only *Shuffle*"` locally. The first broken version is 4.0.37.Final.
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
Oh, the allocator is set here:
https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java#L95
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
@a-roberts could you binary search the first broken netty version? Since
this cannot be reproduced locally, you have to push new commits.
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
Still saw the following errors in the unit-test log:
```
16/09/13 07:41:18.817 shuffle-server-466-7 WARN TransportChannelHandler: Exception in connection from /127.0.0.1:36871
```
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65322/
Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65322 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65322/consoleFull)**
for PR 14961 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65322 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65322/consoleFull)**
for PR 14961 at commit
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Had a look to see how to do this
https://github.com/netty/netty/blob/a01519e4f86690323647b5db45d9ffcb184b1a84/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java
so I'll add
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Yep that makes more sense, UnpooledByteBufAllocator usage coming up
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
OK, so same failure with this change. Hm. I don't think it's that something
is just slow but that the error in
https://github.com/apache/spark/pull/14961#issuecomment-245090209 causes netty
to
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
```
[info] - using external shuffle service *** FAILED *** (1 minute)
[info] java.util.concurrent.TimeoutException: Can't find 2 executors before 60000 milliseconds elapsed
```
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3256 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3256/consoleFull)**
for PR 14961 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3256 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3256/consoleFull)**
for PR 14961 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65264/
Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65264 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65264/consoleFull)**
for PR 14961 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #65264 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65264/consoleFull)**
for PR 14961 at commit
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
No new test failures with my runs ranging from Hadoop 2.3 to Hadoop 2.7
today so pushed the commit above
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Sean, yep, I've had trouble reproducing it too. I kicked off a bunch of
builds over the weekend, including one using Hadoop 2.3, which was my initial
theory (only difference between our testing
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
@a-roberts are you in a position to add this change to this PR as an
experiment? I can try it on the side too. I can't seem to reproduce the failure
locally, even when fully rebuilding the project
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
In addition, I think we should figure out why upgrading the netty version
fails. The Recycler issue seems to also exist in `4.0.29.Final`. Is it because
netty starts to track the memory footprint since
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
> I suppose one hacky way to test the theory above is to push a commit here
that sets this in NettyUtils:
Let's add it in `TransportConf` so that it's easy to find since it's the
place of
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
I'm not familiar with netty's Recycler. But the default value of
`io.netty.recycler.maxCapacity` is 262144. This seems too big for Spark anyway.
I don't think we need to cache 260k objects.
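If one wanted to shrink that cache, it could look like the sketch below (hypothetical helper; it assumes, per the comment above, that Netty's Recycler reads `io.netty.recycler.maxCapacity` once at class-initialization time, so this must run before any Netty class loads):

```java
// Sketch: cap Netty's per-thread Recycler cache well below the 262144
// default discussed above. Must run before any Netty class initializes.
class RecyclerConfig {

    static void capRecyclerCapacity(int maxCapacity) {
        System.setProperty("io.netty.recycler.maxCapacity",
                           Integer.toString(maxCapacity));
    }

    public static void main(String[] args) {
        capRecyclerCapacity(4096); // far smaller than the 262144 default
        System.out.println(System.getProperty("io.netty.recycler.maxCapacity"));
    }
}
```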
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Hm, I can't get this test to fail with Netty 4.0.41 when I 'mvn install'
and run the test suite locally. I'm having a hard time seeing what could
alleviate the failure.
I suspect that this
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
I think we can binary search the first broken netty version. That would
make it easy to find the real issue.
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
> Is the lesson here to not bother with pooling and use the
UnpooledByteBufAllocator?
Not sure. Pooling is there to improve performance, because allocating
direct buffers is pretty slow.
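The cost asymmetry behind that trade-off can be seen with plain JDK buffers (illustrative only; this shows the two allocation paths that Netty's pooled and unpooled allocators ultimately sit on top of):

```java
import java.nio.ByteBuffer;

// Direct buffers live outside the Java heap (native memory, freed via a
// cleaner), which makes each allocation much more expensive than a heap
// buffer -- the reason pooled allocators cache and reuse them.
class DirectBufferDemo {

    static ByteBuffer direct(int size) { return ByteBuffer.allocateDirect(size); }
    static ByteBuffer heap(int size)   { return ByteBuffer.allocate(size); }

    public static void main(String[] args) {
        ByteBuffer d = direct(64 * 1024);
        ByteBuffer h = heap(64 * 1024);
        System.out.println(d.isDirect() + " " + h.isDirect()); // prints "true false"
    }
}
```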
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Aha, possibly this:
https://groups.google.com/forum/#!topic/netty/3BoF7q34Z4I
Is the lesson here to not bother with pooling and use the
UnpooledByteBufAllocator?
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/14961
I saw the error in the log:
```
16/09/05 08:21:56.758 shuffle-server-593-8 WARN TransportChannelHandler: Exception in connection from /127.0.0.1:44788
```
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Hm, no I take it back, it's a consistent failure that doesn't show up in
the main test builds (for any Hadoop version):
```
[info] - using external shuffle service *** FAILED *** (1 minute)
```
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3249 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3249/consoleFull)**
for PR 14961 at commit
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Thanks, I did a ctrl-F for "** fail". You'd have a better idea of what the
known flakies are in this farm, though. My quick check:
- using external shuffle service -> looks to be a
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Hm, I see just one in the PR builder here, really. And it's different from
run to run so this could well be spurious. Re-running tests one more time here.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3249 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3249/consoleFull)**
for PR 14961 at commit
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
In the description I mentioned that for testing I used "Existing unit tests
against branch-1.6 and branch-2.0 using IBM Java 8 on Intel, Power and Z
architectures", so clarifying that I only used
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Are you saying thousands of tests fail with certain Hadoop versions and
this version change? That's hard to believe. I'd be very surprised if this
caused a test failure. However I do see this PR
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/14961
Thanks, so are we saying netty 4.0.29 can't be upgraded to 4.0.41 without
breaking changes? That's not even a minor version change...
On branch 1.6 with the netty change for myself I see
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
@jerryshao that's a good point though in theory a maintenance release
contains no API or behavior changes (that aren't bugs). Let's perhaps not touch
1.6 then to be conservative. Hadoop uses a
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/14961
Many other downstream and upstream applications may also use different
versions of the Netty jar; it would be better to keep these fundamental
dependencies stable.
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/14961
Upgrading Netty version to branch 1.6 may cause API version incompatible
issue for yarn shuffle service, please see
[SPARK-16018](https://issues.apache.org/jira/browse/SPARK-16018) and
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3247 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3247/consoleFull)**
for PR 14961 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3247 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3247/consoleFull)**
for PR 14961 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3246 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3246/consoleFull)**
for PR 14961 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #3246 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3246/consoleFull)**
for PR 14961 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14961
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64938/
Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #64938 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64938/consoleFull)**
for PR 14961 at commit
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14961
Looks good for master to 1.6
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14961
**[Test build #64938 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64938/consoleFull)**
for PR 14961 at commit