Github user srowen commented on the issue:
https://github.com/apache/spark/pull/21456
OK by me except those branches are about to cut maintenance releases. I'd
wait for right now I suppose, but, do not feel strongly about it if someone
wants to take responsibility for merging it now.
Github user Tagar commented on the issue:
https://github.com/apache/spark/pull/21456
@srowen can this be backported to 2.2 and 2.3 master branches as well?
Thank you @countmdm
---
-
To unsubscribe, e-mail:
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/21456
Merged to master. Probably Ok to backport if anyone feels strongly about
it, but, let's let 2.3.1 and other imminent releases go out first.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21456
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91362/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21456
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21456
**[Test build #91362 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91362/testReport)**
for PR 21456 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21456
**[Test build #91362 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91362/testReport)**
for PR 21456 at commit
Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
Just modified the code to use regexp and pushed the updates.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/21456
I believe this would work, if the goal is simply to squash repeated slashes
into one:
```
private static final Pattern MULTIPLE_SEPARATORS =
Pattern.compile(File.separator + "{2,}");
Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
Ok, if you believe this is not a performance problem here, then it's fine
with me. To save us some possible further bouncing of this review, can you
please share here your pattern/regex code? I
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/21456
Hm, I guess I was supposing that, in the context of a file stream, whatever
I/O is done with it is going to be much more significant than string
manipulation of its path. Would this really be
Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
@squito wrt. the added code for path normalization: exactly as you say.
This is just a precaution in case spark (or even some code that above spark)
ends up generating pathnames that contain
Github user squito commented on the issue:
https://github.com/apache/spark/pull/21456
Re: normalization -- if I understand correctly, its not that you know that
the normalization definitely *does* change the strings for the heap dump you
have. Its just to make sure that your change
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/21456
OK I get it, it's not that different forms of the same path are formed, but
that they're not canonical to begin with and so discarded by the File
internals. And this logic isn't the full
Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
If we don't do normalization ourselves, we may potentially run into the
following:
path = ... // Produces "foo//bar"
path = path.intern(); // Ok, no separate copies of "foo//bar"
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/21456
OK, that's good evidence this is causing a lot of garbage. Although yes
ideally we figure out just why there are so many of these stream objects, I can
see optimizing this as it is. Yes I understand
Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
@srowen yes, I am pretty sure that this code generates all these duplicate
objects. I've analyzed a heap dump from a real customer, so I cannot publish
the entire jxray report, since it may
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/21456
I guess I'm just very surprised if this single line is responsible for 10%
of objects on the heap - are you sure? Can we otherwise optimize it elsewhere
in the code? I am also not sure why this line
Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
I confirm that the -XX:+UseStringDeduplication option is available only
with G1 GC, and it is off by default. So if we decide to use it, I guess we
won't be able to enforce it reliably, especially
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21456
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91321/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21456
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21456
**[Test build #91321 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)**
for PR 21456 at commit
Github user squito commented on the issue:
https://github.com/apache/spark/pull/21456
From my understanding, that option is only available with G1GC, which is
not really a good fit for spark (forget the exact details but something about
humongous allocations which are common with all
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21456
**[Test build #91321 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)**
for PR 21456 at commit
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21456
ok to test
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21456
Can one of the admins verify this patch?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user squito commented on the issue:
https://github.com/apache/spark/pull/21456
Ok to test
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
Yes.
On Tue, May 29, 2018 at 1:18 PM, UCB AMPLab
wrote:
> Can one of the admins verify this patch?
>
> â
> You are receiving this because you authored the
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21456
Can one of the admins verify this patch?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
29 matches
Mail list logo