[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-06-06 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21456
  
OK by me except those branches are about to cut maintenance releases. I'd 
wait for right now I suppose, but, do not feel strongly about it if someone 
wants to take responsibility for merging it now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-06-06 Thread Tagar
Github user Tagar commented on the issue:

https://github.com/apache/spark/pull/21456
  
@srowen can this be backported to 2.2 and 2.3 master branches as well? 
Thank you @countmdm 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-06-02 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21456
  
Merged to master. Probably Ok to backport if anyone feels strongly about 
it, but, let's let 2.3.1 and other imminent releases go out first.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21456
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91362/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21456
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21456
  
**[Test build #91362 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91362/testReport)**
 for PR 21456 at commit 
[`f49512b`](https://github.com/apache/spark/commit/f49512b61a6d5b73f4d5966c557867ee55be0b5a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21456
  
**[Test build #91362 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91362/testReport)**
 for PR 21456 at commit 
[`f49512b`](https://github.com/apache/spark/commit/f49512b61a6d5b73f4d5966c557867ee55be0b5a).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue:

https://github.com/apache/spark/pull/21456
  
Just modified the code to use regexp and pushed the updates.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21456
  
I believe this would work, if the goal is simply to squash repeated slashes 
into one:

```
private static final Pattern MULTIPLE_SEPARATORS = 
Pattern.compile(File.separator + "{2,}");
private static String createNormalizedInternedPathname(String dir1, String 
dir2, String fname) {
  Matcher m = MULTIPLE_SEPARATORS.matcher(dir1 + File.separator + dir2 + 
File.separator + fname);
  return m.replaceAll("/").intern();
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue:

https://github.com/apache/spark/pull/21456
  
Ok, if you believe this is not a performance problem here, then it's fine 
with me. To save us some possible further bouncing of this review, can you 
please share here your pattern/regex code? I will then change my patch and 
re-push it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21456
  
Hm, I guess I was supposing that, in the context of a file stream, whatever 
I/O is done with it is going to be much more significant than string 
manipulation of its path. Would this really be called, say, a million times in 
a short period? I ran a quick benchmark to sense-check and looks like those 
million replacements take all of a second of CPU time. (This is with a 
precompiled Pattern)  It's minor here, but that's nontrivial extra code to get 
right and read and maintain, if that's all it's doing. Yes if it were an 
extreme hotspot it could be called for.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue:

https://github.com/apache/spark/pull/21456
  
@squito wrt. the added code for path normalization: exactly as you say. 
This is just a precaution in case spark (or even some code that above spark) 
ends up generating pathnames that contain repeated slashes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/21456
  
Re: normalization -- if I understand correctly, its not that you know that 
the normalization definitely *does* change the strings for the heap dump you 
have.  Its just to make sure that your change is effective even if 
normalization were to change things.  In practice, I don't think spark's usage 
should lead to any de-normalized paths, but I think its a good precaution.

Re: so many objects.  I don't think its that surprising actually.  Imagine 
a shuffle on a large cluster writing to 10k partitions.  The shuffle-read side 
is going to make a lot of simultaneous requests to the same shuffle-write side 
task -- all that data lives in the same file, just at different offsets.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21456
  
OK I get it, it's not that different forms of the same path are formed, but 
that they're not canonical to begin with and so discarded by the File 
internals. And this logic isn't the full canonicalization logic reimplemented, 
but just eliminating successive "/"?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue:

https://github.com/apache/spark/pull/21456
  
If we don't do normalization ourselves, we may potentially run into the 
following: 

path = ...  // Produces "foo//bar"
path = path.intern();  // Ok, no separate copies of "foo//bar" anymore
File f = new Fille(path);
// Internally, code in File() constructor looks at the given path.
// If path is not in the normalized form, it normalizes it, producing a new 
stirng "foo/bar"
System.out.println(f.getPath());  // Prints "foo/bar"

Since the code inside java.io.File doesn't do any string interning, we will 
keep canonicalizing the "foo//bar" string, but then java.io.File will still 
generate multiple copies of "foo/bar".



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21456
  
OK, that's good evidence this is causing a lot of garbage. Although yes 
ideally we figure out just why there are so many of these stream objects, I can 
see optimizing this as it is. Yes I understand the intern issue here, but, not 
the need for normalization. It sounded like the strings were already mostly 
identical?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue:

https://github.com/apache/spark/pull/21456
  
@srowen yes, I am pretty sure that this code generates all these duplicate 
objects. I've analyzed a heap dump from a real customer, so I cannot publish 
the entire jxray report, since it may potentially reveal some sensitive data. 
But I've just attached a screenshot of the part that covers duplicate strings 
to https://issues.apache.org/jira/browse/SPARK-24356 - hopefully it will give 
you a better idea of what's going on here. Note that the biggest part of the 
overhead (10.5%) come from FileInputStream.path strings. The FileInputStream 
objects themselves are unreachable (but they apparently stay in memory for long 
enough - if they were GCed immediately, they wouldn't be reflected in the heap 
dump). The sample of paths in FileInputStream.path look identical to the sample 
of File.paths. This, plus looking at the code, makes me think that the spark 
code in question generates a large number of File objects with a small number 
of unique paths, and then generates an even larger number of
  FileInputStream objects with the same paths.

I don't think it can be optimized somewhere else, short of accessing and 
interning paths directly in the File objects using Java reflection. As to why 
this produces many different copies of identical Strings - consider the 
following simple code:

String s1 = "foo";
String bar = "bar";
String s1 = foo + "/" + bar;
String s2 = foo + "/" + bar;
System.out.println(s1.equals(s2));
System.out.println(s1 == s2);

This will print
true
false

That is, s1 and s2 have the same contents, but they are two different 
objects, taking twice the memory.

In the spark code in question we have the same main problem: by 
concatenating identical initial strings, it keeps generating multiple copies of 
identical "derived" strings. To remove duplicates among these derived strings, 
we need to apply String.intern(). However, to make sure that the code in 
java.io.File will not subsequently change our canonicalized strings if they are 
not in the normalized form, we need to do this normalization ourselves as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21456
  
I guess I'm just very surprised if this single line is responsible for 10% 
of objects on the heap - are you sure? Can we otherwise optimize it elsewhere 
in the code? I am also not sure why this line would produce different paths for 
the same canonical path? because that's what's driving adding all this 
normalization logic.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue:

https://github.com/apache/spark/pull/21456
  
I confirm that the -XX:+UseStringDeduplication option is available only 
with G1 GC, and it is off by default. So if we decide to use it, I guess we 
won't be able to enforce it reliably, especially for applications that just use 
some library code from Spark (by the way, this issue was found in Yarn Node 
Manager). Another issue with the string deduplication in G1 is that it's not 
aggressive at all. It's done by one thread, that scans the entire heap when 
other threads are loaded lightly enough. In the case that we investigated, the 
number of duplicate strings was high, yet they were relatively short-lived. 
That is, they survived for enough time to create significant pressure on the 
GC, but I don't think that this timeframe would be enough for the deduplication 
thread to eliminate them.

In summary, this kind of targeted explicit string deduplication is not 
uncommon at all, and works really well. Usually you just need to add the 
.intern() call in a few places in the code. What I had to do here is quite 
involved because of the extra problem with java.io.File.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21456
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91321/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21456
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21456
  
**[Test build #91321 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)**
 for PR 21456 at commit 
[`88bb478`](https://github.com/apache/spark/commit/88bb4780d20ad952aa1936f4e78a420d9baf0f2c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-30 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/21456
  
From my understanding, that option is only available with G1GC, which is 
not really a good fit for spark (forget the exact details but something about 
humongous allocations which are common with all the large byte buffers typical 
for spark).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21456
  
**[Test build #91321 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)**
 for PR 21456 at commit 
[`88bb478`](https://github.com/apache/spark/commit/88bb4780d20ad952aa1936f4e78a420d9baf0f2c).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21456
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21456
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/21456
  
Ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread countmdm
Github user countmdm commented on the issue:

https://github.com/apache/spark/pull/21456
  
Yes.

On Tue, May 29, 2018 at 1:18 PM, UCB AMPLab 
wrote:

> Can one of the admins verify this patch?
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> , or 
mute
> the thread
> 

> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21456
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org