[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-06-06 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21456 OK by me except those branches are about to cut maintenance releases. I'd wait for right now I suppose, but, do not feel strongly about it if someone wants to take responsibility for merging it now.

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-06-06 Thread Tagar
Github user Tagar commented on the issue: https://github.com/apache/spark/pull/21456 @srowen can this be backported to 2.2 and 2.3 master branches as well? Thank you @countmdm --- - To unsubscribe, e-mail:

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-06-02 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21456 Merged to master. Probably Ok to backport if anyone feels strongly about it, but, let's let 2.3.1 and other imminent releases go out first. ---

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21456 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91362/ Test PASSed. ---

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21456 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21456 **[Test build #91362 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91362/testReport)** for PR 21456 at commit

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21456 **[Test build #91362 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91362/testReport)** for PR 21456 at commit

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 Just modified the code to use regexp and pushed the updates. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21456 I believe this would work, if the goal is simply to squash repeated slashes into one: ``` private static final Pattern MULTIPLE_SEPARATORS = Pattern.compile(File.separator + "{2,}");

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 Ok, if you believe this is not a performance problem here, then it's fine with me. To save us some possible further bouncing of this review, can you please share here your pattern/regex code? I

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21456 Hm, I guess I was supposing that, in the context of a file stream, whatever I/O is done with it is going to be much more significant than string manipulation of its path. Would this really be

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 @squito wrt. the added code for path normalization: exactly as you say. This is just a precaution in case spark (or even some code that above spark) ends up generating pathnames that contain

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/21456 Re: normalization -- if I understand correctly, its not that you know that the normalization definitely *does* change the strings for the heap dump you have. Its just to make sure that your change

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21456 OK I get it, it's not that different forms of the same path are formed, but that they're not canonical to begin with and so discarded by the File internals. And this logic isn't the full

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 If we don't do normalization ourselves, we may potentially run into the following: path = ... // Produces "foo//bar" path = path.intern(); // Ok, no separate copies of "foo//bar"

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21456 OK, that's good evidence this is causing a lot of garbage. Although yes ideally we figure out just why there are so many of these stream objects, I can see optimizing this as it is. Yes I understand

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 @srowen yes, I am pretty sure that this code generates all these duplicate objects. I've analyzed a heap dump from a real customer, so I cannot publish the entire jxray report, since it may

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21456 I guess I'm just very surprised if this single line is responsible for 10% of objects on the heap - are you sure? Can we otherwise optimize it elsewhere in the code? I am also not sure why this line

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 I confirm that the -XX:+UseStringDeduplication option is available only with G1 GC, and it is off by default. So if we decide to use it, I guess we won't be able to enforce it reliably, especially

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21456 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91321/ Test PASSed. ---

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21456 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21456 **[Test build #91321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)** for PR 21456 at commit

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-30 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/21456 From my understanding, that option is only available with G1GC, which is not really a good fit for spark (forget the exact details but something about humongous allocations which are common with all

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21456 **[Test build #91321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)** for PR 21456 at commit

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21456 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21456 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/21456 Ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 Yes. On Tue, May 29, 2018 at 1:18 PM, UCB AMPLab wrote: > Can one of the admins verify this patch? > > — > You are receiving this because you authored the

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21456 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional