[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account
[ https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269263#comment-14269263 ]

Al M commented on SPARK-5137:
-----------------------------

That's right: {{a}} has 11 partitions and {{b}} has a lot more. I can see why you wouldn't want to force a shuffle on {{a}}, since that would be unnecessary processing. Thanks for your detailed explanation and quick response. I'll close this since I agree that it behaves correctly.

> subtract does not take the spark.default.parallelism into account
> -----------------------------------------------------------------
>
>              Key: SPARK-5137
>              URL: https://issues.apache.org/jira/browse/SPARK-5137
>          Project: Spark
>       Issue Type: Bug
> Affects Versions: 1.2.0
>      Environment: CentOS 6; Scala
>         Reporter: Al M
>         Priority: Trivial
>
> The 'subtract' function (PairRDDFunctions.scala) in Scala does not use the default parallelism value set in the config (spark.default.parallelism). This is easy enough to work around: I can just load the property and pass it in as an argument. It would be great if subtract used the default value, just like all the other PairRDDFunctions.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account
[ https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268969#comment-14268969 ]

Al M commented on SPARK-5137:
-----------------------------

Yes, I do mean subtractByKey; sorry for not being clear. I'm new to Spark, and it could be that I just don't understand something correctly. Below is a more detailed description of the results I saw. I have default parallelism set to 160 since I am limited on memory and I am working with a lot of data.

* Map is run [11 tasks]
* Filter is run [2 tasks]
* Join with another RDD and run map [160 tasks]
* Join with another RDD and map again [160 tasks]
* SubtractByKey is run [11 tasks]

In the last step I run out of memory because subtractByKey was only split into 11 tasks. If I override the partitions to 160, it works fine. I thought that subtractByKey would use the default parallelism just like the other tasks after the join.

If the expected solution is that I override the partitions in my call, I'm fine with that. So far I have managed to avoid setting it in any calls and just set the default parallelism instead. I was concerned that the behavior I observed was part of an actual issue.
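The stage widths above follow from partition counts: narrow operations like map keep the parent's partition count, the join's count was set explicitly, and subtractByKey falls back to its left-hand RDD's count. A hypothetical local-mode sketch (the object name and data are invented, not from this issue) showing the same pattern:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageWidths {
  // Returns (partition count after map, after join, after subtractByKey).
  def run(): (Int, Int, Int) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("stage-widths"))
    try {
      val small = sc.parallelize(1 to 100, 11).map(i => (i % 10, i)) // map keeps 11 partitions
      val other = sc.parallelize(1 to 100, 4).map(i => (i % 10, i))

      val joined = small.join(other, 160)         // count set explicitly -> 160 tasks
      val subtracted = small.subtractByKey(other) // inherits small's 11 partitions -> 11 tasks

      (small.partitions.length, joined.partitions.length, subtracted.partitions.length)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // prints (11,160,11)
}
```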
[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account
[ https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269168#comment-14269168 ]

Sean Owen commented on SPARK-5137:
----------------------------------

When you run {{a.subtractByKey(b)}}, I assume that {{a}} has 11 partitions, hence 11 tasks. That's why {{subtractByKey}} also produces 11 partitions; otherwise, you would have to shuffle. So far this all sounds like intended behavior.

The keys from {{a}} are held in memory, per partition. If that's running out of memory (and I'd double-check it's really this step), then yes, you do want to force a shuffle with a repartition. That can happen by manually setting the number of partitions or by repartitioning {{a}} beforehand. But I think the default is correct here, and consistent with the rest of Spark; you would not want to force a repartition. Look at the other by-key methods: they don't force the default parallelism either.
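The two ways of forcing the shuffle mentioned above can be sketched in local mode (the object name and data are invented for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ForceShuffle {
  // Returns partition counts for the two ways of widening subtractByKey.
  def run(): (Int, Int) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("force-shuffle"))
    try {
      val a = sc.parallelize(Seq(("k1", 1), ("k2", 2)), 11) // 11 partitions
      val b = sc.parallelize(Seq(("k1", 9)), 3)

      // Option 1: set the number of partitions in the call itself
      val viaArgument = a.subtractByKey(b, 160)

      // Option 2: repartition a beforehand; subtractByKey then keeps the new count
      val viaRepartition = a.repartition(160).subtractByKey(b)

      (viaArgument.partitions.length, viaRepartition.partitions.length)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // prints (160,160)
}
```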
[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account
[ https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268061#comment-14268061 ]

Sean Owen commented on SPARK-5137:
----------------------------------

(You mean {{subtractByKey}}? That's the one in {{PairRDDFunctions}}. I think this applies to both.) It uses the parallelism of 'self' by default, which seems like a good idea: forcing the default parallelism by default could mean a pointless shuffle. You can override it if you need to. What should change?
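The override mentioned above is the existing {{subtractByKey(other, numPartitions)}} overload in {{PairRDDFunctions}}. A minimal local-mode sketch of the reporter's workaround, reading spark.default.parallelism from the config and passing it explicitly (the object name and data here are invented):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SubtractWorkaround {
  // Returns (default partition count, overridden partition count).
  def run(): (Int, Int) = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("subtractByKey-default")
      .set("spark.default.parallelism", "160")
    val sc = new SparkContext(conf)
    try {
      val a = sc.parallelize(Seq(("k1", 1), ("k2", 2)), 11) // 11 partitions
      val b = sc.parallelize(Seq(("k1", 9)), 3)

      // Default overload: keeps a's partition count of 11
      val byDefault = a.subtractByKey(b)

      // Workaround: load the configured default parallelism, pass it as an argument
      val p = sc.getConf.getInt("spark.default.parallelism", sc.defaultParallelism)
      val overridden = a.subtractByKey(b, p) // 160 partitions

      (byDefault.partitions.length, overridden.partitions.length)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // prints (11,160)
}
```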