[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account

2015-01-08 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269263#comment-14269263
 ] 

Al M commented on SPARK-5137:
-

That's right.  {{a}} has 11 partitions and {{b}} has a lot more.  I can see 
why you wouldn't want to force a shuffle on {{a}}, since that's unnecessary 
processing.

Thanks for your detailed explanation and quick response.  I'll close this since 
I agree that it behaves correctly.

 subtract does not take the spark.default.parallelism into account
 ------------------------------------------------------------------

 Key: SPARK-5137
 URL: https://issues.apache.org/jira/browse/SPARK-5137
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: CentOS 6; Scala
Reporter: Al M
Priority: Trivial

 The {{subtract}} function (PairRDDFunctions.scala) in Scala does not use the 
 default parallelism value set in the config ({{spark.default.parallelism}}).  
 This is easy enough to work around: I can just load the property and pass it 
 in as an argument.
 It would be great if {{subtract}} used the default value, just like all the 
 other {{PairRDDFunctions}}.
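
 For illustration, a minimal sketch of that workaround (it assumes an active 
 {{SparkContext}} {{sc}} and two illustrative RDDs {{a}} and {{b}}; none of 
 these names are from the original report):
{code}
// Hedged sketch of the workaround: read the configured value and pass it
// in explicitly. sc, a, and b are assumed/illustrative, not from the report.
val parallelism = sc.getConf.getInt("spark.default.parallelism", sc.defaultParallelism)
val diff = a.subtract(b, parallelism)  // explicit partition count for the result
{code}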




[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account

2015-01-08 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268969#comment-14268969
 ] 

Al M commented on SPARK-5137:
-

Yes, I do mean {{subtractByKey}}.  Sorry for not being clear.

I'm new to Spark, so it may be that I'm misunderstanding something.  Below is 
a more detailed description of the results I saw.  I have default parallelism 
set to 160 because I am limited on memory and working with a lot of data.

* Map is run [11 tasks]
* Filter is run [2 tasks]
* Join with another RDD and run map [160 tasks]
* Join with another RDD and map again [160 tasks]
* SubtractByKey is run [11 tasks]

In the last step I run out of memory because {{subtractByKey}} was split into 
only 11 tasks.  If I override the partition count to 160 then it works fine.  
I expected {{subtractByKey}} to use the default parallelism just like the 
other stages after the join.
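
To illustrate, here is a hedged reconstruction of the pipeline (the names, 
data sizes, and local master are made up, not my real job; only the partition 
counts matter):

{code}
// Hedged reconstruction; data and names are illustrative, not the real job.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.2 era)

val conf = new SparkConf()
  .setMaster("local[*]")                    // illustrative
  .setAppName("SubtractByKeyParallelism")
  .set("spark.default.parallelism", "160")
val sc = new SparkContext(conf)

val a = sc.parallelize(1 to 100000, 11).map(i => (i, i))  // 11 partitions
val b = sc.parallelize(1 to 200000).map(i => (i, -i))     // 160 (default)

a.join(b).count()               // shuffle: 160 tasks, as expected
a.subtractByKey(b).count()      // only 11 tasks: follows a's partitioning
a.subtractByKey(b, 160).count() // explicit override: 160 tasks
{code}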

If the expected solution is that I override the partition count in my call, 
I'm fine with that.  So far I had managed to avoid setting it in any calls 
and just set the default parallelism instead.  I was concerned that the 
observed behavior pointed to an actual issue.


[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account

2015-01-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269168#comment-14269168
 ] 

Sean Owen commented on SPARK-5137:
--

When you run {{a.subtractByKey(b)}} I assume that {{a}} has 11 partitions, 
hence 11 tasks.  That's why {{subtractByKey}} also produces 11 partitions; 
otherwise, you would have to shuffle.  So far this all sounds like intended 
behavior.

The keys from {{a}} are held in memory, per partition.  If that's what is 
running out of memory (and I'd double-check it's really this step), then yes, 
you do want to force a shuffle with a repartition.  That can happen by 
setting the number of partitions manually or by repartitioning {{a}} 
beforehand.
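
For example, either of these forces the wider shuffle (sketches; {{a}} and 
{{b}} are the RDDs from the discussion, and 160 stands in for the desired 
width):

{code}
// Two illustrative ways to widen subtractByKey to 160 partitions:
val viaArgument    = a.subtractByKey(b, 160)               // explicit count
val viaRepartition = a.repartition(160).subtractByKey(b)   // reshuffle a first
{code}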

But I think the default is correct here, and consistent with the rest of 
Spark: you would not want to force a repartition.  Look at the other by-key 
methods; they don't force the default parallelism either.


[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account

2015-01-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268061#comment-14268061
 ] 

Sean Owen commented on SPARK-5137:
--

(You mean {{subtractByKey}}?  That's the one in {{PairRDDFunctions}}.  I 
think this applies to both.)

It uses the parallelism of {{self}} by default, which seems like a good idea: 
forcing the default parallelism instead could mean a pointless shuffle.  You 
can override it if you need to.  What should change?
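
For reference, the no-argument overload in {{PairRDDFunctions}} does roughly 
the following (paraphrased from the 1.2-era source; check the actual code for 
the exact form):

{code}
// Paraphrase of the default overload: it reuses self's partitioner (or
// partition count) rather than spark.default.parallelism, so no shuffle
// is forced on self.
def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)] =
  subtractByKey(other, self.partitioner.getOrElse(
    new HashPartitioner(self.partitions.length)))
{code}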
