RE: Intermediate stage will be cached automatically ?

2015-06-17 Thread Mark Tse
I think https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence might shed some light on the behaviour you’re seeing. Mark From: canan chen [mailto:ccn...@gmail.com] Sent: June-17-15 5:57 AM To: spark users Subject: Intermediate stage will be cached automatically ? Here's

Unit Testing Spark Transformations/Actions

2015-06-16 Thread Mark Tse
Hi there, I am looking to use Mockito to mock out some functionality while unit testing a Spark application. I currently have code that happily runs on a cluster, but fails when I try to run unit tests against it, throwing a SparkException: org.apache.spark.SparkException: Job aborted due to

ReduceByKey with a byte array as the key

2015-06-11 Thread Mark Tse
I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s with the same contents are treated as different keys because they are compared by reference. I didn't see any way to pass in a custom comparer. I could convert the byte[] into a String with an explicit charset, but

RE: ReduceByKey with a byte array as the key

2015-06-11 Thread Mark Tse
Makes sense – I suspect what you suggested should work. However, I think the overhead between this and using `String` would be similar enough to warrant just using `String`. Mark From: Sonal Goyal [mailto:sonalgoy...@gmail.com] Sent: June-11-15 12:58 PM To: Mark Tse Cc: user@spark.apache.org
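
For readers landing on this thread: a minimal sketch of one common way to get content-based equality for byte[] keys, shown next to the String conversion discussed above. It is not necessarily what was suggested in the reply being quoted. `BytesKey`, `ByteArrayKeyDemo` and `sumByKey` are hypothetical names, and a plain `HashMap` stands in for Spark's per-key aggregation so that the example runs without a cluster.

```java
import java.io.Serializable;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ByteArrayKeyDemo {

    /** Hypothetical wrapper that gives a byte[] content-based equals/hashCode. */
    static final class BytesKey implements Serializable {
        private static final long serialVersionUID = 1L;
        private final byte[] bytes;

        BytesKey(byte[] bytes) {
            // Defensive copy so later mutation of the caller's array cannot change the key.
            this.bytes = bytes.clone();
        }

        byte[] toBytes() {
            return bytes.clone();
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof BytesKey
                    && Arrays.equals(bytes, ((BytesKey) other).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

    /** Local stand-in for a reduceByKey with addition: sums the values per key. */
    static Map<BytesKey, Integer> sumByKey(List<Map.Entry<byte[], Integer>> pairs) {
        Map<BytesKey, Integer> totals = new HashMap<>();
        for (Map.Entry<byte[], Integer> pair : pairs) {
            totals.merge(new BytesKey(pair.getKey()), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<byte[], Integer>> pairs = new ArrayList<>();
        // Two distinct array instances with identical contents, plus one different key.
        pairs.add(new SimpleEntry<byte[], Integer>(new byte[] {1, 2}, 10));
        pairs.add(new SimpleEntry<byte[], Integer>(new byte[] {1, 2}, 5));
        pairs.add(new SimpleEntry<byte[], Integer>(new byte[] {3}, 7));

        Map<BytesKey, Integer> totals = sumByKey(pairs);
        System.out.println(totals.size());                                // prints 2
        System.out.println(totals.get(new BytesKey(new byte[] {1, 2}))); // prints 15
    }
}
```

In a Spark job the same idea would mean mapping each pair's byte[] key to the wrapper before the keyed operation; the wrapper is marked `Serializable` here on the assumption that keys are shipped between executors. Whether this is cheaper than a String key depends on the data and is not measured here.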