Re: mapPartitions - How Does it Work
List(x.next).iterator gives you just the first element from each partition, which here is 1, 4 and 7 respectively.

On 3/18/15, 10:19 AM, ashish.usoni ashish.us...@gmail.com wrote:

I am trying to understand mapPartitions, but I am still not sure how it works. In the example below it creates three partitions:

    val parallel = sc.parallelize(1 to 10, 3)

and when we do

    parallel.mapPartitions(x => List(x.next).iterator).collect

it prints Array[Int] = Array(1, 4, 7).

Can someone please explain why it prints only 1, 4, 7?

Thanks,
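To make the per-partition behavior concrete, here is a small sketch (assuming a local SparkContext named sc; the comments on how 1 to 10 is split across partitions are inferred from the output shown in the thread):

    // 10 elements in 3 partitions, roughly (1,2,3), (4,5,6), (7,8,9,10)
    val parallel = sc.parallelize(1 to 10, 3)

    // The function passed to mapPartitions receives one Iterator per partition.
    // Taking only x.next keeps just the first element of each partition.
    val firsts = parallel.mapPartitions(x => List(x.next).iterator).collect()
    // firsts: Array[Int] = Array(1, 4, 7)

    // Returning the iterator unchanged keeps every element instead:
    val all = parallel.mapPartitions(x => x).collect()
    // all: Array[Int] = Array(1, 2, ..., 10)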
RDD pair to pair of RDDs
What's the best way to go from RDD[(A, B)] to (RDD[A], RDD[B])?

If I do the obvious solution:

    def separate[A, B](k: RDD[(A, B)]) = (k.map(_._1), k.map(_._2))

this runs two maps on the cluster. Can I do some kind of fold instead:

    def separate[A, B](l: List[(A, B)]) =
      l.foldLeft((List[A](), List[B]()))((a, b) => (b._1 :: a._1, b._2 :: a._2))

But that obviously has an aggregate component that I don't want to be running on the driver, right?

Thanks,
Alex
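For what it's worth, a minimal sketch of the two-map approach with caching (function name and storage level are my choices, not from the thread). Persisting the pair RDD means the upstream lineage is only computed once, even though materializing the two result RDDs still takes two passes over the cached data:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Sketch: cache the pair RDD so the two map passes do not recompute
    // the parent lineage twice. ClassTags are required because RDD.map
    // needs them for the result element types.
    def separate[A: ClassTag, B: ClassTag](k: RDD[(A, B)]): (RDD[A], RDD[B]) = {
      k.persist(StorageLevel.MEMORY_AND_DISK)
      (k.map(_._1), k.map(_._2))
    }

A fold on the driver would pull all the data into local collections, so staying with two distributed maps over a cached RDD keeps the work on the cluster.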
Memory Settings for local execution context
The page that describes the settings, http://spark.apache.org/docs/1.2.1/configuration.html, seems not to apply when running a local context.

I have a shell script that starts my job:

    export SPARK_MASTER_OPTS=-Dsun.io.serialization.extendedDebugInfo=true
    export SPARK_WORKER_OPTS=-Dsun.io.serialization.extendedDebugInfo=true
    /Users/spark/spark/bin/spark-submit \
      --class jobs.MyJob \
      --master local[1] \
      --conf spark.executor.memory=8g \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.driver.memory=10g \
      --conf spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true \
      target/scala-2.10/my-job.jar

Even with spark-defaults.conf and spark-env.sh largely removed, the running job shows only 265MB of memory for the executor. As far as I can tell, no settings are specified on the SparkConf object inside the jar. How can I get my executor memory up to be nice and big?

Thanks,
Alex
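In case it helps, a sketch of what I would try (paths and class name copied from the script above). The assumption here is that with --master local[*] the executor runs inside the driver JVM, so its heap comes from the driver's memory, which has to be set before that JVM starts, i.e. via the --driver-memory flag (or spark-defaults.conf) rather than via SparkConf in the application code:

    /Users/spark/spark/bin/spark-submit \
      --class jobs.MyJob \
      --master local[1] \
      --driver-memory 10g \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.driver.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true \
      target/scala-2.10/my-job.jar

Note that spark.executor.memory and spark.executor.extraJavaOptions are dropped in this sketch, since in local mode there is no separate executor JVM for them to apply to; the driver-side equivalents are used instead.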