Re: mapPartitions - How Does it Work

2015-03-18 Thread Alex Turner (TMS)
List(x.next).iterator is giving you the first element from each partition,
which would be 1, 4 and 7 respectively.
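
A quick way to see this in the spark-shell (a sketch; the partition contents shown in the comments are what sc.parallelize typically produces for this input, and glom() lets you check them yourself):

val parallel = sc.parallelize(1 to 10, 3)

// Inspect what each partition holds; with 1 to 10 split across 3 partitions
// this comes out as Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10).
parallel.glom().collect()

// x.next consumes only the first element of each partition's iterator,
// so only the heads 1, 4 and 7 survive.
parallel.mapPartitions(x => List(x.next).iterator).collect()
// Array[Int] = Array(1, 4, 7)

// To keep every element, return (a transformation of) the whole iterator instead.
parallel.mapPartitions(x => x.map(_ * 10)).collect()
// Array[Int] = Array(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)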

On 3/18/15, 10:19 AM, ashish.usoni ashish.us...@gmail.com wrote:

I am trying to understand mapPartitions, but I am still not sure how it works.

In the example below it creates three partitions:

val parallel = sc.parallelize(1 to 10, 3)

and when we do the following:

parallel.mapPartitions(x => List(x.next).iterator).collect

it prints the value:

Array[Int] = Array(1, 4, 7)

Can someone please explain why it prints only 1, 4 and 7.

Thanks,







RDD pair to pair of RDDs

2015-03-18 Thread Alex Turner (TMS)
What's the best way to go from:

RDD[(A, B)] to (RDD[A], RDD[B])

If I do:

def separate[A: ClassTag, B: ClassTag](k: RDD[(A, B)]) = (k.map(_._1), k.map(_._2))

This is the obvious solution, but it runs two maps on the cluster.  Can I do 
some kind of fold instead:

def separate[A, B](l: List[(A, B)]) = 
  l.foldLeft((List[A](), List[B]()))((a, b) => (b._1 :: a._1, b._2 :: a._2))

But this obviously has an aggregate component that I don't want running on the 
driver, right?
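
For comparison, a sketch of the two-map version with the input cached so the second pass does not recompute its lineage (the ClassTag bounds are needed for map on an RDD with abstract element types):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Still two passes, but over a cached input; both maps run on the executors,
// not the driver.
def separate[A: ClassTag, B: ClassTag](pairs: RDD[(A, B)]): (RDD[A], RDD[B]) = {
  val cached = pairs.cache()
  (cached.map(_._1), cached.map(_._2))
}

Since each Spark transformation produces a single RDD, splitting one RDD into two inherently takes two passes (or one pass into a cached/persisted intermediate like the above).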


Thanks,

Alex


Memory Settings for local execution context

2015-03-17 Thread Alex Turner (TMS)
So the page that talks about settings, 
http://spark.apache.org/docs/1.2.1/configuration.html, seems not to apply when 
running local contexts.  I have a shell script that starts my job:


export SPARK_MASTER_OPTS="-Dsun.io.serialization.extendedDebugInfo=true"
export SPARK_WORKER_OPTS="-Dsun.io.serialization.extendedDebugInfo=true"

/Users/spark/spark/bin/spark-submit \
  --class jobs.MyJob \
  --master local[1] \
  --conf spark.executor.memory=8g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.driver.memory=10g \
  --conf "spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true" \
  target/scala-2.10/my-job.jar


And when I largely remove spark-defaults.conf and spark-env.sh, I get a running 
job whose executor has only 265 MB of memory!  As far as I can tell, no setting 
is specified for the SparkConf object inside the jar.


How can I get my executor memory up to be nice and big?
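
One thing that may be worth checking (a sketch, not verified against this setup): with --master local[1] the executor runs inside the driver JVM, and that JVM's heap is fixed before a SparkConf or --conf setting of spark.driver.memory is read, so the documented way to raise it is the --driver-memory flag on spark-submit (or spark.driver.memory in spark-defaults.conf), for example:

# In local mode the executor shares the driver JVM, so driver-side settings
# (--driver-memory, --driver-java-options) are the ones that take effect.
/Users/spark/spark/bin/spark-submit \
  --class jobs.MyJob \
  --master local[1] \
  --driver-memory 10g \
  --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  target/scala-2.10/my-job.jar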


Thanks,


Alex