I am facing a very tricky issue here. I have a treeReduce task. The
reduce function returns a very large object, in fact a Map[Int,
Array[Double]]. Each reduce task inserts and/or updates values in the map
or updates the arrays. My problem is that this Map can become very large.
Currently,
that guy has cores waiting for work). Am I hallucinating, or is that really
happening? Is there any way I can prevent this from happening?
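The kind of reduce function described above can be sketched as follows. This is a minimal, hypothetical example, not the poster's actual code: the name `mergeMaps`, the element-wise sum on key collisions, and the assumption of equal-length arrays per key are all illustrative choices. Only the pure merge function is shown; the `treeReduce` call is indicated in a comment, since it needs a live SparkContext.

```scala
// Hypothetical merge of two partial results of type Map[Int, Array[Double]]:
// on a key collision the arrays are combined element-wise (here: summed).
// Assumes arrays under the same key have equal length.
def mergeMaps(a: Map[Int, Array[Double]],
              b: Map[Int, Array[Double]]): Map[Int, Array[Double]] = {
  val merged = scala.collection.mutable.Map[Int, Array[Double]]() ++= a
  for ((k, arr) <- b) {
    merged.get(k) match {
      case Some(existing) =>
        // clone before updating so neither input map's arrays are mutated
        val sum = existing.clone()
        var i = 0
        while (i < sum.length) { sum(i) += arr(i); i += 1 }
        merged(k) = sum
      case None =>
        merged(k) = arr
    }
  }
  merged.toMap
}

// With a live SparkContext this would be plugged in as, e.g.:
//   rdd.map(recordToMap).treeReduce(mergeMaps)
```

Note that because `treeReduce` performs partial aggregation on the executors, every executor holding an intermediate result materializes one of these maps, which is why a large map hurts.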
Greetings,
T3L
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Prevent-partitions-from-moving-tp25216.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I was able to solve this myself. What I did was change the way Spark
computes the partitioning for binary files.
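The poster does not say exactly what was changed. For context, a hedged sketch of the mechanism involved: in Spark's source, the binary-file input format derives a maximum split size from the total input length divided by the requested minimum number of partitions (this mirrors `StreamFileInputFormat.setMinPartitions`; treat the exact formula as an assumption about the version in question), and `sc.binaryFiles` takes a `minPartitions` hint.

```scala
// Assumed split-size rule for binary files (sketch of what Spark's
// StreamFileInputFormat.setMinPartitions computes): the max split size
// is the total byte length divided by minPartitions, rounded up.
def maxSplitBytes(totalBytes: Long, minPartitions: Int): Long =
  math.max(1L, math.ceil(totalBytes.toDouble / math.max(1, minPartitions)).toLong)

// With a live SparkContext, the public knob is the minPartitions hint:
//   val rdd = sc.binaryFiles("hdfs:///path_to_directory", minPartitions = 5)
```

A smaller split size yields more (and smaller) partitions, so requesting a `minPartitions` at least equal to the number of files is one way to nudge Spark toward one partition per file.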
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/A-Spark-creates-3-partitions-What-can-I-do-tp25140p25170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I have a dataset consisting of 5 binary files (each between 500 kB and
2 MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
cluster are also the workers for Spark. I open the files as an RDD using
sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action that
If I have a cluster with 7 nodes, each having an equal number of cores, and
create an RDD with sc.parallelize(), it looks as if Spark always tries to
distribute the partitions evenly across the nodes.
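What `parallelize` itself determines is only how the data is sliced into partitions; which node each partition runs on is decided later by the scheduler. The slicing can be sketched as below. The boundary math mirrors Spark's `ParallelCollectionRDD` slice positions; the function name and exact rounding are assumptions about the implementation.

```scala
// Boundary math for slicing a sequence of `length` items into
// `numSlices` partitions: slice i covers [i*length/n, (i+1)*length/n),
// so slice sizes differ by at most one element.
def positions(length: Long, numSlices: Int): Seq[(Int, Int)] =
  (0 until numSlices).map { i =>
    (((i * length) / numSlices).toInt, (((i + 1) * length) / numSlices).toInt)
  }
```

So `sc.parallelize(1 to 1000, 14)` on a 7-node cluster produces 14 near-equal slices; spreading those tasks over the 7 nodes is the scheduler's doing (in standalone mode, `spark.deploy.spreadOut` defaults to true), not a property of `parallelize` itself.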
Question:
(1) Is that something I can rely on?
(2) Can I rely on sc.parallelize() to assign partitions to as