Memory-efficient successive calls to repartition()

2015-08-20 Thread abellet
Hello,

For the needs of my application, I need to periodically "shuffle" the data
across the nodes/partitions of a reasonably large dataset. This is an
expensive operation, but I only need to do it every now and then. However, it
seems that I am doing something wrong, because as the iterations go on the
memory usage increases, causing the job to spill onto HDFS, which eventually
gets full. I am also getting some "Lost executor" errors that I don't get if
I don't repartition.

Here's a basic piece of code which reproduces the problem:

data = sc.textFile("ImageNet_gist_train.txt", 50).map(parseLine).cache()
data.count()

for i in range(1000):
    data = data.repartition(50).persist()
    # below, several operations are done on data


What am I doing wrong? I tried the following but it doesn't solve the issue:

for i in range(1000):
    data2 = data.repartition(50).persist()
    data2.count()      # materialize the new RDD
    data.unpersist()   # unpersist the previous version
    data = data2
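
I also wondered whether periodically checkpointing the reshuffled RDD would
help, so that its lineage (and the old shuffle data) can be cleaned up rather
than kept around. A rough sketch of that pattern, in Scala for illustration
(untested; the checkpoint directory, the every-10-iterations interval and the
parseLine function are placeholders):

---
// Rough, untested sketch: periodically checkpoint the reshuffled RDD to
// truncate its lineage. Directory and interval are placeholders.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

var data = sc.textFile("ImageNet_gist_train.txt", 50).map(parseLine).cache()
data.count()

for (i <- 1 to 1000) {
  val reshuffled = data.repartition(50).persist()
  if (i % 10 == 0) reshuffled.checkpoint()  // cut the lineage every few iterations
  reshuffled.count()                        // materialize before dropping the old copy
  data.unpersist()
  data = reshuffled
  // ... the actual operations on data go here ...
}
---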


Help and suggestions on this would be greatly appreciated! Thanks a lot!







Best way to randomly distribute elements

2015-06-18 Thread abellet
Hello,

In the context of a machine learning algorithm, I need to be able to
randomly distribute the elements of a large RDD across partitions (i.e.,
essentially assign each element to a random partition). How could I achieve
this? I have tried calling repartition() with the current number of
partitions, but it seems to me that this moves only some of the elements,
and in a deterministic way.
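
To make it concrete, the kind of thing I have in mind is to give every
element an independently drawn random key and then partition on that key,
along these lines (untested sketch in Scala; the RDD and the number of
partitions are placeholders):

---
// Untested sketch: assign each element a uniformly random partition id,
// then hash-partition on it, so every element lands in a random partition.
import scala.util.Random
import org.apache.spark.HashPartitioner

val numPartitions = 50  // placeholder
val redistributed = rdd                                // the large RDD in question
  .map(x => (Random.nextInt(numPartitions), x))        // random key per element
  .partitionBy(new HashPartitioner(numPartitions))     // shuffle on the random key
  .values
---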

I know this will be an expensive operation but I only need to perform it
every once in a while.

Thanks a lot!






Re: Best way to randomly distribute elements

2015-06-19 Thread abellet
Thanks a lot for the suggestions!
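
If I understand the randomSplit suggestion quoted below correctly, using it
would look something like this (untested; the 10 equal weights are just an
example, and I am not yet sure whether the union step actually reassigns
elements to new partitions or merely filters the existing ones):

---
// Untested: split the RDD into 10 equally weighted random subsets,
// then put them back together.
val parts = rdd.randomSplit(Array.fill(10)(0.1))  // rdd = the RDD to redistribute
val recombined = parts.reduce(_ union _)
---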

On 18/06/2015 15:02, Himanshu Mehra [via Apache Spark User List] wrote:
> Hi abellet,
>
> You can try RDD.randomSplit(weights), where the weights array gives the
> fraction of the data you want to put in each resulting split. For example,
> RDD.randomSplit(Array(0.7, 0.3)) will create two RDDs, one containing
> roughly 70% of the data and the other 30%, with the elements selected at
> random. RDD.randomSplit(Array(0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
> 0.1, 0.1)) will create 10 RDDs of randomly selected elements with equal
> weights.
>   Thank you
>
>
> Himanshu
>





Pairwise computations within partition

2015-04-09 Thread abellet
Hello everyone,

I am a Spark novice facing a nontrivial problem.

I have an RDD consisting of many elements (say, 60K), where each element is
a d-dimensional vector.

I want to implement an iterative algorithm which does the following. At each
iteration, I want to apply an operation on *pairs* of elements (say, compute
their dot product). Of course the number of pairs is huge, but I only need
to consider a small random subset of the possible pairs at each iteration.

To minimize communication between nodes, I am willing to partition my RDD by
key (where each element gets a random key) and to only consider pairs of
elements that belong to the same partition (i.e., that share the same key).
But I am not sure how to sample pairs and apply the operation to them, or how
to make sure that the computation for each pair is actually done by the node
holding the corresponding elements.
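
To make the question concrete, here is a rough sketch of the kind of thing I
am imagining (untested Scala; the names vectors, numPartitions and
pairsPerPartition are placeholders):

---
// Untested sketch: hash-partition on a random key, then sample and score
// pairs locally inside each partition with mapPartitions.
import scala.util.Random
import org.apache.spark.HashPartitioner

val numPartitions = 50        // placeholder
val pairsPerPartition = 1000  // placeholder

val keyed = vectors                                  // RDD[Array[Double]] of the 60K vectors
  .map(v => (Random.nextInt(numPartitions), v))      // random key per element
  .partitionBy(new HashPartitioner(numPartitions))

val dots = keyed.mapPartitions { it =>
  val local = it.map(_._2).toArray                   // vectors held by this partition
  if (local.isEmpty) Iterator.empty
  else Iterator.fill(pairsPerPartition) {
    val a = local(Random.nextInt(local.length))      // pick a random pair locally
    val b = local(Random.nextInt(local.length))
    a.zip(b).map { case (x, y) => x * y }.sum        // their dot product
  }
}
---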

Any help would be greatly appreciated. Thanks a lot!






Random pairs / RDD order

2015-04-16 Thread abellet
Hi everyone,

I have a large RDD and I am trying to create an RDD of a random sample of
pairs of elements from this RDD. For efficiency, the elements composing a
pair should come from the same partition. The idea I've come up with is to
take two random samples and then use zipPartitions to pair each i-th element
of the first sample with the i-th element of the second sample. Here is some
sample code illustrating the idea:

---
val rdd = sc.parallelize(1 to 6, 16)

val sample1 = rdd.sample(true,0.01,42)
val sample2 = rdd.sample(true,0.01,43)

def myfunc(s1: Iterator[Int], s2: Iterator[Int]): Iterator[String] =
{
  var res = List[String]()
  while (s1.hasNext && s2.hasNext)
  {
val x = s1.next + " " + s2.next
res ::= x
  }
  res.iterator
}

val pairs = sample1.zipPartitions(sample2)(myfunc)
---

However, I am not happy with this solution, because each element is most
likely to be paired with elements that are "nearby" within the partition.
This is because sample preserves the order of the elements within each
partition.

Any idea how to fix this? So far I have not found an efficient way to shuffle
the random sample.
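
One workaround I have been considering is to shuffle each sample locally
within its partitions before zipping them, along these lines (untested; note
that this materializes each sampled partition in memory):

---
// Untested: break the "nearby" pairing by shuffling each sampled partition
// locally before zipping them together.
import scala.util.Random

val shuffled1 = sample1.mapPartitions(it => Random.shuffle(it.toVector).iterator)
val shuffled2 = sample2.mapPartitions(it => Random.shuffle(it.toVector).iterator)
val shuffledPairs = shuffled1.zipPartitions(shuffled2)(myfunc)
---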

Thanks a lot!


