I was able to solve this by myself. What I did was change the way Spark
computes the partitioning for binary files.
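The actual change isn't shown in this thread, but the symptom (a handful of huge partitions out of binaryFiles, with repartition unable to help before materialization) comes down to how combine-style input formats size their splits. Below is a toy model of that arithmetic, purely an illustrative assumption rather than Spark's exact code: files are packed greedily into splits of at most ceil(totalBytes / minPartitions) bytes, so raising the minPartitions hint yields more, smaller splits. In real PySpark the corresponding knob is the minPartitions argument of sc.binaryFiles(path, minPartitions=...).

```python
import math

def num_partitions(file_sizes, min_partitions):
    """Toy model (an assumption, not Spark's actual implementation) of
    combine-style splitting: pack files greedily into splits of at most
    ceil(total / min_partitions) bytes and count the resulting splits."""
    total = sum(file_sizes)
    max_split = math.ceil(total / min_partitions)
    splits, current = 0, 0
    for size in file_sizes:
        # Close the current split once adding this file would overflow it.
        if current and current + size > max_split:
            splits += 1
            current = 0
        current += size
    return splits + (1 if current else 0)

# 1000 binary files of 1 MB each:
sizes = [1_000_000] * 1000
print(num_partitions(sizes, 3))   # -> 4  (a few very large partitions)
print(num_partitions(sizes, 64))  # -> 67 (raising the hint shrinks splits)
```

Under this model, a low default minPartitions explains why every downstream operation has to grind through a few enormous partitions; asking for more partitions at load time avoids materializing the oversized RDD first.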
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/A-Spark-creates-3-partitions-What-can-I-do-tp25140p25170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
> [...] that involves this RDD, Spark spawns an RDD with more than 3
> partitions. And this takes ages to process these partitions even if you
> simply run "count".
>
> Performing a "repartition" directly after loading does not help, because
> Spark seems to insist on materializing the RDD created by binaryFiles first.
>
> How can I get around this?
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/A-Spark-creates-3-partitions-What-can-I-do-tp25140.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.