(SOLVED) Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-22 Thread t3l
I was able to solve this myself. What I did was change the way Spark
computes the partitioning for binary files.
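
t3l did not share the actual code, so the following is an illustrative
sketch only, assuming Spark 1.3+, where sc.binaryFiles accepts a
minPartitions hint. The underlying CombineFileInputFormat derives its
maximum split size from that hint (roughly totalBytes / minPartitions),
so many small files get packed into each partition instead of one
partition per file:

    // Hypothetical sketch, not t3l's confirmed fix:
    val rdd = sc.binaryFiles("hdfs:///path_to_directory",
                             minPartitions = 2 * sc.defaultParallelism)
    println(rdd.partitions.length)  // far fewer partitions than files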






Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-21 Thread Ranadip Chatterjee
T3l,

Did Sean Owen's suggestion help? If not, can you please share the behaviour you are seeing?

Cheers.
On 20 Oct 2015 11:02 pm, "Lan Jiang" <ljia...@gmail.com> wrote:

> I think the data file is binary per the original post, so in this case
> sc.binaryFiles should be used. However, I still recommend against using so
> many small binary files, because:
>
> 1. They are not good for batch I/O.
> 2. They put too much memory pressure on the namenode.
>
> Lan
>
>
> On Oct 20, 2015, at 11:20 AM, Deenar Toraskar <deenar.toras...@gmail.com>
> wrote:
>
> also check out wholeTextFiles
>
>
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int)
>
> On 20 October 2015 at 15:04, Lan Jiang <ljia...@gmail.com> wrote:
>
>> As Francois pointed out, you are encountering the classic small-file
>> anti-pattern. One solution I used in the past is to wrap all these small
>> binary files into a sequence file or an avro file. For example, the avro
>> schema can have two fields: filename: string and binaryname: byte[]. The
>> resulting file is splittable and will not create so many partitions.
>>
>> Lan
>>
>>
>> On Oct 20, 2015, at 8:03 AM, François Pelletier <
>> newslett...@francoispelletier.org> wrote:
>>
>> You should aggregate your files into larger chunks before doing anything
>> else. HDFS is not fit for small files: they bloat the namenode and cause a
>> lot of performance issues. Target partitions of a few hundred MB, save
>> those files back to HDFS, and then delete the original ones. You can read
>> the data, coalesce, and then saveAsXXX on the result.
>>
>> I had the same kind of problem once and solved it by bunching hundreds of
>> files together into larger ones. I used text files with bzip2 compression.
>>
>>
>>
>> On 2015-10-20 08:42, Sean Owen wrote:
>>
>> Coalesce without a shuffle? It shouldn't be an action. It just treats
>> many partitions as one.
>>
>> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t...@threelights.de> wrote:
>>
>>>
>>> I have a dataset consisting of 5 binary files (each between 500 kB and
>>> 2 MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
>>> cluster are also the workers for Spark. I open the files as an RDD using
>>> sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action
>>> that involves this RDD, Spark spawns an RDD with more than 30000
>>> partitions, and it takes ages to process these partitions even if you
>>> simply run "count". Performing a "repartition" directly after loading
>>> does not help, because Spark seems to insist on materializing the RDD
>>> created by binaryFiles first.
>>>
>>> How can I get around this?
>>>
>>>
>>>
>>
>>
>>
>
>


Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
As Francois pointed out, you are encountering the classic small-file
anti-pattern. One solution I used in the past is to wrap all these small
binary files into a sequence file or an avro file. For example, the avro
schema can have two fields: filename: string and binaryname: byte[]. The
resulting file is splittable and will not create so many partitions.
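
A minimal sketch of the SequenceFile variant of that idea (the paths are
illustrative, and each file's bytes must fit in memory on an executor):

    // Pack (filename -> bytes) pairs from the small files into one
    // splittable SequenceFile; later jobs read the packed copy instead.
    val packed = sc.binaryFiles("hdfs:///path_to_directory")
      .mapValues(stream => stream.toArray())  // PortableDataStream -> Array[Byte]
    packed.saveAsSequenceFile("hdfs:///packed")  // hypothetical output path

    // Splits of the packed file follow HDFS block boundaries,
    // not the original file count.
    val reloaded = sc.sequenceFile[String, Array[Byte]]("hdfs:///packed")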

Lan


> On Oct 20, 2015, at 8:03 AM, François Pelletier 
> <newslett...@francoispelletier.org> wrote:
> 
> You should aggregate your files into larger chunks before doing anything
> else. HDFS is not fit for small files: they bloat the namenode and cause a
> lot of performance issues. Target partitions of a few hundred MB, save
> those files back to HDFS, and then delete the original ones. You can read
> the data, coalesce, and then saveAsXXX on the result.
> 
> I had the same kind of problem once and solved it by bunching hundreds of
> files together into larger ones. I used text files with bzip2 compression.
> 
> 
> 
> On 2015-10-20 08:42, Sean Owen wrote:
>> Coalesce without a shuffle? It shouldn't be an action. It just treats many
>> partitions as one.
>> 
>> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t...@threelights.de> wrote:
>> 
>> I have a dataset consisting of 5 binary files (each between 500 kB and
>> 2 MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
>> cluster are also the workers for Spark. I open the files as an RDD using
>> sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action
>> that involves this RDD, Spark spawns an RDD with more than 30000
>> partitions, and it takes ages to process these partitions even if you
>> simply run "count". Performing a "repartition" directly after loading does
>> not help, because Spark seems to insist on materializing the RDD created
>> by binaryFiles first.
>> 
>> How can I get around this?
>> 
>> 
>> 
> 



Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Sean Owen
Coalesce without a shuffle? It shouldn't be an action. It just treats many
partitions as one.
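
For illustration, given the RDD from the original post:

    // coalesce with shuffle = false (the default) is a narrow, lazy
    // transformation: it only groups existing partitions together.
    val rdd = sc.binaryFiles("hdfs:///path_to_directory")
    val fewer = rdd.coalesce(64)                   // lazy, no shuffle
    // shuffle = true behaves like repartition: still lazy, but the
    // first action then pays for a full shuffle.
    val reshuffled = rdd.coalesce(64, shuffle = true)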

On Tue, Oct 20, 2015 at 1:00 PM, t3l <t...@threelights.de> wrote:

>
> I have a dataset consisting of 5 binary files (each between 500 kB and
> 2 MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
> cluster are also the workers for Spark. I open the files as an RDD using
> sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action
> that involves this RDD, Spark spawns an RDD with more than 30000
> partitions, and it takes ages to process these partitions even if you
> simply run "count". Performing a "repartition" directly after loading does
> not help, because Spark seems to insist on materializing the RDD created
> by binaryFiles first.
>
> How can I get around this?
>
>
>
>
>


Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread t3l

I have a dataset consisting of 5 binary files (each between 500 kB and
2 MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
cluster are also the workers for Spark. I open the files as an RDD using
sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action
that involves this RDD, Spark spawns an RDD with more than 30000
partitions, and it takes ages to process these partitions even if you
simply run "count". Performing a "repartition" directly after loading does
not help, because Spark seems to insist on materializing the RDD created by
binaryFiles first.

How can I get around this?






Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Deenar Toraskar
Also check out wholeTextFiles:

https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int)
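
For example (only appropriate when the files really are text; the path and
partition hint are illustrative):

    // wholeTextFiles returns (filename, contents) pairs and, like
    // binaryFiles, accepts a minPartitions hint.
    val pairs = sc.wholeTextFiles("hdfs:///path_to_directory", minPartitions = 32)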

On 20 October 2015 at 15:04, Lan Jiang <ljia...@gmail.com> wrote:

> As Francois pointed out, you are encountering the classic small-file
> anti-pattern. One solution I used in the past is to wrap all these small
> binary files into a sequence file or an avro file. For example, the avro
> schema can have two fields: filename: string and binaryname: byte[]. The
> resulting file is splittable and will not create so many partitions.
>
> Lan
>
>
> On Oct 20, 2015, at 8:03 AM, François Pelletier <
> newslett...@francoispelletier.org> wrote:
>
> You should aggregate your files into larger chunks before doing anything
> else. HDFS is not fit for small files: they bloat the namenode and cause a
> lot of performance issues. Target partitions of a few hundred MB, save
> those files back to HDFS, and then delete the original ones. You can read
> the data, coalesce, and then saveAsXXX on the result.
>
> I had the same kind of problem once and solved it by bunching hundreds of
> files together into larger ones. I used text files with bzip2 compression.
>
>
>
> On 2015-10-20 08:42, Sean Owen wrote:
>
> Coalesce without a shuffle? It shouldn't be an action. It just treats many
> partitions as one.
>
> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t...@threelights.de> wrote:
>
>>
>> I have a dataset consisting of 5 binary files (each between 500 kB and
>> 2 MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
>> cluster are also the workers for Spark. I open the files as an RDD using
>> sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action
>> that involves this RDD, Spark spawns an RDD with more than 30000
>> partitions, and it takes ages to process these partitions even if you
>> simply run "count". Performing a "repartition" directly after loading does
>> not help, because Spark seems to insist on materializing the RDD created
>> by binaryFiles first.
>>
>> How can I get around this?
>>
>>
>>
>>
>>
>
>
>


Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
I think the data file is binary per the original post, so in this case
sc.binaryFiles should be used (see the sketch below). However, I still
recommend against using so many small binary files, because:

1. They are not good for batch I/O.
2. They put too much memory pressure on the namenode.
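
A minimal binaryFiles sketch (the path and the per-file work are
illustrative):

    // binaryFiles yields (path, PortableDataStream) pairs; each stream is
    // opened lazily on an executor only when it is consumed.
    val files = sc.binaryFiles("hdfs:///path_to_directory")
    val bytesPerFile = files.mapValues(_.toArray().length)  // size of each file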

Lan


> On Oct 20, 2015, at 11:20 AM, Deenar Toraskar <deenar.toras...@gmail.com> 
> wrote:
> 
> also check out wholeTextFiles
> 
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int)
> 
> On 20 October 2015 at 15:04, Lan Jiang <ljia...@gmail.com> wrote:
> As Francois pointed out, you are encountering the classic small-file
> anti-pattern. One solution I used in the past is to wrap all these small
> binary files into a sequence file or an avro file. For example, the avro
> schema can have two fields: filename: string and binaryname: byte[]. The
> resulting file is splittable and will not create so many partitions.
> 
> Lan
> 
> 
>> On Oct 20, 2015, at 8:03 AM, François Pelletier
>> <newslett...@francoispelletier.org> wrote:
>> 
>> You should aggregate your files into larger chunks before doing anything
>> else. HDFS is not fit for small files: they bloat the namenode and cause a
>> lot of performance issues. Target partitions of a few hundred MB, save
>> those files back to HDFS, and then delete the original ones. You can read
>> the data, coalesce, and then saveAsXXX on the result.
>> 
>> I had the same kind of problem once and solved it by bunching hundreds of
>> files together into larger ones. I used text files with bzip2 compression.
>> 
>> 
>> 
>> On 2015-10-20 08:42, Sean Owen wrote:
>>> Coalesce without a shuffle? It shouldn't be an action. It just treats many
>>> partitions as one.
>>> 
>>> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t...@threelights.de> wrote:
>>> 
>>> I have a dataset consisting of 5 binary files (each between 500 kB and
>>> 2 MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
>>> cluster are also the workers for Spark. I open the files as an RDD using
>>> sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action
>>> that involves this RDD, Spark spawns an RDD with more than 30000
>>> partitions, and it takes ages to process these partitions even if you
>>> simply run "count". Performing a "repartition" directly after loading
>>> does not help, because Spark seems to insist on materializing the RDD
>>> created by binaryFiles first.
>>>
>>> How can I get around this?
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 



Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread François Pelletier
You should aggregate your files into larger chunks before doing anything
else. HDFS is not fit for small files: they bloat the namenode and cause a
lot of performance issues. Target partitions of a few hundred MB, save
those files back to HDFS, and then delete the original ones. You can read
the data, coalesce, and then saveAsXXX on the result.

I had the same kind of problem once and solved it by bunching hundreds of
files together into larger ones. I used text files with bzip2 compression.
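
A one-off compaction job along those lines might look like this (the paths
and output count are illustrative; bzip2 keeps the output splittable):

    import org.apache.hadoop.io.compress.BZip2Codec

    val raw = sc.textFile("hdfs:///small_files_dir")  // the many small text files
    // Pick the count so each compacted file lands at a few hundred MB.
    raw.coalesce(16)
       .saveAsTextFile("hdfs:///compacted_dir", classOf[BZip2Codec])
    // After verifying the compacted copy, delete the originals.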



On 2015-10-20 08:42, Sean Owen wrote:
> Coalesce without a shuffle? It shouldn't be an action. It just treats
> many partitions as one.
>
> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t...@threelights.de> wrote:
>
>
> I have a dataset consisting of 5 binary files (each between 500 kB and
> 2 MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
> cluster are also the workers for Spark. I open the files as an RDD using
> sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action
> that involves this RDD, Spark spawns an RDD with more than 30000
> partitions, and it takes ages to process these partitions even if you
> simply run "count". Performing a "repartition" directly after loading does
> not help, because Spark seems to insist on materializing the RDD created
> by binaryFiles first.
>
> How can I get around this?
>
>
>
>
>