Re: RDD Partition number

2015-02-20 Thread Alessandro Lulli
Hi All, Thanks for your answers. I have one more detail to point out. It is clear now how the partition number is defined for an HDFS file. However, suppose I have my dataset replicated on all the machines at the same absolute path. In this case each machine has, for instance, an ext3 filesystem. If I load

Re: RDD Partition number

2015-02-19 Thread Ted Yu
What file system are you using? If you use HDFS, the documentation you cited is pretty clear on how partitions are determined. bq. file X replicated on 4 machines I don't think the replication factor plays a role w.r.t. partitions. On Thu, Feb 19, 2015 at 8:05 AM, Alessandro Lulli

Re: RDD Partition number

2015-02-19 Thread Ilya Ganelin
By default you will have (fileSize in MB / 64) partitions. You can also set the number of partitions when you read in a file with sc.textFile by passing it as an optional second parameter. On Thu, Feb 19, 2015 at 8:07 AM Alessandro Lulli lu...@di.unipi.it wrote: Hi All, Could you please help me
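To make the arithmetic above concrete, here is a rough sketch in plain Python (no Spark required) of how the split count for a single file might be estimated. It assumes Spark delegates to Hadoop's old-API FileInputFormat, where the split size is max(minSize, min(totalSize / minPartitions, blockSize)) and the last split may run up to 10% over. The function name `estimate_splits` and its parameters are hypothetical, made up for illustration; this is not Spark or Hadoop code, and the numbers on a real cluster may differ.

```python
MB = 1024 * 1024

# Assumption: Hadoop's FileInputFormat allows the final split to be
# up to 10% larger than the split size before cutting another split.
SPLIT_SLOP = 1.1


def estimate_splits(file_size, block_size=64 * MB, min_partitions=2,
                    min_split_size=1):
    """Estimate the number of input splits for one non-empty file.

    All sizes are in bytes. `min_partitions` plays the role of the
    optional second argument to sc.textFile; 2 is used here as an
    assumed default.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")

    # Target size if the file were cut into exactly min_partitions pieces.
    goal_size = file_size // max(min_partitions, 1)
    # Never exceed one block per split, never go below the minimum size.
    split_size = max(min_split_size, min(goal_size, block_size))

    count = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        count += 1
        remaining -= split_size
    if remaining > 0:
        count += 1  # the tail, possibly slightly larger than split_size
    return count


if __name__ == "__main__":
    print(estimate_splits(1024 * MB))                       # 16
    print(estimate_splits(1024 * MB, block_size=128 * MB))  # 8
    print(estimate_splits(1024 * MB, min_partitions=32))    # 32
```

Under these assumptions a 1 GB file gives 16 partitions with 64 MB blocks, which matches the fileSize / 64 rule of thumb, and asking for more partitions than there are blocks makes the splits smaller than a block. Note the second parameter acts as a minimum here: asking for fewer partitions than there are blocks does not reduce the count.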

Re: RDD Partition number

2015-02-19 Thread Ted Yu
bq. *blocks being 64MB by default in HDFS* *In Hadoop 2.1+, the default block size has been increased.* See https://issues.apache.org/jira/browse/HDFS-4053 Cheers On Thu, Feb 19, 2015 at 8:32 AM, Ted Yu yuzhih...@gmail.com wrote: What file system are you using? If you use HDFS, the

RE: RDD Partition number

2015-02-19 Thread Ganelin, Ilya
@spark.apache.org Cc: Massimiliano Bertolucci Subject: Re: RDD Partition number By default you will have (fileSize in MB / 64) partitions. You can also set the number of partitions when you read in a file with sc.textFile by passing it as an optional second parameter. On Thu, Feb 19, 2015 at 8:07 AM Alessandro Lulli lu