Why is my default partition size set to 52?

2014-12-05 Thread Jaonary Rabarisoa
Hi all,

I'm trying to run a Spark job with spark-shell. All I want to do is
count the number of lines in a file.
I start spark-shell with the default arguments, i.e. just with
./bin/spark-shell.

I load the text file with sc.textFile(path) and then call count on my data.

When I do this, my data is always split into 52 partitions. I don't
understand why, since I run it on a local machine with 8 cores and
sc.defaultParallelism gives me 8.

Even if I load the file with sc.textFile(path, 8), I still get
data.partitions.size = 52.

I use Spark 1.1.1.
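
For reference, a minimal spark-shell session reproducing what I describe
above (the path is just a placeholder for my file):

    // Spark 1.1.1, spark-shell already provides sc (SparkContext)
    val data = sc.textFile("/path/to/myfile.txt")   // placeholder path

    sc.defaultParallelism     // 8 on my 8-core local machine
    data.count()              // counts the lines
    data.partitions.size      // 52

    // passing a partition hint does not change the result
    val data8 = sc.textFile("/path/to/myfile.txt", 8)
    data8.partitions.size     // still 52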


Any ideas?



Cheers,

Jao


Re: Why is my default partition size set to 52?

2014-12-05 Thread Sean Owen
How big is your file? It's probably of a size for which the Hadoop
InputFormat would make 52 splits. Data drives partitions, not
processing resources. Really, 8 splits is the minimum parallelism you
want; several times your number of cores is better.
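
As a rough sketch (the path is illustrative, and coalesce/repartition are
standard RDD operations not otherwise mentioned in this thread): the second
argument to sc.textFile is only a minimum number of partitions, so the
InputFormat's split count wins. If you really need fewer partitions
afterwards, merge them explicitly:

    // minPartitions is a lower bound; the InputFormat still decides the splits
    val data = sc.textFile("/path/to/myfile.txt", 8)
    data.partitions.size          // e.g. 52 for a file spanning ~52 block-sized splits

    // to reduce the partition count after loading:
    val merged = data.coalesce(8)     // merges partitions without a shuffle
    merged.partitions.size            // 8
    // data.repartition(8) does the same but with a full shuffle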




Re: Why is my default partition size set to 52?

2014-12-05 Thread Jaonary Rabarisoa
OK, I misunderstood the meaning of a partition. In fact, my file is 1.7 GB,
and with a smaller file I get a different number of partitions. Thanks
for the clarification.
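
As a back-of-the-envelope check (this assumes Hadoop's usual ~32 MB default
block size for the local filesystem, an assumption not confirmed anywhere in
the thread), the reported file size is consistent with the observed count:

    // rough split-count estimate for a ~1.7 GB local file
    val fileBytes  = 1.7 * 1024 * 1024 * 1024      // reported file size
    val splitBytes = 32 * 1024 * 1024              // assumed local block/split size
    fileBytes / splitBytes                          // ~54, same order as the 52 partitions seen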
