Why is my default partition size set to 52?
Hi all,

I'm trying to run a Spark job with spark-shell. All I want to do is count the number of lines in a file. I start spark-shell with the default arguments, i.e. just ./bin/spark-shell, load the text file with sc.textFile(path), and then call count on my data. When I do this, my data is always split into 52 partitions. I don't understand why, since I'm running on a local machine with 8 cores and sc.defaultParallelism gives me 8. Even if I load the file with sc.textFile(path, 8), I still get data.partitions.size = 52. I'm using Spark 1.1.1.

Any ideas?

Cheers,
Jao
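For reference, a minimal spark-shell session reproducing the behaviour described above; the file path is a placeholder:

    // Launched with the defaults: ./bin/spark-shell
    val data = sc.textFile("/path/to/file.txt")  // placeholder path
    data.count()            // the line count itself works fine
    data.partitions.size    // reports 52 in this setup
    sc.defaultParallelism   // reports 8 (8-core local machine)

    // The second argument to textFile is only a *minimum* number of
    // partitions, so asking for 8 does not reduce the split count:
    val data8 = sc.textFile("/path/to/file.txt", 8)
    data8.partitions.size   // still 52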
Re: Why is my default partition size set to 52?
How big is your file? It's probably of a size for which the Hadoop InputFormat makes 52 splits. Data drives partitions, not processing resources. Really, 8 splits is the minimum parallelism you want; several times your number of cores is better.

On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm trying to run a Spark job with spark-shell. What I want to do is just count the number of lines in a file. [...]
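To make the "data drives partitions" point concrete, a back-of-the-envelope sketch. It assumes a 32 MB split size (Hadoop's historical default block size for the local filesystem); the actual split size depends on the Hadoop configuration in use:

    // splits is roughly ceil(fileSize / splitSize)
    val fileSizeBytes  = 1.7e9.toLong        // ~1.7 GB, the size reported later in the thread
    val splitSizeBytes = 32L * 1024 * 1024   // assumed 32 MB split size
    val splits = math.ceil(fileSizeBytes.toDouble / splitSizeBytes).toInt
    // splits comes out around 51 -- the same ballpark as the 52 partitions observed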
Re: Why is my default partition size set to 52?
Ah, OK, I misunderstood what a partition means. My file is in fact 1.7 GB, and with smaller files I get a different number of partitions. Thanks for the clarification.

On Fri, Dec 5, 2014 at 4:15 PM, Sean Owen so...@cloudera.com wrote: How big is your file? It's probably of a size for which the Hadoop InputFormat makes 52 splits. [...]
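If a fixed partition count is needed regardless of how the input is split, the RDD can be re-partitioned after loading. A minimal sketch (the path is again a placeholder; both methods are available in Spark 1.1):

    val data = sc.textFile("/path/to/file.txt")  // placeholder path

    // Merge down to 8 partitions without a shuffle:
    val merged = data.coalesce(8)
    merged.partitions.size   // 8

    // Or force an even redistribution of the data with a full shuffle:
    val reshuffled = data.repartition(8)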