Hi all,
I'm trying to run some spark job with spark-shell. What I want to do is
just to count the number of lines in a file.
I start the spark-shell with the default argument i.e just with
./bin/spark-shell.
Load the text file with sc.textFile(path) and then call count on my data.
When I look at the number of partitions of my RDD, I get 52, even though I never asked for any particular number. Where does this number come from?
How big is your file? It's probably of a size for which the Hadoop
InputFormat would make 52 splits. The data drives the number of
partitions, not your processing resources. Really, 8 splits is the
minimum parallelism you want; several times your number of cores is better.
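To make this concrete, here is a hedged spark-shell sketch (the path and the partition counts are illustrative, not from the original thread; this assumes a running spark-shell, where `sc` is the SparkContext the shell provides):

```scala
// Hypothetical path; substitute your own file.
val data = sc.textFile("/path/to/myfile.txt")
data.partitions.size   // driven by the InputFormat's splits, e.g. 52 for a ~1.7G file

// Ask for more parallelism up front via textFile's minPartitions argument...
val wider = sc.textFile("/path/to/myfile.txt", 64)

// ...or repartition an existing RDD to several times your core count.
val repartitioned = data.repartition(8 * 4)  // e.g. 8 cores x 4

repartitioned.count()
```

Note that `minPartitions` is only a lower bound hint to the InputFormat; the actual split count can still be higher for a large file.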
On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa wrote:
Ok, I misunderstood the meaning of a partition. In fact, my file is 1.7G,
and with smaller files I get a different number of partitions. Thanks
for the clarification.
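The observed 52 partitions are consistent with a split size of roughly 32 MB, which is a common default for local files (HDFS typically uses a larger block size, e.g. 128 MB). A back-of-the-envelope check, with the split size as an assumption rather than something stated in the thread:

```scala
// Estimate how many input splits a 1.7 GB file produces at a given split size.
// The real split size depends on the InputFormat and filesystem block size.
object SplitEstimate {
  def numSplits(fileBytes: Long, splitBytes: Long): Long =
    (fileBytes + splitBytes - 1) / splitBytes  // ceiling division

  def main(args: Array[String]): Unit = {
    val fileSize = (1.7 * 1024 * 1024 * 1024).toLong  // ~1.7 GB
    val split32  = 32L * 1024 * 1024                  // 32 MB
    println(numSplits(fileSize, split32))             // prints 55
  }
}
```

A nominal 1.7 GB at 32 MB per split gives about 55 splits; the observed 52 simply suggests the file is a bit under 1.7 GB.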
On Fri, Dec 5, 2014 at 4:15 PM, Sean Owen so...@cloudera.com wrote: