Hi All, I am running the Wikipedia parsing example from the "Advanced Analytics with Spark" book.
https://github.com/sryza/aas/blob/d3f62ef3ed43a59140f4ae8afbe2ef81fc643ef2/ch06-lsa/src/main/scala/com/cloudera/datascience/lsa/ParseWikipedia.scala#l112

The partitions of the RDD returned by the readFile function (linked above) are about 32 MB each, so for a 100 MB file the RDD is created with 4 partitions. I am running this on a standalone Spark cluster and everything works fine; I am just a little confused about the number of partitions and their size. I want to increase the number of partitions so the RDD makes better use of the cluster. Is calling repartition() afterwards the only option, or can I pass something into the method above so the RDD is created with more partitions? Please let me know. Thanks.
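
For reference, here is a rough sketch of the two approaches I have in mind (the path and split size are placeholders, and I am assuming XmlInputFormat is the helper class from the book's repo; please correct me if the config property is not the right one):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
import com.cloudera.datascience.common.XmlInputFormat

val sc = new SparkContext(new SparkConf().setAppName("wiki-lsa"))
val path = "/path/to/wikidump.xml"  // placeholder

// Option 1: shrink the input split size so FileInputFormat produces
// more splits, and hence more partitions, when the RDD is created.
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
// e.g. 8 MB max split size -> roughly 13 partitions for a 100 MB file
conf.set("mapreduce.input.fileinputformat.split.maxsize",
  (8 * 1024 * 1024).toString)
val rawXmls = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
  classOf[LongWritable], classOf[Text], conf)

// Option 2: leave readFile as-is and repartition afterwards
// (this triggers a full shuffle of the data).
val repartitioned = rawXmls.map(_._2.toString).repartition(16)
```

Is Option 1 a valid way to do this, or is Option 2 (repartition with its shuffle) really the only way?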