Hi All,

I am running the Wikipedia parsing example from the "Advanced
Analytics with Spark" book.

https://github.com/sryza/aas/blob/d3f62ef3ed43a59140f4ae8afbe2ef81fc643ef2/ch06-lsa/src/main/scala/com/cloudera/datascience/lsa/ParseWikipedia.scala#l112


The partitions of the RDD returned by the readFile function (linked
above) are about 32 MB each. So with a 100 MB input file, the RDD gets
created with 4 partitions of roughly 32 MB each.
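
For context, as far as I can tell the function boils down to a
newAPIHadoopFile call over the book's XmlInputFormat (sketching it from
memory here, so treat it as approximate rather than the exact source):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import com.cloudera.datascience.common.XmlInputFormat

    // Roughly what readFile does: split the dump on <page> boundaries
    // and keep the text of each record. The partition count of the
    // result comes from the Hadoop input splits, not from Spark.
    def readFile(path: String, sc: SparkContext): RDD[String] = {
      val conf = new Configuration()
      conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
      conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
      val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
        classOf[LongWritable], classOf[Text], conf)
      kvs.map(_._2.toString)
    }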


I am running this on a standalone Spark cluster. Everything works
fine; I am just a little confused about the number of partitions and
their size.

I want to increase the number of partitions for the RDD to make better
use of the cluster. Is calling repartition() afterwards the only
option, or can I pass something into the above method so the RDD is
created with more partitions? (Sketches of both ideas below.)
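
To make the question concrete, these are the two approaches I am
weighing. The split-size property in the second one is my own guess at
the right knob, on the assumption that XmlInputFormat extends Hadoop's
FileInputFormat; please correct me if that is not how it works:

    // Option 1: repartition after loading (simple, but adds a shuffle).
    val moreParts = readFile(path, sc).repartition(16)

    // Option 2 (guess): cap the Hadoop input split size so more splits,
    // and hence more partitions, are created up front. The 8 MB value
    // is made up purely for illustration.
    val conf = new Configuration()
    conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
    conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
      8L * 1024 * 1024)
    val rawXmls = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
      classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)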

Please let me know.

Thanks.
