Hello there,
I am trying to understand the difference between the following
repartition() overloads:
a. def repartition(partitionExprs: Column*): Dataset[T]
b. def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
c. def repartition(numPartitions: Int): Dataset[T]
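For reference, here is a minimal sketch of calling the three overloads side by side. It assumes a local SparkSession and a made-up two-column DataFrame (the object name, the column names "id" and "bucket", and the count of 8 are all hypothetical, chosen only for illustration). As far as I can tell from the API docs, (a) takes its partition count from spark.sql.shuffle.partitions, while (b) and (c) use the count passed in; the count actually observed for (a) may differ if adaptive query execution coalesces partitions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark.
object RepartitionOverloads {

  // Returns the partition counts observed after calling overloads (a), (b), (c).
  def partitionCounts(spark: SparkSession): (Int, Int, Int) = {
    import spark.implicits._

    // Made-up data: 1000 rows with a low-cardinality "bucket" column.
    val df = (1 to 1000).map(i => (i, i % 10)).toDF("id", "bucket")

    // (a) partition by expression only; the partition count is not given here.
    val a = df.repartition(col("bucket"))

    // (b) partition by expression into an explicit number of partitions.
    val b = df.repartition(8, col("bucket"))

    // (c) explicit number of partitions, no partitioning expression.
    val c = df.repartition(8)

    (a.rdd.getNumPartitions, b.rdd.getNumPartitions, c.rdd.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("repartition-overloads")
      .getOrCreate()
    try {
      val (a, b, c) = partitionCounts(spark)
      println(s"(a) = $a, (b) = $b, (c) = $c")
    } finally {
      spark.stop()
    }
  }
}
```

In (a) and (b), rows with the same "bucket" value end up in the same partition; in (c) no column is involved, so rows are only spread across the requested number of partitions.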
My understanding is th
Hi,
It is possible to control the number of partitions for the RDD without
calling repartition by setting the max split size for the Hadoop input
format used. Tracing through the code, XmlInputFormat extends
FileInputFormat, which determines the number of splits (which NewHadoopRDD
uses to determin
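A rough sketch of the max-split-size approach, under some assumptions: it uses Hadoop's TextInputFormat as a stand-in for XmlInputFormat (both extend the new-API FileInputFormat), and the object name, helper name, and parameters are hypothetical. The only Spark and Hadoop calls relied on are SparkContext.newAPIHadoopFile and the FileInputFormat.SPLIT_MAXSIZE configuration key.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark or Hadoop.
object MaxSplitSizeExample {

  // Reads a text file, capping each input split at maxSplitBytes.
  // A smaller cap yields more splits and therefore more RDD partitions.
  def readWithMaxSplit(sc: SparkContext, path: String, maxSplitBytes: Long): RDD[String] = {
    // Copy the context's Hadoop configuration so the setting stays local to this read.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.setLong(FileInputFormat.SPLIT_MAXSIZE, maxSplitBytes)

    sc.newAPIHadoopFile(
      path,
      classOf[TextInputFormat], // stand-in; swap in the input format actually used
      classOf[LongWritable],
      classOf[Text],
      conf
    ).map { case (_, line) => line.toString } // Text objects are reused, so copy to String
  }
}
```

For example, MaxSplitSizeExample.readWithMaxSplit(sc, path, 32L * 1024 * 1024) asks for splits of at most 32 MB each; the resulting partition count can then be checked with getNumPartitions.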
Hi All,
I am running the Wikipedia parsing example from the "Advanced
Analytics with Spark" book.
https://github.com/sryza/aas/blob/d3f62ef3ed43a59140f4ae8afbe2ef81fc643ef2/ch06-lsa/src/main/scala/com/cloudera/datascience/lsa/ParseWikipedia.scala#l112
The partitions of the RDD returned by