I'm processing about 10 GB of tab-delimited raw data with a few fields (page and user ID, along with a timestamp of when the user viewed the page) on a 40-node cluster, using Spark SQL to compute the number of unique visitors per page at various intervals. I'm currently reading the data with sc.textFile(), which returns an RDD that I register as a table and run the SQL query against.
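For reference, the query is essentially a COUNT(DISTINCT user_id) grouped by page. A minimal local sketch of that aggregation (the field names, column order, and sample rows here are assumptions about the data layout, not taken from the actual dataset):

```python
from collections import defaultdict

# Hypothetical rows in the assumed shape: page \t user_id \t timestamp
rows = [
    "home\tu1\t1409000000",
    "home\tu2\t1409000060",
    "home\tu1\t1409000120",   # repeat view by u1, should not double-count
    "about\tu1\t1409000180",
]

def unique_visitors_per_page(lines):
    """Count distinct user ids per page, like COUNT(DISTINCT user_id) GROUP BY page."""
    visitors = defaultdict(set)
    for line in lines:
        page, user_id, _timestamp = line.split("\t")
        visitors[page].add(user_id)
    return {page: len(users) for page, users in visitors.items()}

print(unique_visitors_per_page(rows))  # {'home': 2, 'about': 1}
```

In the distributed version, the expensive part of this aggregation is moving all records for a given page to one place, which is where partitioning comes in.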
Since the RDD is already a partitioned collection, I wanted to check whether there are examples of how to use a partitioner or repartition on the RDD, to see if they would help improve performance further. Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-partitioner-or-repartition-examples-tp11813.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
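To illustrate the idea behind partitioning by key: a hash partitioner maps each key to a fixed partition index, so every record for the same page lands in the same partition and the distinct-count can run without a further shuffle. This is a toy stdlib sketch of that mechanism, not Spark's actual HashPartitioner:

```python
def hash_partition(key, num_partitions):
    """Assign a key to a partition: hash of the key modulo partition count.
    (Python's % of a positive modulus is always non-negative.)"""
    return hash(key) % num_partitions

def partition_records(records, num_partitions):
    """Spread (page, user_id) pairs across partitions by hashing the page key,
    so all records for one page end up in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for page, user_id in records:
        partitions[hash_partition(page, num_partitions)].append((page, user_id))
    return partitions

records = [("home", "u1"), ("home", "u2"), ("about", "u1")]
parts = partition_records(records, 4)
# Both "home" records are guaranteed to share a partition.
```

Note that Python randomizes string hashes per process, so which partition a given page lands in varies from run to run; only the co-location guarantee is stable, which is the property that matters here.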