Hey everyone, I have a Hive table that consists of a lot of small Parquet files, and I am creating a DataFrame out of it to do some processing. Because the table has such a large number of splits/files, my job creates a lot of tasks, which I don't want. Basically, what I want is the same functionality that Hive provides: combining these small input splits into larger ones by specifying a max split size setting. Is this currently possible with Spark?
While exploring whether I could use coalesce instead, I hit another issue. With coalesce I can only control the number of output files, not their sizes. And since the total input dataset size can vary significantly in my case, I cannot just use a fixed partition count, as each output could then get very large. I looked for a way to get the total input size from an RDD so that I could come up with some heuristic for setting the partition count, but I couldn't find one. Any help is appreciated.

Thanks,
Nezih
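P.S. To make the heuristic I have in mind concrete, here is a rough sketch in Python. It only covers the arithmetic: given a total input size in bytes, it picks a partition count so that each partition holds roughly a target number of bytes. The function name `partition_count`, the 128 MB default target, and the `total_bytes` value in the usage comment are all made up for illustration; obtaining the total input size is exactly the part I could not find a way to do.

```python
def partition_count(total_input_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Return how many partitions are needed so that each one holds
    roughly target_partition_bytes (hypothetical helper, not a Spark API)."""
    if total_input_bytes < 0:
        raise ValueError("total_input_bytes must be non-negative")
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Integer ceiling division; always return at least one partition,
    # even when the input is empty.
    needed = (total_input_bytes + target_partition_bytes - 1) // target_partition_bytes
    return max(1, needed)


# Intended use, assuming total_bytes could somehow be obtained for the input:
#   df = df.coalesce(partition_count(total_bytes))
```

With something like this the partition count would scale with the input size instead of being fixed, so the per-partition size would stay roughly constant from run to run.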