Hi, I have the following code where I read ORC files from HDFS. It loads a directory that contains 12 ORC files, so Spark creates 12 partitions by default. The directory is large: once the ORC files are decompressed, the data is around 10 GB. How do I increase the number of partitions for the code below so that my Spark job runs faster and does not hang for a long time shuffling 10 GB of data across only 12 partitions? Please guide.
DataFrame df = hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
df.select(...).groupBy(...);
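Here is a sketch of what I am considering, assuming the Spark 1.x Java API; the column name "someColumn" and the partition count 96 are just placeholders I made up. Is this the right way to go about it?

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class IncreaseOrcPartitions {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("orc-repartition-example");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc.sc());

        // Raise the partition count used for shuffle stages (groupBy, joins);
        // the default is 200, and 96 here is only an illustrative value.
        hiveContext.setConf("spark.sql.shuffle.partitions", "96");

        DataFrame df = hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");

        // Spread the 12 input partitions across more tasks before the wide operation.
        // repartition() triggers a full shuffle of the data.
        DataFrame repartitioned = df.repartition(96);

        // "someColumn" is a placeholder for the real grouping column.
        repartitioned.groupBy("someColumn").count().show();
    }
}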