Hello! I have a 60 MB Parquet file containing 10 million records. When I read this file with Spark 2.3.0 and the configuration spark.sql.files.maxPartitionBytes = 1024*1024*2 (= 2 MB), I get 29 partitions, as expected. Code:

sqlContext.setConf("spark.sql.files.maxPartitionBytes", Long.toString(2097152));
DataFrame inputDataDf = sqlContext.read().parquet("10Mrecords.parquet");
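For completeness, here is a minimal self-contained version of what I run on Spark 2.3 (sketched with SparkSession / Dataset<Row>; the app name is just from my test, and I count the partitions with getNumPartitions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("parquet-partition-check")
        .getOrCreate();

// cap each input partition at ~2 MB so the 60 MB file is split into roughly 30 read tasks
spark.conf().set("spark.sql.files.maxPartitionBytes", Long.toString(2 * 1024 * 1024));

Dataset<Row> inputDataDf = spark.read().parquet("10Mrecords.parquet");

// prints 29 with Spark 2.3.0
System.out.println("partitions = " + inputDataDf.rdd().getNumPartitions());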
But when I read the same file with Spark 1.6.0, the above configuration has no effect and I get a single partition, so a single task does all the processing and there is no parallelism. I have also tried the following, without any effect:

- rewriting the Parquet file with a smaller block size in order to increase the number of row groups: sparkContext.hadoopConfiguration.setLong("parquet.block.size", 1024*50)
- lowering the Hadoop split size: sparkContext.hadoopConfiguration.setLong("mapred.max.split.size", 1024*50)

My question is: how can I achieve the same behavior (the desired number of partitions) with Spark 1.6, without calling repartition or any other method that incurs a shuffle? A full sketch of what I run on 1.6 is in the P.S. at the end.

I look forward to your answers.

Regards,
Florin
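P.S. For reference, a minimal sketch of what I currently run on Spark 1.6 (the app name and file path are just from my local test; parquet.block.size was set when rewriting the file, mapred.max.split.size when reading it):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SparkConf conf = new SparkConf().setAppName("parquet-partition-check-1.6");
JavaSparkContext jsc = new JavaSparkContext(conf);

// settings tried so far, with no visible effect on the number of input partitions
jsc.hadoopConfiguration().setLong("parquet.block.size", 1024 * 50);
jsc.hadoopConfiguration().setLong("mapred.max.split.size", 1024 * 50);

SQLContext sqlContext = new SQLContext(jsc);
DataFrame inputDataDf = sqlContext.read().parquet("10Mrecords.parquet");

// still prints 1 with Spark 1.6.0
System.out.println("partitions = " + inputDataDf.rdd().getNumPartitions());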