Hi,

I am using Spark 1.3 (CDH 5.4.4). What is the recipe for setting a minimum output file size when writing out from Spark SQL? So far, I have tried:

------xxxxx---------
import sqlContext.implicits._

// Disable the cached HDFS FileSystem instance so the block-size
// settings below are picked up by the writer
sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)

// Raise the target block size to 1 GB for both local and HDFS writes
sc.hadoopConfiguration.setLong("fs.local.block.size", 1073741824)
sc.hadoopConfiguration.setLong("dfs.blocksize", 1073741824)

// Cap the number of post-shuffle partitions at two
sqlContext.sql("SET spark.sql.shuffle.partitions=2")

val df = sqlContext.jsonFile("hdfs://nameservice1/user/joe/samplejson/*")
df.saveAsParquetFile("hdfs://nameservice1/user/joe/data/reduceFiles-Parquet")
------xxxxx---------
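I have also been wondering whether an explicit repartition before the write is needed, since spark.sql.shuffle.partitions only affects stages that actually shuffle, and this read-then-write job may not shuffle at all. A minimal sketch of that idea (untested; it assumes DataFrame.repartition(n) is available in 1.3 and reuses the df from above):

------xxxxx---------
// Untested sketch: force the data into two partitions before the write,
// so Parquet produces two (larger) files regardless of how many input
// splits the JSON source produced
val repartitioned = df.repartition(2)
repartitioned.saveAsParquetFile("hdfs://nameservice1/user/joe/data/reduceFiles-Parquet")
------xxxxx---------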
Either way, my output still isn't being aggregated into 1 GB+ files. What am I missing?

Thanks,
- Siddhartha