Hello!

I have a 60 MB Parquet file representing 10 million records.
When I read this file using Spark 2.3.0 with the configuration
spark.sql.files.maxPartitionBytes=1024*1024*2 (= 2 MB), I get 29
partitions, as expected.
Code:
 sqlContext.setConf("spark.sql.files.maxPartitionBytes", Long.toString(2097152));
 DataFrame inputDataDf = sqlContext.read().parquet("10Mrecords.parquet");


But when I read the same file with Spark 1.6.0, the above configuration
has no effect and I get a single partition, hence a single task that does
all the processing and no parallelism.
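
For reference, here is a minimal sketch (Java API, Spark 1.6.0) of the read
and of how I check the partition count; the check via
javaRDD().partitions().size() is just the way I inspect it:

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.DataFrame;
 import org.apache.spark.sql.SQLContext;

 SparkConf conf = new SparkConf().setAppName("partition-check");
 JavaSparkContext sc = new JavaSparkContext(conf);
 SQLContext sqlContext = new SQLContext(sc);

 // Same setting as in the Spark 2.3.0 snippet above; on 1.6.0 it has no effect.
 sqlContext.setConf("spark.sql.files.maxPartitionBytes", Long.toString(2097152));

 DataFrame inputDataDf = sqlContext.read().parquet("10Mrecords.parquet");

 // Prints 1 on Spark 1.6.0; the same read on 2.3.0 yields 29 partitions.
 System.out.println("partitions = " + inputDataDf.javaRDD().partitions().size());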

I have also tried the following configurations, without any effect:

Writing the Parquet file with a smaller block size, in order to increase the
number of row groups, and lowering the maximum split size:
 sparkContext.hadoopConfiguration.setLong("parquet.block.size", 1024*50)

 sparkContext.hadoopConfiguration.setLong("mapred.max.split.size", 1024*50)
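
Concretely, this is roughly how I applied those settings with the Java API
("rewritten_10Mrecords.parquet" is just a placeholder output path for the
re-written file):

 // Smaller row groups at write time, smaller splits at read time.
 sc.hadoopConfiguration().setLong("parquet.block.size", 1024 * 50);
 sc.hadoopConfiguration().setLong("mapred.max.split.size", 1024 * 50);

 DataFrame df = sqlContext.read().parquet("10Mrecords.parquet");
 df.write().parquet("rewritten_10Mrecords.parquet");  // placeholder path

 DataFrame rereadDf = sqlContext.read().parquet("rewritten_10Mrecords.parquet");
 // Still prints 1 on Spark 1.6.0.
 System.out.println("partitions = " + rereadDf.javaRDD().partitions().size());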


My question is:
How can I achieve the same behavior (get the desired number of partitions)
with Spark 1.6, without the repartition method and without any other method
that incurs a shuffle?

I look forward to your answers.
 Regards,
  Florin
