*Issue:*

When reading a sequence file in Spark, I can specify the number of
partitions as an argument to the API:
*public <K, V> JavaPairRDD<K, V> sequenceFile(String path, Class<K>
keyClass, Class<V> valueClass, int minPartitions)*
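
For reference, this is roughly how I read a sequence file today (a
minimal local sketch; the path and the LongWritable/Text key and value
types are just placeholders for my actual data):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SequenceFileRead {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("SequenceFileRead").setMaster("local[*]"));
        // minPartitions = 8: ask Spark for at least 8 input partitions
        JavaPairRDD<LongWritable, Text> pairs = sc.sequenceFile(
                "/data/input.seq", LongWritable.class, Text.class, 8);
        System.out.println("partitions = " + pairs.getNumPartitions());
        sc.stop();
    }
}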

*In newAPIHadoopFile(), this support has been removed. Below are the APIs:*

   - public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
   JavaPairRDD<K, V> *newAPIHadoopFile*(String path, Class<F> fClass,
   Class<K> kClass, Class<V> vClass, Configuration conf)
   - public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
   JavaPairRDD<K, V> *newAPIHadoopRDD*(Configuration conf, Class<F> fClass,
   Class<K> kClass, Class<V> vClass)
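
So for Avro, the best I can do looks like the sketch below; there is
nowhere to pass a partition count. The path is a placeholder, and I use
raw AvroKey/AvroKeyInputFormat types to keep the Java generics simple
(this compiles with an unchecked warning):

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class AvroRead {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("AvroRead").setMaster("local[*]"));
        // Unlike sequenceFile(), there is no minPartitions argument here.
        JavaPairRDD<AvroKey, NullWritable> records = sc.newAPIHadoopFile(
                "/data/input.avro",
                AvroKeyInputFormat.class,
                AvroKey.class,
                NullWritable.class,
                sc.hadoopConfiguration());
        System.out.println("partitions = " + records.getNumPartitions());
        sc.stop();
    }
}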

Is there a way to specify the number of partitions when reading an *Avro*
file using *newAPIHadoopFile()*? I explored this and found that we can
pass a Hadoop Configuration and set various Hadoop properties in it.
There we can cap the split size via the property *mapred.max.split.size*
(the value is in bytes, e.g. 52428800 for 50 MB; see the sketch below).
Based on this, Spark calculates the number of partitions, but each
partition's size may or may not end up less than or equal to the
specified size.
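
Concretely, this is the only knob I found, set on the Configuration
before the read (building on the AvroRead sketch above; the 50 MB cap is
just an example):

// Building on the AvroRead sketch above, before calling newAPIHadoopFile():
org.apache.hadoop.conf.Configuration hadoopConf = sc.hadoopConfiguration();
// Cap each input split at 50 MB; the value is a byte count, so a string
// like "50mb" would not parse. (mapred.max.split.size is the old name;
// Hadoop 2.x also accepts mapreduce.input.fileinputformat.split.maxsize.)
hadoopConf.set("mapred.max.split.size", String.valueOf(50L * 1024 * 1024));
// Spark then derives the partition count from the resulting splits, but
// as noted above, individual partitions may still end up above or below
// this cap.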

   - *Note -* I am looking for a way other than calling repartition()
   after the read.

*Execution Environment*

   - Spark (Java) version - 2.4.0
   - JDK version - 1.8
   - Spark artifactId - spark-core_2.11
   - Avro version - 1.8.2

Please help me understand why this support was removed, and how I can
control the number of partitions in this case.

Thanks,
Vatsal
