Hi, I have a set of input files for a Spark program, with each file corresponding to a logical data partition. What is the API/mechanism to assign each input file (or a set of files) to a Spark partition when initializing RDDs?
When I create a Spark RDD pointing to the directory of files, my understanding is that it's not guaranteed each input file will be treated as a separate partition. My job's semantics require that the data is partitioned, and I want to leverage the partitioning that has already been done rather than repartitioning again in the Spark job. I tried to look this up online but haven't found any pointers so far.

Thanks,
pala