Hi,

I have a set of input files for a Spark program, with each file
corresponding to a logical data partition. What is the API/mechanism to
assign each input file (or a set of files) to its own Spark partition when
initializing RDDs?

When I create a Spark RDD pointing to the directory of files, my
understanding is that it's not guaranteed each input file will be treated
as a separate partition.
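
For concreteness, what I do today is roughly the following sketch (the
path and app name are made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("partitioned-input")
    val sc = new SparkContext(conf)

    // Each file under this directory is one logical data partition.
    val rdd = sc.textFile("hdfs:///data/input-dir")

    // The partition count here is driven by the Hadoop input splits,
    // so it does not necessarily match the number of input files.
    println(s"partitions = ${rdd.partitions.length}")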

My job semantics require partitioned data, and I want to leverage the
partitioning that has already been done rather than repartition again
inside the Spark job.
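
The closest workaround I can think of is to build one RDD per file and
union them, hoping file boundaries are preserved, but I'm not sure this is
idiomatic or that it guarantees one partition per file (the file list
below is hypothetical):

    // Hypothetical workaround: one RDD per input file, then a union.
    // Small files should each land in their own split, but I'd rather
    // use a proper API/mechanism than wire this up manually.
    val files = Seq(
      "hdfs:///data/input-dir/part-000",
      "hdfs:///data/input-dir/part-001"
    )
    val perFile = files.map(f => sc.textFile(f, 1))
    val combined = sc.union(perFile)
    println(s"partitions = ${combined.partitions.length}")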

I tried to look this up online but haven't found any pointers so far.


Thanks
pala