Hi Spark people,
I have a Hive table that contains a lot of small Parquet files, and I am
creating a DataFrame out of it to do some processing. Since I have a
large number of splits/files, my job creates a lot of tasks, which I don't
want. Basically what I want is the same functionality that Hive provides,
that is, combining these small input splits into larger ones by specifying
a max split size setting. Is this currently possible with Spark?
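
For reference, what I'm doing today is roughly this (the table name is just
a placeholder); each small file ends up as its own task:

  // Reading the Hive table into a DataFrame; with many small Parquet files
  // this produces one task per file/split.
  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  val df = hiveContext.table("my_db.my_table")   // placeholder table name
  println(df.rdd.partitions.length)              // very large with many small files

  // In Hive I'd get the behavior I want with CombineHiveInputFormat plus
  // mapred.max.split.size; I'm looking for an equivalent knob in Spark.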

I looked at coalesce(), but with coalesce() I can only control the number
of output files, not their sizes. And since the total input dataset size
can vary significantly in my case, I cannot just use a fixed partition
count, as the size of each output file could get very large. I then looked
into getting the total input size from an RDD so I could come up with a
heuristic to set the partition count, but I couldn't find any way to do it
(without modifying the Spark source).
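
To make the heuristic concrete, this is the kind of thing I have in mind,
going around the RDD and asking HDFS for the size directly (the path and
the 256 MB target are made up):

  // Sketch of the heuristic: derive the partition count from the total
  // input size so that each partition ends up close to a target size.
  import org.apache.hadoop.fs.{FileSystem, Path}

  val targetBytes = 256L * 1024 * 1024                       // made-up target size
  val tablePath = new Path("/warehouse/my_db.db/my_table")   // made-up location

  val fs = FileSystem.get(sc.hadoopConfiguration)
  val totalBytes = fs.getContentSummary(tablePath).getLength

  val numPartitions = math.max(1, (totalBytes / targetBytes).toInt)
  val coalesced = df.coalesce(numPartitions)

  // Still feels like a workaround compared to a proper max split size
  // setting on the read path.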

Any help is appreciated.

Thanks,
Nezih

PS: this email is the same as my previous one; I learned that the previous
email ended up in spam for many people since I sent it through Nabble.
Sorry for the double post.
