Currently it's pretty hard to control the Hadoop input/output formats used
in Spark. The convention seems to be to add extra parameters to all the
relevant methods, and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) these parameters get translated into
settings on the Hadoop Configuration object.

For example, for compression I see "codec: Option[Class[_ <:
CompressionCodec]] = None" added to a bunch of methods.
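
To make the current convention concrete, this is roughly how it looks from
the caller's side (just a sketch; the exact overloads vary by Spark version):

import org.apache.hadoop.io.compress.GzipCodec

val rdd = sc.textFile("/some/path")
// the codec class is passed per call; somewhere deep inside Spark it gets
// translated into Hadoop Configuration settings for the output format
rdd.saveAsTextFile("/some/output", classOf[GzipCodec])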

How scalable is this approach, really?

For example, I need to read from a Hadoop dataset and I don't want the input
(part) files to get split up. The way to do this is to set
"mapred.min.split.size". I don't want to set this at the level of the
SparkContext (which can be done), since I don't want it to apply to input
formats in general; I want it to apply to just this one specific input
dataset I need to read. That leaves me with no options currently. I could
go add yet another parameter to all the methods
(SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
etc.), but that doesn't scale.
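
The only per-job knob I'm aware of today is the context-wide Hadoop
configuration, roughly like this (a sketch, assuming a plain SparkContext
called sc):

// context-wide: this applies to every Hadoop read done through sc,
// not just the one dataset I care about
sc.hadoopConfiguration.set("mapred.min.split.size", "123456789") // some large value
val x = sc.textFile("/some/path")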

Why can't we expose a Map[String, String] or some other generic way to
manipulate settings for Hadoop input/output formats? It would require
adding one more parameter to every method that deals with Hadoop input/output
formats, but after that it's done. One parameter to rule them all....

Then I could do:
val x = sc.textFile("/some/path", formatSettings =
Map("mapred.min.split.size" -> "12345"))

or
rdd.saveAsTextFile("/some/path, formatSettings =
Map(mapred.output.compress" -> "true", "mapred.output.compression.codec" ->
"somecodec"))
