I think this would be a great addition; I totally agree that you need to be
able to set these at a finer granularity than just the SparkContext.

Just to play devil's advocate, though -- the alternative is for you to just
subclass HadoopRDD yourself, or make a totally new RDD, and then you could
expose whatever you need.  Why is this solution better?  IMO the criteria
are:
(a) common operations
(b) error-prone / difficult to implement
(c) non-obvious, but important for performance

I think this case fits (a) & (c), so I think it's still worthwhile.  But it's
also worth asking whether or not it's too difficult for a user to extend
HadoopRDD right now.  There have been several cases in the past week where
we've suggested that a user should read from HDFS themselves (e.g., to read
multiple files together in one partition) -- *without* reusing the code in
HadoopRDD, though they then lose things like the metrics tracking &
preferred locations you get from HadoopRDD.  Does HadoopRDD need some
refactoring to make that easier to do?  Or do we just need a good example?
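
To make that concrete, here's roughly what "reading from HDFS yourself" ends
up looking like -- a hypothetical MultiFileLineRDD that reads several files
per partition straight through the FileSystem API (purely illustrative, not
something that exists in Spark):

import scala.io.Source

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition = a group of input files read back-to-back.
case class FileGroupPartition(index: Int, paths: Seq[String]) extends Partition

// A bare-bones "several files per partition" RDD that goes straight to the
// FileSystem API. Note what it does *not* do: no preferred locations, no
// input metrics, no InputFormat/split handling -- exactly the things
// HadoopRDD gives you for free.
class MultiFileLineRDD(sc: SparkContext, groups: Seq[Seq[String]])
  extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](groups.size)(i => FileGroupPartition(i, groups(i)))

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    split.asInstanceOf[FileGroupPartition].paths.iterator.flatMap { p =>
      // Re-created per task because Configuration isn't serializable;
      // HadoopRDD ships it to the executors via a broadcast for you.
      val path = new Path(p)
      val fs = FileSystem.get(path.toUri, new Configuration())
      Source.fromInputStream(fs.open(path)).getLines()
    }
  }
}

Everything flagged in those comments is something HadoopRDD already handles,
which is really the trade-off in question.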

Imran

(sorry for hijacking your thread, Koert)



On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers <ko...@tresata.com> wrote:

> See the email below. Reynold suggested I send it to dev instead of user.
>
> ---------- Forwarded message ----------
> From: Koert Kuipers <ko...@tresata.com>
> Date: Mon, Mar 23, 2015 at 4:36 PM
> Subject: hadoop input/output format advanced control
> To: "u...@spark.apache.org" <u...@spark.apache.org>
>
>
> Currently it's pretty hard to control the Hadoop input/output formats used
> in Spark. The convention seems to be to add extra parameters to all
> methods, and then somewhere deep inside the code (for example in
> PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into
> settings on the Hadoop Configuration object.
>
> For example, for compression I see "codec: Option[Class[_ <:
> CompressionCodec]] = None" added to a bunch of methods.
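>
> (Concretely, that convention ends up looking like this at the call site -- a
> sketch using the existing saveAsTextFile overload that takes a codec class:)
>
> import org.apache.hadoop.io.compress.GzipCodec
>
> // The codec rides along as one more method parameter; internally it still
> // just gets translated into settings on the Hadoop conf.
> rdd.saveAsTextFile("/some/output/path", classOf[GzipCodec])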
>
> How scalable is this solution, really?
>
> For example, I need to read from a Hadoop dataset and I don't want the
> input (part) files to get split up. The way to do this is to set
> "mapred.min.split.size". Now, I don't want to set this at the level of the
> SparkContext (which can be done), since I don't want it to apply to input
> formats in general. I want it to apply to just this one specific input
> dataset I need to read, which leaves me with no options currently. I could
> go add yet another input parameter to all the methods
> (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
> etc.), but that doesn't seem scalable.
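>
> (For comparison, what I'd have to write today is roughly the following,
> dropping down to the JobConf-based hadoopRDD call and giving up the textFile
> convenience -- a sketch using the old-API TextInputFormat:)
>
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
>
> // Clone the global Hadoop conf so the tweak applies only to this one read.
> val jobConf = new JobConf(sc.hadoopConfiguration)
> FileInputFormat.setInputPaths(jobConf, "/some/path")
> jobConf.set("mapred.min.split.size", "12345")
>
> val lines = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
>   classOf[LongWritable], classOf[Text]).map(_._2.toString)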
>
> Why can we not expose a Map[String, String] or some other generic way to
> manipulate settings for Hadoop input/output formats? It would require
> adding one more parameter to all methods that deal with Hadoop input/output
> formats, but after that it's done. One parameter to rule them all....
>
> Then I could do:
>
> val x = sc.textFile("/some/path", formatSettings =
>   Map("mapred.min.split.size" -> "12345"))
>
> or:
>
> rdd.saveAsTextFile("/some/path", formatSettings =
>   Map("mapred.output.compress" -> "true",
>       "mapred.output.compression.codec" -> "somecodec"))
>
