Yeah - to Nick's point, I think the way to do this is to pass in a custom conf when you create a Hadoop RDD (that's AFAIK why the conf field is there). Is there anything you can't do with that feature?
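For example, something along these lines should work (untested sketch, assuming a live SparkContext "sc" and the old mapred API):

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

  // Per-RDD copy of the Hadoop conf: the override applies only to this
  // one dataset, not to the SparkContext as a whole.
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("mapred.min.split.size", "12345")
  FileInputFormat.setInputPaths(jobConf, "/some/path")

  val lines = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text]).map(_._2.toString)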
On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Imran, on your point about reading multiple files together in one
> partition: is it not simpler to copy the Hadoop conf and set per-RDD
> settings for the minimum split size to control the input size per
> partition, together with something like CombineFileInputFormat?
>
> On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid <iras...@cloudera.com> wrote:
>
>> I think this would be a great addition. I totally agree that you need
>> to be able to set these at a finer granularity than just the
>> SparkContext.
>>
>> Just to play devil's advocate, though -- the alternative is for you to
>> subclass HadoopRDD yourself, or write a totally new RDD, and then you
>> could expose whatever you need. Why is this solution better? IMO the
>> criteria are:
>> (a) common operations
>> (b) error-prone / difficult to implement
>> (c) non-obvious, but important for performance
>>
>> I think this case fits (a) & (c), so I think it's still worthwhile. But
>> it's also worth asking whether it's too difficult for a user to extend
>> HadoopRDD right now. There have been several cases in the past week
>> where we've suggested that a user read from HDFS themselves (e.g., to
>> read multiple files together in one partition) *without* reusing the
>> code in HadoopRDD, though that means losing things like the metric
>> tracking and preferred locations you get from HadoopRDD. Does HadoopRDD
>> need some refactoring to make that easier to do? Or do we just need a
>> good example?
>>
>> Imran
>>
>> (sorry for hijacking your thread, Koert)
>>
>> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> > See email below. Reynold suggested I send it to dev instead of user.
>> >
>> > ---------- Forwarded message ----------
>> > From: Koert Kuipers <ko...@tresata.com>
>> > Date: Mon, Mar 23, 2015 at 4:36 PM
>> > Subject: hadoop input/output format advanced control
>> > To: "u...@spark.apache.org" <u...@spark.apache.org>
>> >
>> > Currently it's pretty hard to control the Hadoop input/output formats
>> > used in Spark. The convention seems to be to add extra parameters to
>> > all the relevant methods, and then somewhere deep inside the code (for
>> > example in PairRDDFunctions.saveAsHadoopFile) all these parameters get
>> > translated into settings on the Hadoop Configuration object.
>> >
>> > For example, for compression I see "codec: Option[Class[_ <:
>> > CompressionCodec]] = None" added to a bunch of methods.
>> >
>> > How scalable is this solution, really?
>> >
>> > For example, I need to read from a Hadoop dataset and I don't want the
>> > input (part) files to get split up. The way to do this is to set
>> > "mapred.min.split.size". Now, I don't want to set this at the level of
>> > the SparkContext (which can be done), since I don't want it to apply
>> > to input formats in general; I want it to apply to just this one
>> > specific input dataset I need to read. That leaves me with no options
>> > currently. I could go add yet another input parameter to all the
>> > methods (SparkContext.textFile, SparkContext.hadoopFile,
>> > SparkContext.objectFile, etc.), but that seems ineffective.
>> >
>> > Why can we not expose a Map[String, String] or some other generic way
>> > to manipulate settings for Hadoop input/output formats? It would
>> > require adding one more parameter to all the methods that deal with
>> > Hadoop input/output formats, but after that it's done. One parameter
>> > to rule them all....
>> >
>> > Then I could do:
>> >
>> > val x = sc.textFile("/some/path", formatSettings =
>> >   Map("mapred.min.split.size" -> "12345"))
>> >
>> > or:
>> >
>> > rdd.saveAsTextFile("/some/path", formatSettings =
>> >   Map("mapred.output.compress" -> "true",
>> >       "mapred.output.compression.codec" -> "somecodec"))
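FWIW, Koert's proposed formatSettings can be approximated on the user side today. Below is a hypothetical sketch (textFileWithSettings is not existing Spark API) that applies a generic settings map to a per-RDD copy of the Hadoop conf:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Hypothetical helper: apply a Map[String, String] of format settings
  // to a copy of the Hadoop conf, leaving sc.hadoopConfiguration untouched.
  def textFileWithSettings(sc: SparkContext, path: String,
      formatSettings: Map[String, String]): RDD[String] = {
    val conf = new Configuration(sc.hadoopConfiguration)
    formatSettings.foreach { case (k, v) => conf.set(k, v) }
    sc.newAPIHadoopFile(path, classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)
  }

  // Usage, mirroring the example above:
  // val x = textFileWithSettings(sc, "/some/path",
  //   Map("mapred.min.split.size" -> "12345"))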