yeah fair enough

On Wed, Mar 25, 2015 at 2:41 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> Yeah I agree that might have been nicer, but I think for consistency
> with the input APIs maybe we should do the same thing. We can also
> give an example of how to clone sc.hadoopConfiguration and then set
> some new values:
>
> val conf = sc.hadoopConfiguration.clone()
>   .set("k1", "v1")
>   .set("k2", "v2")
>
> val rdd = sc.objectFile(..., conf)
>
> I have no idea if that's the correct syntax, but something like that
> seems almost as easy as passing a hashmap with deltas.
>
> - Patrick
>
> On Wed, Mar 25, 2015 at 6:34 AM, Koert Kuipers <ko...@tresata.com> wrote:
> > my personal preference would be something like a Map[String, String] that
> > only reflects the changes you want to make to the Configuration for the
> > given input/output format (so system-wide defaults continue to come from
> > sc.hadoopConfiguration), similarly to what cascading/scalding did, but an
> > arbitrary Configuration will work too.
> >
> > i will make a jira and pull request when i have some time.
> >
> > On Wed, Mar 25, 2015 at 1:23 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> >>
> >> I see - if you look, in the saving functions we have the option for
> >> the user to pass an arbitrary Configuration.
> >>
> >> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894
> >>
> >> It seems fine to have the same option for the loading functions, if
> >> it's easy to just pass this config into the input format.
> >>
> >> On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >> > the (compression) codec parameter that is now part of many saveAs...
> >> > methods came from a similar need. see SPARK-763
> >> > hadoop has many options like this. you are either going to have to
> >> > allow many more of these optional arguments to all the methods that
> >> > read from hadoop inputformats and write to hadoop outputformats, or
> >> > you force people to re-create these methods using HadoopRDD, i think
> >> > (if that's even possible).
> >> >
> >> > On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >> >>
> >> >> i would like to use objectFile with some tweaks to the hadoop conf.
> >> >> currently there is no way to do that, except recreating objectFile
> >> >> myself. and some of the code objectFile uses i have no access to,
> >> >> since it's private to spark.
> >> >>
> >> >> On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> >> >>>
> >> >>> Yeah - to Nick's point, I think the way to do this is to pass in a
> >> >>> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
> >> >>> field is there). Is there anything you can't do with that feature?
> >> >>>
> >> >>> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
> >> >>> <nick.pentre...@gmail.com> wrote:
> >> >>> > Imran, on your point to read multiple files together in a partition,
> >> >>> > is it not simpler to use the approach of copying the Hadoop conf and
> >> >>> > setting per-RDD settings for min split to control the input size per
> >> >>> > partition, together with something like CombineFileInputFormat?
> >> >>> >
> >> >>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid <iras...@cloudera.com> wrote:
> >> >>> >
> >> >>> >> I think this would be a great addition, I totally agree that you
> >> >>> >> need to be able to set these at a finer context than just the
> >> >>> >> SparkContext.
> >> >>> >>
> >> >>> >> Just to play devil's advocate, though -- the alternative is for you
> >> >>> >> to just subclass HadoopRDD yourself, or make a totally new RDD, and
> >> >>> >> then you could expose whatever you need. Why is this solution better?
> >> >>> >> IMO the criteria are:
> >> >>> >> (a) common operations
> >> >>> >> (b) error-prone / difficult to implement
> >> >>> >> (c) non-obvious, but important for performance
> >> >>> >>
> >> >>> >> I think this case fits (a) & (c), so I think it's still worthwhile.
> >> >>> >> But it's also worth asking whether or not it's too difficult for a
> >> >>> >> user to extend HadoopRDD right now. There have been several cases in
> >> >>> >> the past week where we've suggested that a user should read from hdfs
> >> >>> >> themselves (eg., to read multiple files together in one partition) --
> >> >>> >> with*out* reusing the code in HadoopRDD, though they would lose things
> >> >>> >> like the metric tracking & preferred locations you get from HadoopRDD.
> >> >>> >> Does HadoopRDD need some refactoring to make that easier to do? Or do
> >> >>> >> we just need a good example?
> >> >>> >>
> >> >>> >> Imran
> >> >>> >>
> >> >>> >> (sorry for hijacking your thread, Koert)
> >> >>> >>
> >> >>> >> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >> >>> >>
> >> >>> >> > see email below. reynold suggested i send it to dev instead of user
> >> >>> >> >
> >> >>> >> > ---------- Forwarded message ----------
> >> >>> >> > From: Koert Kuipers <ko...@tresata.com>
> >> >>> >> > Date: Mon, Mar 23, 2015 at 4:36 PM
> >> >>> >> > Subject: hadoop input/output format advanced control
> >> >>> >> > To: "u...@spark.apache.org" <u...@spark.apache.org>
> >> >>> >> >
> >> >>> >> > currently it's pretty hard to control the Hadoop Input/Output
> >> >>> >> > formats used in Spark. The convention seems to be to add extra
> >> >>> >> > parameters to all methods and then somewhere deep inside the code
> >> >>> >> > (for example in PairRDDFunctions.saveAsHadoopFile) all these
> >> >>> >> > parameters get translated into settings on the Hadoop Configuration
> >> >>> >> > object.
> >> >>> >> >
> >> >>> >> > for example for compression i see "codec: Option[Class[_ <:
> >> >>> >> > CompressionCodec]] = None" added to a bunch of methods.
> >> >>> >> >
> >> >>> >> > how scalable is this solution really?
> >> >>> >> >
> >> >>> >> > for example i need to read from a hadoop dataset and i don't want
> >> >>> >> > the input (part) files to get split up. the way to do this is to
> >> >>> >> > set "mapred.min.split.size". now i don't want to set this at the
> >> >>> >> > level of the SparkContext (which can be done), since i don't want
> >> >>> >> > it to apply to input formats in general. i want it to apply to just
> >> >>> >> > this one specific input dataset i need to read. which leaves me with
> >> >>> >> > no options currently. i could go add yet another input parameter to
> >> >>> >> > all the methods (SparkContext.textFile, SparkContext.hadoopFile,
> >> >>> >> > SparkContext.objectFile, etc.). but that seems ineffective.
> >> >>> >> >
> >> >>> >> > why can we not expose a Map[String, String] or some other generic
> >> >>> >> > way to manipulate settings for hadoop input/output formats? it
> >> >>> >> > would require adding one more parameter to all methods to deal with
> >> >>> >> > hadoop input/output formats, but after that it's done. one parameter
> >> >>> >> > to rule them all....
> >> >>> >> >
> >> >>> >> > then i could do:
> >> >>> >> > val x = sc.textFile("/some/path", formatSettings =
> >> >>> >> >   Map("mapred.min.split.size" -> "12345"))
> >> >>> >> >
> >> >>> >> > or
> >> >>> >> > rdd.saveAsTextFile("/some/path", formatSettings =
> >> >>> >> >   Map("mapred.output.compress" -> "true",
> >> >>> >> >     "mapred.output.compression.codec" -> "somecodec"))
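
For reference, a minimal sketch of the "copy the conf, tweak it per RDD" pattern discussed above, using only the existing APIs that already accept a Configuration (newAPIHadoopFile on SparkContext and saveAsNewAPIHadoopFile on pair RDDs); objectFile and textFile do not take one, which is exactly the gap this thread is about. One caveat on the snippet earlier in the thread: Hadoop's Configuration.set returns Unit, so chained .set(...) calls will not compile; copying via the Configuration copy constructor and setting values separately does. The paths, the split-size value, the choice of TextInputFormat/TextOutputFormat, and the assumption that sc is an existing SparkContext are all placeholders, not part of the original discussion.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    // copy the context-wide Hadoop conf so the tweak stays local to this one RDD
    val readConf = new Configuration(sc.hadoopConfiguration)
    // a huge min split size effectively means "don't split the part files";
    // "mapred.min.split.size" is the older alias of
    // "mapreduce.input.fileinputformat.split.minsize"
    readConf.set("mapred.min.split.size", Long.MaxValue.toString)

    // read with the per-RDD conf (newAPIHadoopFile accepts a Configuration)
    val lines = sc.newAPIHadoopFile(
        "/some/input/path",
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        readConf)
      .map(_._2.toString)

    // the write side works the same way: saveAsNewAPIHadoopFile takes a Configuration
    val writeConf = new Configuration(sc.hadoopConfiguration)
    writeConf.set("mapred.output.compress", "true")

    lines
      .map(line => (new Text(line), NullWritable.get()))
      .saveAsNewAPIHadoopFile(
        "/some/output/path",
        classOf[Text],
        classOf[NullWritable],
        classOf[TextOutputFormat[Text, NullWritable]],
        writeConf)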