subject:"hadoop input\/output format advanced control"

Re: hadoop input/output format advanced control

2015-03-25 Thread Aaron Davidson

Should we mention that you should synchronize
on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK to avoid a possible race
condition in cloning Hadoop Configuration objects prior to Hadoop 2.7.0? :)

On Wed, Mar 25, 2015 at 7:16 PM, Patrick Wendell  wrote:

> Great - that's even easier. Maybe we could have a simple example in the
> doc.
>
> On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza 
> wrote:
> > Regarding Patrick's question, you can just do "new
> Configuration(oldConf)"
> > to get a cloned Configuration object and add any new properties to it.
> >
> > -Sandy
> >
> > On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid 
> wrote:
> >
> >> Hi Nick,
> >>
> >> I don't remember the exact details of these scenarios, but I think the
> user
> >> wanted a lot more control over how the files got grouped into
> partitions,
> >> to group the files together by some arbitrary function.  I didn't think
> >> that was possible w/ CombineFileInputFormat, but maybe there is a way?
> >>
> >> thanks
> >>
> >> On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath <
> nick.pentre...@gmail.com>
> >> wrote:
> >>
> >> > Imran, on your point to read multiple files together in a partition,
> is
> >> it
> >> > not simpler to use the approach of copy Hadoop conf and set per-RDD
> >> > settings for min split to control the input size per partition,
> together
> >> > with something like CombineFileInputFormat?
> >> >
> >> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
> >> > wrote:
> >> >
> >> > > I think this would be a great addition, I totally agree that you
> need
> >> to
> >> > be
> >> > > able to set these at a finer context than just the SparkContext.
> >> > >
> >> > > Just to play devil's advocate, though -- the alternative is for you
> >> just
> >> > > subclass HadoopRDD yourself, or make a totally new RDD, and then you
> >> > could
> >> > > expose whatever you need.  Why is this solution better?  IMO the
> >> criteria
> >> > > are:
> >> > > (a) common operations
> >> > > (b) error-prone / difficult to implement
> >> > > (c) non-obvious, but important for performance
> >> > >
> >> > > I think this case fits (a) & (c), so I think its still worthwhile.
> But
> >> > its
> >> > > also worth asking whether or not its too difficult for a user to
> extend
> >> > > HadoopRDD right now.  There have been several cases in the past week
> >> > where
> >> > > we've suggested that a user should read from hdfs themselves (eg.,
> to
> >> > read
> >> > > multiple files together in one partition) -- with*out* reusing the
> code
> >> > in
> >> > > HadoopRDD, though they would lose things like the metric tracking &
> >> > > preferred locations you get from HadoopRDD.  Does HadoopRDD need to
> >> some
> >> > > refactoring to make that easier to do?  Or do we just need a good
> >> > example?
> >> > >
> >> > > Imran
> >> > >
> >> > > (sorry for hijacking your thread, Koert)
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
> >> > wrote:
> >> > >
> >> > > > see email below. reynold suggested i send it to dev instead of
> user
> >> > > >
> >> > > > -- Forwarded message --
> >> > > > From: Koert Kuipers 
> >> > > > Date: Mon, Mar 23, 2015 at 4:36 PM
> >> > > > Subject: hadoop input/output format advanced control
> >> > > > To: "u...@spark.apache.org" 
> >> > > >
> >> > > >
> >> > > > currently its pretty hard to control the Hadoop Input/Output
> formats
> >> > used
> >> > > > in Spark. The conventions seems to be to add extra parameters to
> all
> >> > > > methods and then somewhere deep inside the code (for example in
> >> > > > PairRDDFunctions.saveAsHadoopFile) all these parameters get
> >> translated
> >> > > into
> >> > > > settings on the Hadoop Configuration object.
> >> > > >
> >> >

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell

Great - that's even easier. Maybe we could have a simple example in the doc.

On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza  wrote:
> Regarding Patrick's question, you can just do "new Configuration(oldConf)"
> to get a cloned Configuration object and add any new properties to it.
>
> -Sandy
>
> On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid  wrote:
>
>> Hi Nick,
>>
>> I don't remember the exact details of these scenarios, but I think the user
>> wanted a lot more control over how the files got grouped into partitions,
>> to group the files together by some arbitrary function.  I didn't think
>> that was possible w/ CombineFileInputFormat, but maybe there is a way?
>>
>> thanks
>>
>> On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath 
>> wrote:
>>
>> > Imran, on your point to read multiple files together in a partition, is
>> it
>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
>> > settings for min split to control the input size per partition, together
>> > with something like CombineFileInputFormat?
>> >
>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
>> > wrote:
>> >
>> > > I think this would be a great addition, I totally agree that you need
>> to
>> > be
>> > > able to set these at a finer context than just the SparkContext.
>> > >
>> > > Just to play devil's advocate, though -- the alternative is for you
>> just
>> > > subclass HadoopRDD yourself, or make a totally new RDD, and then you
>> > could
>> > > expose whatever you need.  Why is this solution better?  IMO the
>> criteria
>> > > are:
>> > > (a) common operations
>> > > (b) error-prone / difficult to implement
>> > > (c) non-obvious, but important for performance
>> > >
>> > > I think this case fits (a) & (c), so I think its still worthwhile.  But
>> > its
>> > > also worth asking whether or not its too difficult for a user to extend
>> > > HadoopRDD right now.  There have been several cases in the past week
>> > where
>> > > we've suggested that a user should read from hdfs themselves (eg., to
>> > read
>> > > multiple files together in one partition) -- with*out* reusing the code
>> > in
>> > > HadoopRDD, though they would lose things like the metric tracking &
>> > > preferred locations you get from HadoopRDD.  Does HadoopRDD need to
>> some
>> > > refactoring to make that easier to do?  Or do we just need a good
>> > example?
>> > >
>> > > Imran
>> > >
>> > > (sorry for hijacking your thread, Koert)
>> > >
>> > >
>> > >
>> > > On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
>> > wrote:
>> > >
>> > > > see email below. reynold suggested i send it to dev instead of user
>> > > >
>> > > > -- Forwarded message --
>> > > > From: Koert Kuipers 
>> > > > Date: Mon, Mar 23, 2015 at 4:36 PM
>> > > > Subject: hadoop input/output format advanced control
>> > > > To: "u...@spark.apache.org" 
>> > > >
>> > > >
>> > > > currently its pretty hard to control the Hadoop Input/Output formats
>> > used
>> > > > in Spark. The conventions seems to be to add extra parameters to all
>> > > > methods and then somewhere deep inside the code (for example in
>> > > > PairRDDFunctions.saveAsHadoopFile) all these parameters get
>> translated
>> > > into
>> > > > settings on the Hadoop Configuration object.
>> > > >
>> > > > for example for compression i see "codec: Option[Class[_ <:
>> > > > CompressionCodec]] = None" added to a bunch of methods.
>> > > >
>> > > > how scalable is this solution really?
>> > > >
>> > > > for example i need to read from a hadoop dataset and i dont want the
>> > > input
>> > > > (part) files to get split up. the way to do this is to set
>> > > > "mapred.min.split.size". now i dont want to set this at the level of
>> > the
>> > > > SparkContext (which can be done), since i dont want it to apply to
>> > input
>> > > > formats in general. i want it to apply to just this one specific
>> input
>

Re: hadoop input/output format advanced control

2015-03-25 Thread Sandy Ryza

Regarding Patrick's question, you can just do "new Configuration(oldConf)"
to get a cloned Configuration object and add any new properties to it.

-Sandy

On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid  wrote:

> Hi Nick,
>
> I don't remember the exact details of these scenarios, but I think the user
> wanted a lot more control over how the files got grouped into partitions,
> to group the files together by some arbitrary function.  I didn't think
> that was possible w/ CombineFileInputFormat, but maybe there is a way?
>
> thanks
>
> On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath 
> wrote:
>
> > Imran, on your point to read multiple files together in a partition, is
> it
> > not simpler to use the approach of copy Hadoop conf and set per-RDD
> > settings for min split to control the input size per partition, together
> > with something like CombineFileInputFormat?
> >
> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
> > wrote:
> >
> > > I think this would be a great addition, I totally agree that you need
> to
> > be
> > > able to set these at a finer context than just the SparkContext.
> > >
> > > Just to play devil's advocate, though -- the alternative is for you
> just
> > > subclass HadoopRDD yourself, or make a totally new RDD, and then you
> > could
> > > expose whatever you need.  Why is this solution better?  IMO the
> criteria
> > > are:
> > > (a) common operations
> > > (b) error-prone / difficult to implement
> > > (c) non-obvious, but important for performance
> > >
> > > I think this case fits (a) & (c), so I think its still worthwhile.  But
> > its
> > > also worth asking whether or not its too difficult for a user to extend
> > > HadoopRDD right now.  There have been several cases in the past week
> > where
> > > we've suggested that a user should read from hdfs themselves (eg., to
> > read
> > > multiple files together in one partition) -- with*out* reusing the code
> > in
> > > HadoopRDD, though they would lose things like the metric tracking &
> > > preferred locations you get from HadoopRDD.  Does HadoopRDD need to
> some
> > > refactoring to make that easier to do?  Or do we just need a good
> > example?
> > >
> > > Imran
> > >
> > > (sorry for hijacking your thread, Koert)
> > >
> > >
> > >
> > > On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
> > wrote:
> > >
> > > > see email below. reynold suggested i send it to dev instead of user
> > > >
> > > > -- Forwarded message --
> > > > From: Koert Kuipers 
> > > > Date: Mon, Mar 23, 2015 at 4:36 PM
> > > > Subject: hadoop input/output format advanced control
> > > > To: "u...@spark.apache.org" 
> > > >
> > > >
> > > > currently its pretty hard to control the Hadoop Input/Output formats
> > used
> > > > in Spark. The conventions seems to be to add extra parameters to all
> > > > methods and then somewhere deep inside the code (for example in
> > > > PairRDDFunctions.saveAsHadoopFile) all these parameters get
> translated
> > > into
> > > > settings on the Hadoop Configuration object.
> > > >
> > > > for example for compression i see "codec: Option[Class[_ <:
> > > > CompressionCodec]] = None" added to a bunch of methods.
> > > >
> > > > how scalable is this solution really?
> > > >
> > > > for example i need to read from a hadoop dataset and i dont want the
> > > input
> > > > (part) files to get split up. the way to do this is to set
> > > > "mapred.min.split.size". now i dont want to set this at the level of
> > the
> > > > SparkContext (which can be done), since i dont want it to apply to
> > input
> > > > formats in general. i want it to apply to just this one specific
> input
> > > > dataset i need to read. which leaves me with no options currently. i
> > > could
> > > > go add yet another input parameter to all the methods
> > > > (SparkContext.textFile, SparkContext.hadoopFile,
> > SparkContext.objectFile,
> > > > etc.). but that seems ineffective.
> > > >
> > > > why can we not expose a Map[String, String] or some other generic way
> > to
> > > > manipulate settings for hadoop input/output formats? it would require
> > > > adding one more parameter to all methods to deal with hadoop
> > input/output
> > > > formats, but after that its done. one parameter to rule them all
> > > >
> > > > then i could do:
> > > > val x = sc.textFile("/some/path", formatSettings =
> > > > Map("mapred.min.split.size" -> "12345"))
> > > >
> > > > or
> > > > rdd.saveAsTextFile("/some/path, formatSettings =
> > > > Map(mapred.output.compress" -> "true",
> > "mapred.output.compression.codec"
> > > ->
> > > > "somecodec"))
> > > >
> > >
> >
>

Re: hadoop input/output format advanced control

2015-03-25 Thread Imran Rashid

Hi Nick,

I don't remember the exact details of these scenarios, but I think the user
wanted a lot more control over how the files got grouped into partitions,
to group the files together by some arbitrary function.  I didn't think
that was possible w/ CombineFileInputFormat, but maybe there is a way?

thanks

On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath 
wrote:

> Imran, on your point to read multiple files together in a partition, is it
> not simpler to use the approach of copy Hadoop conf and set per-RDD
> settings for min split to control the input size per partition, together
> with something like CombineFileInputFormat?
>
> On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
> wrote:
>
> > I think this would be a great addition, I totally agree that you need to
> be
> > able to set these at a finer context than just the SparkContext.
> >
> > Just to play devil's advocate, though -- the alternative is for you just
> > subclass HadoopRDD yourself, or make a totally new RDD, and then you
> could
> > expose whatever you need.  Why is this solution better?  IMO the criteria
> > are:
> > (a) common operations
> > (b) error-prone / difficult to implement
> > (c) non-obvious, but important for performance
> >
> > I think this case fits (a) & (c), so I think its still worthwhile.  But
> its
> > also worth asking whether or not its too difficult for a user to extend
> > HadoopRDD right now.  There have been several cases in the past week
> where
> > we've suggested that a user should read from hdfs themselves (eg., to
> read
> > multiple files together in one partition) -- with*out* reusing the code
> in
> > HadoopRDD, though they would lose things like the metric tracking &
> > preferred locations you get from HadoopRDD.  Does HadoopRDD need to some
> > refactoring to make that easier to do?  Or do we just need a good
> example?
> >
> > Imran
> >
> > (sorry for hijacking your thread, Koert)
> >
> >
> >
> > On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
> wrote:
> >
> > > see email below. reynold suggested i send it to dev instead of user
> > >
> > > -- Forwarded message --
> > > From: Koert Kuipers 
> > > Date: Mon, Mar 23, 2015 at 4:36 PM
> > > Subject: hadoop input/output format advanced control
> > > To: "u...@spark.apache.org" 
> > >
> > >
> > > currently its pretty hard to control the Hadoop Input/Output formats
> used
> > > in Spark. The conventions seems to be to add extra parameters to all
> > > methods and then somewhere deep inside the code (for example in
> > > PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
> > into
> > > settings on the Hadoop Configuration object.
> > >
> > > for example for compression i see "codec: Option[Class[_ <:
> > > CompressionCodec]] = None" added to a bunch of methods.
> > >
> > > how scalable is this solution really?
> > >
> > > for example i need to read from a hadoop dataset and i dont want the
> > input
> > > (part) files to get split up. the way to do this is to set
> > > "mapred.min.split.size". now i dont want to set this at the level of
> the
> > > SparkContext (which can be done), since i dont want it to apply to
> input
> > > formats in general. i want it to apply to just this one specific input
> > > dataset i need to read. which leaves me with no options currently. i
> > could
> > > go add yet another input parameter to all the methods
> > > (SparkContext.textFile, SparkContext.hadoopFile,
> SparkContext.objectFile,
> > > etc.). but that seems ineffective.
> > >
> > > why can we not expose a Map[String, String] or some other generic way
> to
> > > manipulate settings for hadoop input/output formats? it would require
> > > adding one more parameter to all methods to deal with hadoop
> input/output
> > > formats, but after that its done. one parameter to rule them all
> > >
> > > then i could do:
> > > val x = sc.textFile("/some/path", formatSettings =
> > > Map("mapred.min.split.size" -> "12345"))
> > >
> > > or
> > > rdd.saveAsTextFile("/some/path, formatSettings =
> > > Map(mapred.output.compress" -> "true",
> "mapred.output.compression.codec"
> > ->
> > > "somecodec"))
> > >
> >
>

Re: hadoop input/output format advanced control

2015-03-25 Thread Koert Kuipers

 >> >>> >> expose whatever you need.  Why is this solution better?  IMO the
> >> >>> >> criteria
> >> >>> >> are:
> >> >>> >> (a) common operations
> >> >>> >> (b) error-prone / difficult to implement
> >> >>> >> (c) non-obvious, but important for performance
> >> >>> >>
> >> >>> >> I think this case fits (a) & (c), so I think its still
> worthwhile.
> >> >>> >> But its
> >> >>> >> also worth asking whether or not its too difficult for a user to
> >> >>> >> extend
> >> >>> >> HadoopRDD right now.  There have been several cases in the past
> >> >>> >> week
> >> >>> >> where
> >> >>> >> we've suggested that a user should read from hdfs themselves
> (eg.,
> >> >>> >> to
> >> >>> >> read
> >> >>> >> multiple files together in one partition) -- with*out* reusing
> the
> >> >>> >> code in
> >> >>> >> HadoopRDD, though they would lose things like the metric
> tracking &
> >> >>> >> preferred locations you get from HadoopRDD.  Does HadoopRDD need
> to
> >> >>> >> some
> >> >>> >> refactoring to make that easier to do?  Or do we just need a good
> >> >>> >> example?
> >> >>> >>
> >> >>> >> Imran
> >> >>> >>
> >> >>> >> (sorry for hijacking your thread, Koert)
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers <
> ko...@tresata.com>
> >> >>> >> wrote:
> >> >>> >>
> >> >>> >> > see email below. reynold suggested i send it to dev instead of
> >> >>> >> > user
> >> >>> >> >
> >> >>> >> > -- Forwarded message --
> >> >>> >> > From: Koert Kuipers 
> >> >>> >> > Date: Mon, Mar 23, 2015 at 4:36 PM
> >> >>> >> > Subject: hadoop input/output format advanced control
> >> >>> >> > To: "u...@spark.apache.org" 
> >> >>> >> >
> >> >>> >> >
> >> >>> >> > currently its pretty hard to control the Hadoop Input/Output
> >> >>> >> > formats
> >> >>> >> > used
> >> >>> >> > in Spark. The conventions seems to be to add extra parameters
> to
> >> >>> >> > all
> >> >>> >> > methods and then somewhere deep inside the code (for example in
> >> >>> >> > PairRDDFunctions.saveAsHadoopFile) all these parameters get
> >> >>> >> > translated
> >> >>> >> into
> >> >>> >> > settings on the Hadoop Configuration object.
> >> >>> >> >
> >> >>> >> > for example for compression i see "codec: Option[Class[_ <:
> >> >>> >> > CompressionCodec]] = None" added to a bunch of methods.
> >> >>> >> >
> >> >>> >> > how scalable is this solution really?
> >> >>> >> >
> >> >>> >> > for example i need to read from a hadoop dataset and i dont
> want
> >> >>> >> > the
> >> >>> >> input
> >> >>> >> > (part) files to get split up. the way to do this is to set
> >> >>> >> > "mapred.min.split.size". now i dont want to set this at the
> level
> >> >>> >> > of
> >> >>> >> > the
> >> >>> >> > SparkContext (which can be done), since i dont want it to apply
> >> >>> >> > to
> >> >>> >> > input
> >> >>> >> > formats in general. i want it to apply to just this one
> specific
> >> >>> >> > input
> >> >>> >> > dataset i need to read. which leaves me with no options
> >> >>> >> > currently. i
> >> >>> >> could
> >> >>> >> > go add yet another input parameter to all the methods
> >> >>> >> > (SparkContext.textFile, SparkContext.hadoopFile,
> >> >>> >> > SparkContext.objectFile,
> >> >>> >> > etc.). but that seems ineffective.
> >> >>> >> >
> >> >>> >> > why can we not expose a Map[String, String] or some other
> generic
> >> >>> >> > way to
> >> >>> >> > manipulate settings for hadoop input/output formats? it would
> >> >>> >> > require
> >> >>> >> > adding one more parameter to all methods to deal with hadoop
> >> >>> >> > input/output
> >> >>> >> > formats, but after that its done. one parameter to rule them
> >> >>> >> > all
> >> >>> >> >
> >> >>> >> > then i could do:
> >> >>> >> > val x = sc.textFile("/some/path", formatSettings =
> >> >>> >> > Map("mapred.min.split.size" -> "12345"))
> >> >>> >> >
> >> >>> >> > or
> >> >>> >> > rdd.saveAsTextFile("/some/path, formatSettings =
> >> >>> >> > Map(mapred.output.compress" -> "true",
> >> >>> >> > "mapred.output.compression.codec"
> >> >>> >> ->
> >> >>> >> > "somecodec"))
> >> >>> >> >
> >> >>> >>
> >> >>>
> >> >>>
> -
> >> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >> >>>
> >> >>
> >> >
> >
> >
>

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell

Yeah I agree that might have been nicer, but I think for consistency
with the input API's maybe we should do the same thing. We can also
give an example of how to clone sc.hadoopConfiguration and then set
some new values:

val conf = sc.hadoopConfiguration.clone()
  .set("k1", "v1")
  .set("k2", "v2")

val rdd = sc.objectFile(..., conf)

I have no idea if that's the correct syntax, but something like that
seems almost as easy as passing a hashmap with deltas.

- Patrick

On Wed, Mar 25, 2015 at 6:34 AM, Koert Kuipers  wrote:
> my personal preference would be something like a Map[String, String] that
> only reflects the changes you want to make the Configuration for the given
> input/output format (so system wide defaults continue to come from
> sc.hadoopConfiguration), similarly to what cascading/scalding did, but am
> arbitrary Configuration will work too.
>
> i will make a jira and pullreq when i have some time.
>
>
>
> On Wed, Mar 25, 2015 at 1:23 AM, Patrick Wendell  wrote:
>>
>> I see - if you look, in the saving functions we have the option for
>> the user to pass an arbitrary Configuration.
>>
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894
>>
>> It seems fine to have the same option for the loading functions, if
>> it's easy to just pass this config into the input format.
>>
>>
>>
>> On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers  wrote:
>> > the (compression) codec parameter that is now part of many saveAs...
>> > methods
>> > came from a similar need. see SPARK-763
>> > hadoop has many options like this. you either going to have to allow
>> > many
>> > more of these optional arguments to all the methods that read from
>> > hadoop
>> > inputformats and write to hadoop outputformats, or you force people to
>> > re-create these methods using HadoopRDD, i think (if thats even
>> > possible).
>> >
>> > On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers 
>> > wrote:
>> >>
>> >> i would like to use objectFile with some tweaks to the hadoop conf.
>> >> currently there is no way to do that, except recreating objectFile
>> >> myself.
>> >> and some of the code objectFile uses i have no access to, since its
>> >> private
>> >> to spark.
>> >>
>> >>
>> >> On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell 
>> >> wrote:
>> >>>
>> >>> Yeah - to Nick's point, I think the way to do this is to pass in a
>> >>> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
>> >>> field is there). Is there anything you can't do with that feature?
>> >>>
>> >>> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
>> >>>  wrote:
>> >>> > Imran, on your point to read multiple files together in a partition,
>> >>> > is
>> >>> > it
>> >>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
>> >>> > settings for min split to control the input size per partition,
>> >>> > together
>> >>> > with something like CombineFileInputFormat?
>> >>> >
>> >>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
>> >>> > wrote:
>> >>> >
>> >>> >> I think this would be a great addition, I totally agree that you
>> >>> >> need
>> >>> >> to be
>> >>> >> able to set these at a finer context than just the SparkContext.
>> >>> >>
>> >>> >> Just to play devil's advocate, though -- the alternative is for you
>> >>> >> just
>> >>> >> subclass HadoopRDD yourself, or make a totally new RDD, and then
>> >>> >> you
>> >>> >> could
>> >>> >> expose whatever you need.  Why is this solution better?  IMO the
>> >>> >> criteria
>> >>> >> are:
>> >>> >> (a) common operations
>> >>> >> (b) error-prone / difficult to implement
>> >>> >> (c) non-obvious, but important for performance
>> >>> >>
>> >>> >> I think this case fits (a) & (c), so I think its still worthwhile.
>> >>> >> But its
>> >>> >> also worth asking whether or not

Re: hadoop input/output format advanced control

2015-03-25 Thread Koert Kuipers

my personal preference would be something like a Map[String, String] that
only reflects the changes you want to make the Configuration for the given
input/output format (so system wide defaults continue to come from
sc.hadoopConfiguration), similarly to what cascading/scalding did, but am
arbitrary Configuration will work too.

i will make a jira and pullreq when i have some time.



On Wed, Mar 25, 2015 at 1:23 AM, Patrick Wendell  wrote:

> I see - if you look, in the saving functions we have the option for
> the user to pass an arbitrary Configuration.
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894
>
> It seems fine to have the same option for the loading functions, if
> it's easy to just pass this config into the input format.
>
>
>
> On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers  wrote:
> > the (compression) codec parameter that is now part of many saveAs...
> methods
> > came from a similar need. see SPARK-763
> > hadoop has many options like this. you either going to have to allow many
> > more of these optional arguments to all the methods that read from hadoop
> > inputformats and write to hadoop outputformats, or you force people to
> > re-create these methods using HadoopRDD, i think (if thats even
> possible).
> >
> > On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers 
> wrote:
> >>
> >> i would like to use objectFile with some tweaks to the hadoop conf.
> >> currently there is no way to do that, except recreating objectFile
> myself.
> >> and some of the code objectFile uses i have no access to, since its
> private
> >> to spark.
> >>
> >>
> >> On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell 
> >> wrote:
> >>>
> >>> Yeah - to Nick's point, I think the way to do this is to pass in a
> >>> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
> >>> field is there). Is there anything you can't do with that feature?
> >>>
> >>> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
> >>>  wrote:
> >>> > Imran, on your point to read multiple files together in a partition,
> is
> >>> > it
> >>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
> >>> > settings for min split to control the input size per partition,
> >>> > together
> >>> > with something like CombineFileInputFormat?
> >>> >
> >>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
> >>> > wrote:
> >>> >
> >>> >> I think this would be a great addition, I totally agree that you
> need
> >>> >> to be
> >>> >> able to set these at a finer context than just the SparkContext.
> >>> >>
> >>> >> Just to play devil's advocate, though -- the alternative is for you
> >>> >> just
> >>> >> subclass HadoopRDD yourself, or make a totally new RDD, and then you
> >>> >> could
> >>> >> expose whatever you need.  Why is this solution better?  IMO the
> >>> >> criteria
> >>> >> are:
> >>> >> (a) common operations
> >>> >> (b) error-prone / difficult to implement
> >>> >> (c) non-obvious, but important for performance
> >>> >>
> >>> >> I think this case fits (a) & (c), so I think its still worthwhile.
> >>> >> But its
> >>> >> also worth asking whether or not its too difficult for a user to
> >>> >> extend
> >>> >> HadoopRDD right now.  There have been several cases in the past week
> >>> >> where
> >>> >> we've suggested that a user should read from hdfs themselves (eg.,
> to
> >>> >> read
> >>> >> multiple files together in one partition) -- with*out* reusing the
> >>> >> code in
> >>> >> HadoopRDD, though they would lose things like the metric tracking &
> >>> >> preferred locations you get from HadoopRDD.  Does HadoopRDD need to
> >>> >> some
> >>> >> refactoring to make that easier to do?  Or do we just need a good
> >>> >> example?
> >>> >>
> >>> >> Imran
> >>> >>
> >>> >> (sorry for hijacking your thread, Koert)
> >>> >>
> >>> >>
> >>> >>
>

Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell

I see - if you look, in the saving functions we have the option for
the user to pass an arbitrary Configuration.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894

It seems fine to have the same option for the loading functions, if
it's easy to just pass this config into the input format.



On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers  wrote:
> the (compression) codec parameter that is now part of many saveAs... methods
> came from a similar need. see SPARK-763
> hadoop has many options like this. you either going to have to allow many
> more of these optional arguments to all the methods that read from hadoop
> inputformats and write to hadoop outputformats, or you force people to
> re-create these methods using HadoopRDD, i think (if thats even possible).
>
> On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers  wrote:
>>
>> i would like to use objectFile with some tweaks to the hadoop conf.
>> currently there is no way to do that, except recreating objectFile myself.
>> and some of the code objectFile uses i have no access to, since its private
>> to spark.
>>
>>
>> On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell 
>> wrote:
>>>
>>> Yeah - to Nick's point, I think the way to do this is to pass in a
>>> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
>>> field is there). Is there anything you can't do with that feature?
>>>
>>> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
>>>  wrote:
>>> > Imran, on your point to read multiple files together in a partition, is
>>> > it
>>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
>>> > settings for min split to control the input size per partition,
>>> > together
>>> > with something like CombineFileInputFormat?
>>> >
>>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
>>> > wrote:
>>> >
>>> >> I think this would be a great addition, I totally agree that you need
>>> >> to be
>>> >> able to set these at a finer context than just the SparkContext.
>>> >>
>>> >> Just to play devil's advocate, though -- the alternative is for you
>>> >> just
>>> >> subclass HadoopRDD yourself, or make a totally new RDD, and then you
>>> >> could
>>> >> expose whatever you need.  Why is this solution better?  IMO the
>>> >> criteria
>>> >> are:
>>> >> (a) common operations
>>> >> (b) error-prone / difficult to implement
>>> >> (c) non-obvious, but important for performance
>>> >>
>>> >> I think this case fits (a) & (c), so I think its still worthwhile.
>>> >> But its
>>> >> also worth asking whether or not its too difficult for a user to
>>> >> extend
>>> >> HadoopRDD right now.  There have been several cases in the past week
>>> >> where
>>> >> we've suggested that a user should read from hdfs themselves (eg., to
>>> >> read
>>> >> multiple files together in one partition) -- with*out* reusing the
>>> >> code in
>>> >> HadoopRDD, though they would lose things like the metric tracking &
>>> >> preferred locations you get from HadoopRDD.  Does HadoopRDD need to
>>> >> some
>>> >> refactoring to make that easier to do?  Or do we just need a good
>>> >> example?
>>> >>
>>> >> Imran
>>> >>
>>> >> (sorry for hijacking your thread, Koert)
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
>>> >> wrote:
>>> >>
>>> >> > see email below. reynold suggested i send it to dev instead of user
>>> >> >
>>> >> > -- Forwarded message --
>>> >> > From: Koert Kuipers 
>>> >> > Date: Mon, Mar 23, 2015 at 4:36 PM
>>> >> > Subject: hadoop input/output format advanced control
>>> >> > To: "u...@spark.apache.org" 
>>> >> >
>>> >> >
>>> >> > currently its pretty hard to control the Hadoop Input/Output formats
>>> >> > used
>>> >> > in Spark. The conventions seems to be to add extra parameters to all
>>> >> > methods and then somewhere deep

Re: hadoop input/output format advanced control

2015-03-24 Thread Koert Kuipers

the (compression) codec parameter that is now part of many saveAs...
methods came from a similar need. see SPARK-763
<https://issues.apache.org/jira/browse/SPARK-763>
hadoop has many options like this. you either going to have to allow many
more of these optional arguments to all the methods that read from hadoop
inputformats and write to hadoop outputformats, or you force people to
re-create these methods using HadoopRDD, i think (if thats even possible).

On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers  wrote:

> i would like to use objectFile with some tweaks to the hadoop conf.
> currently there is no way to do that, except recreating objectFile myself.
> and some of the code objectFile uses i have no access to, since its private
> to spark.
>
>
> On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell 
> wrote:
>
>> Yeah - to Nick's point, I think the way to do this is to pass in a
>> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
>> field is there). Is there anything you can't do with that feature?
>>
>> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
>>  wrote:
>> > Imran, on your point to read multiple files together in a partition, is
>> it
>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
>> > settings for min split to control the input size per partition, together
>> > with something like CombineFileInputFormat?
>> >
>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
>> wrote:
>> >
>> >> I think this would be a great addition, I totally agree that you need
>> to be
>> >> able to set these at a finer context than just the SparkContext.
>> >>
>> >> Just to play devil's advocate, though -- the alternative is for you
>> just
>> >> subclass HadoopRDD yourself, or make a totally new RDD, and then you
>> could
>> >> expose whatever you need.  Why is this solution better?  IMO the
>> criteria
>> >> are:
>> >> (a) common operations
>> >> (b) error-prone / difficult to implement
>> >> (c) non-obvious, but important for performance
>> >>
>> >> I think this case fits (a) & (c), so I think its still worthwhile.
>> But its
>> >> also worth asking whether or not its too difficult for a user to extend
>> >> HadoopRDD right now.  There have been several cases in the past week
>> where
>> >> we've suggested that a user should read from hdfs themselves (eg., to
>> read
>> >> multiple files together in one partition) -- with*out* reusing the
>> code in
>> >> HadoopRDD, though they would lose things like the metric tracking &
>> >> preferred locations you get from HadoopRDD.  Does HadoopRDD need to
>> some
>> >> refactoring to make that easier to do?  Or do we just need a good
>> example?
>> >>
>> >> Imran
>> >>
>> >> (sorry for hijacking your thread, Koert)
>> >>
>> >>
>> >>
>> >> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
>> wrote:
>> >>
>> >> > see email below. reynold suggested i send it to dev instead of user
>> >> >
>> >> > -- Forwarded message --
>> >> > From: Koert Kuipers 
>> >> > Date: Mon, Mar 23, 2015 at 4:36 PM
>> >> > Subject: hadoop input/output format advanced control
>> >> > To: "u...@spark.apache.org" 
>> >> >
>> >> >
>> >> > currently its pretty hard to control the Hadoop Input/Output formats
>> used
>> >> > in Spark. The conventions seems to be to add extra parameters to all
>> >> > methods and then somewhere deep inside the code (for example in
>> >> > PairRDDFunctions.saveAsHadoopFile) all these parameters get
>> translated
>> >> into
>> >> > settings on the Hadoop Configuration object.
>> >> >
>> >> > for example for compression i see "codec: Option[Class[_ <:
>> >> > CompressionCodec]] = None" added to a bunch of methods.
>> >> >
>> >> > how scalable is this solution really?
>> >> >
>> >> > for example i need to read from a hadoop dataset and i dont want the
>> >> input
>> >> > (part) files to get split up. the way to do this is to set
>> >> > "mapred.min.split.size". now i dont want to set this at the level of
>> the
>> >> > SparkC

Re: hadoop input/output format advanced control

2015-03-24 Thread Koert Kuipers

i would like to use objectFile with some tweaks to the hadoop conf.
currently there is no way to do that, except recreating objectFile myself.
and some of the code objectFile uses i have no access to, since its private
to spark.


On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell  wrote:

> Yeah - to Nick's point, I think the way to do this is to pass in a
> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
> field is there). Is there anything you can't do with that feature?
>
> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
>  wrote:
> > Imran, on your point to read multiple files together in a partition, is
> it
> > not simpler to use the approach of copy Hadoop conf and set per-RDD
> > settings for min split to control the input size per partition, together
> > with something like CombineFileInputFormat?
> >
> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
> wrote:
> >
> >> I think this would be a great addition, I totally agree that you need
> to be
> >> able to set these at a finer context than just the SparkContext.
> >>
> >> Just to play devil's advocate, though -- the alternative is for you just
> >> subclass HadoopRDD yourself, or make a totally new RDD, and then you
> could
> >> expose whatever you need.  Why is this solution better?  IMO the
> criteria
> >> are:
> >> (a) common operations
> >> (b) error-prone / difficult to implement
> >> (c) non-obvious, but important for performance
> >>
> >> I think this case fits (a) & (c), so I think its still worthwhile.  But
> its
> >> also worth asking whether or not its too difficult for a user to extend
> >> HadoopRDD right now.  There have been several cases in the past week
> where
> >> we've suggested that a user should read from hdfs themselves (eg., to
> read
> >> multiple files together in one partition) -- with*out* reusing the code
> in
> >> HadoopRDD, though they would lose things like the metric tracking &
> >> preferred locations you get from HadoopRDD.  Does HadoopRDD need to some
> >> refactoring to make that easier to do?  Or do we just need a good
> example?
> >>
> >> Imran
> >>
> >> (sorry for hijacking your thread, Koert)
> >>
> >>
> >>
> >> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
> wrote:
> >>
> >> > see email below. reynold suggested i send it to dev instead of user
> >> >
> >> > -- Forwarded message --
> >> > From: Koert Kuipers 
> >> > Date: Mon, Mar 23, 2015 at 4:36 PM
> >> > Subject: hadoop input/output format advanced control
> >> > To: "u...@spark.apache.org" 
> >> >
> >> >
> >> > currently its pretty hard to control the Hadoop Input/Output formats
> used
> >> > in Spark. The conventions seems to be to add extra parameters to all
> >> > methods and then somewhere deep inside the code (for example in
> >> > PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
> >> into
> >> > settings on the Hadoop Configuration object.
> >> >
> >> > for example for compression i see "codec: Option[Class[_ <:
> >> > CompressionCodec]] = None" added to a bunch of methods.
> >> >
> >> > how scalable is this solution really?
> >> >
> >> > for example i need to read from a hadoop dataset and i dont want the
> >> input
> >> > (part) files to get split up. the way to do this is to set
> >> > "mapred.min.split.size". now i dont want to set this at the level of
> the
> >> > SparkContext (which can be done), since i dont want it to apply to
> input
> >> > formats in general. i want it to apply to just this one specific input
> >> > dataset i need to read. which leaves me with no options currently. i
> >> could
> >> > go add yet another input parameter to all the methods
> >> > (SparkContext.textFile, SparkContext.hadoopFile,
> SparkContext.objectFile,
> >> > etc.). but that seems ineffective.
> >> >
> >> > why can we not expose a Map[String, String] or some other generic way
> to
> >> > manipulate settings for hadoop input/output formats? it would require
> >> > adding one more parameter to all methods to deal with hadoop
> input/output
> >> > formats, but after that its done. one parameter to rule them all
> >> >
> >> > then i could do:
> >> > val x = sc.textFile("/some/path", formatSettings =
> >> > Map("mapred.min.split.size" -> "12345"))
> >> >
> >> > or
> >> > rdd.saveAsTextFile("/some/path, formatSettings =
> >> > Map(mapred.output.compress" -> "true",
> "mapred.output.compression.codec"
> >> ->
> >> > "somecodec"))
> >> >
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>

Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell

Yeah - to Nick's point, I think the way to do this is to pass in a
custom conf when you create a Hadoop RDD (that's AFAIK why the conf
field is there). Is there anything you can't do with that feature?

On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
 wrote:
> Imran, on your point to read multiple files together in a partition, is it
> not simpler to use the approach of copy Hadoop conf and set per-RDD
> settings for min split to control the input size per partition, together
> with something like CombineFileInputFormat?
>
> On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid  wrote:
>
>> I think this would be a great addition, I totally agree that you need to be
>> able to set these at a finer context than just the SparkContext.
>>
>> Just to play devil's advocate, though -- the alternative is for you just
>> subclass HadoopRDD yourself, or make a totally new RDD, and then you could
>> expose whatever you need.  Why is this solution better?  IMO the criteria
>> are:
>> (a) common operations
>> (b) error-prone / difficult to implement
>> (c) non-obvious, but important for performance
>>
>> I think this case fits (a) & (c), so I think its still worthwhile.  But its
>> also worth asking whether or not its too difficult for a user to extend
>> HadoopRDD right now.  There have been several cases in the past week where
>> we've suggested that a user should read from hdfs themselves (eg., to read
>> multiple files together in one partition) -- with*out* reusing the code in
>> HadoopRDD, though they would lose things like the metric tracking &
>> preferred locations you get from HadoopRDD.  Does HadoopRDD need to some
>> refactoring to make that easier to do?  Or do we just need a good example?
>>
>> Imran
>>
>> (sorry for hijacking your thread, Koert)
>>
>>
>>
>> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers  wrote:
>>
>> > see email below. reynold suggested i send it to dev instead of user
>> >
>> > -- Forwarded message --
>> > From: Koert Kuipers 
>> > Date: Mon, Mar 23, 2015 at 4:36 PM
>> > Subject: hadoop input/output format advanced control
>> > To: "u...@spark.apache.org" 
>> >
>> >
>> > currently its pretty hard to control the Hadoop Input/Output formats used
>> > in Spark. The conventions seems to be to add extra parameters to all
>> > methods and then somewhere deep inside the code (for example in
>> > PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
>> into
>> > settings on the Hadoop Configuration object.
>> >
>> > for example for compression i see "codec: Option[Class[_ <:
>> > CompressionCodec]] = None" added to a bunch of methods.
>> >
>> > how scalable is this solution really?
>> >
>> > for example i need to read from a hadoop dataset and i dont want the
>> input
>> > (part) files to get split up. the way to do this is to set
>> > "mapred.min.split.size". now i dont want to set this at the level of the
>> > SparkContext (which can be done), since i dont want it to apply to input
>> > formats in general. i want it to apply to just this one specific input
>> > dataset i need to read. which leaves me with no options currently. i
>> could
>> > go add yet another input parameter to all the methods
>> > (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
>> > etc.). but that seems ineffective.
>> >
>> > why can we not expose a Map[String, String] or some other generic way to
>> > manipulate settings for hadoop input/output formats? it would require
>> > adding one more parameter to all methods to deal with hadoop input/output
>> > formats, but after that its done. one parameter to rule them all
>> >
>> > then i could do:
>> > val x = sc.textFile("/some/path", formatSettings =
>> > Map("mapred.min.split.size" -> "12345"))
>> >
>> > or
>> > rdd.saveAsTextFile("/some/path, formatSettings =
>> > Map(mapred.output.compress" -> "true", "mapred.output.compression.codec"
>> ->
>> > "somecodec"))
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath

Imran, on your point to read multiple files together in a partition, is it
not simpler to use the approach of copy Hadoop conf and set per-RDD
settings for min split to control the input size per partition, together
with something like CombineFileInputFormat?

On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid  wrote:

> I think this would be a great addition, I totally agree that you need to be
> able to set these at a finer context than just the SparkContext.
>
> Just to play devil's advocate, though -- the alternative is for you just
> subclass HadoopRDD yourself, or make a totally new RDD, and then you could
> expose whatever you need.  Why is this solution better?  IMO the criteria
> are:
> (a) common operations
> (b) error-prone / difficult to implement
> (c) non-obvious, but important for performance
>
> I think this case fits (a) & (c), so I think its still worthwhile.  But its
> also worth asking whether or not its too difficult for a user to extend
> HadoopRDD right now.  There have been several cases in the past week where
> we've suggested that a user should read from hdfs themselves (eg., to read
> multiple files together in one partition) -- with*out* reusing the code in
> HadoopRDD, though they would lose things like the metric tracking &
> preferred locations you get from HadoopRDD.  Does HadoopRDD need to some
> refactoring to make that easier to do?  Or do we just need a good example?
>
> Imran
>
> (sorry for hijacking your thread, Koert)
>
>
>
> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers  wrote:
>
> > see email below. reynold suggested i send it to dev instead of user
> >
> > ------ Forwarded message ------
> > From: Koert Kuipers 
> > Date: Mon, Mar 23, 2015 at 4:36 PM
> > Subject: hadoop input/output format advanced control
> > To: "u...@spark.apache.org" 
> >
> >
> > currently its pretty hard to control the Hadoop Input/Output formats used
> > in Spark. The conventions seems to be to add extra parameters to all
> > methods and then somewhere deep inside the code (for example in
> > PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
> into
> > settings on the Hadoop Configuration object.
> >
> > for example for compression i see "codec: Option[Class[_ <:
> > CompressionCodec]] = None" added to a bunch of methods.
> >
> > how scalable is this solution really?
> >
> > for example i need to read from a hadoop dataset and i dont want the
> input
> > (part) files to get split up. the way to do this is to set
> > "mapred.min.split.size". now i dont want to set this at the level of the
> > SparkContext (which can be done), since i dont want it to apply to input
> > formats in general. i want it to apply to just this one specific input
> > dataset i need to read. which leaves me with no options currently. i
> could
> > go add yet another input parameter to all the methods
> > (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
> > etc.). but that seems ineffective.
> >
> > why can we not expose a Map[String, String] or some other generic way to
> > manipulate settings for hadoop input/output formats? it would require
> > adding one more parameter to all methods to deal with hadoop input/output
> > formats, but after that its done. one parameter to rule them all
> >
> > then i could do:
> > val x = sc.textFile("/some/path", formatSettings =
> > Map("mapred.min.split.size" -> "12345"))
> >
> > or
> > rdd.saveAsTextFile("/some/path, formatSettings =
> > Map(mapred.output.compress" -> "true", "mapred.output.compression.codec"
> ->
> > "somecodec"))
> >
>

Re: hadoop input/output format advanced control

2015-03-24 Thread Imran Rashid

I think this would be a great addition, I totally agree that you need to be
able to set these at a finer context than just the SparkContext.

Just to play devil's advocate, though -- the alternative is for you just
subclass HadoopRDD yourself, or make a totally new RDD, and then you could
expose whatever you need.  Why is this solution better?  IMO the criteria
are:
(a) common operations
(b) error-prone / difficult to implement
(c) non-obvious, but important for performance

I think this case fits (a) & (c), so I think its still worthwhile.  But its
also worth asking whether or not its too difficult for a user to extend
HadoopRDD right now.  There have been several cases in the past week where
we've suggested that a user should read from hdfs themselves (eg., to read
multiple files together in one partition) -- with*out* reusing the code in
HadoopRDD, though they would lose things like the metric tracking &
preferred locations you get from HadoopRDD.  Does HadoopRDD need to some
refactoring to make that easier to do?  Or do we just need a good example?

Imran

(sorry for hijacking your thread, Koert)

On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers  wrote:

> see email below. reynold suggested i send it to dev instead of user
>
> -- Forwarded message --
> From: Koert Kuipers 
> Date: Mon, Mar 23, 2015 at 4:36 PM
> Subject: hadoop input/output format advanced control
> To: "u...@spark.apache.org" 
>
>
> currently its pretty hard to control the Hadoop Input/Output formats used
> in Spark. The conventions seems to be to add extra parameters to all
> methods and then somewhere deep inside the code (for example in
> PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into
> settings on the Hadoop Configuration object.
>
> for example for compression i see "codec: Option[Class[_ <:
> CompressionCodec]] = None" added to a bunch of methods.
>
> how scalable is this solution really?
>
> for example i need to read from a hadoop dataset and i dont want the input
> (part) files to get split up. the way to do this is to set
> "mapred.min.split.size". now i dont want to set this at the level of the
> SparkContext (which can be done), since i dont want it to apply to input
> formats in general. i want it to apply to just this one specific input
> dataset i need to read. which leaves me with no options currently. i could
> go add yet another input parameter to all the methods
> (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
> etc.). but that seems ineffective.
>
> why can we not expose a Map[String, String] or some other generic way to
> manipulate settings for hadoop input/output formats? it would require
> adding one more parameter to all methods to deal with hadoop input/output
> formats, but after that its done. one parameter to rule them all
>
> then i could do:
> val x = sc.textFile("/some/path", formatSettings =
> Map("mapred.min.split.size" -> "12345"))
>
> or
> rdd.saveAsTextFile("/some/path, formatSettings =
> Map(mapred.output.compress" -> "true", "mapred.output.compression.codec" ->
> "somecodec"))
>

Fwd: hadoop input/output format advanced control

2015-03-23 Thread Koert Kuipers

see email below. reynold suggested i send it to dev instead of user

-- Forwarded message --
From: Koert Kuipers 
Date: Mon, Mar 23, 2015 at 4:36 PM
Subject: hadoop input/output format advanced control
To: "u...@spark.apache.org" 

currently its pretty hard to control the Hadoop Input/Output formats used
in Spark. The conventions seems to be to add extra parameters to all
methods and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into
settings on the Hadoop Configuration object.

for example for compression i see "codec: Option[Class[_ <:
CompressionCodec]] = None" added to a bunch of methods.

how scalable is this solution really?

for example i need to read from a hadoop dataset and i dont want the input
(part) files to get split up. the way to do this is to set
"mapred.min.split.size". now i dont want to set this at the level of the
SparkContext (which can be done), since i dont want it to apply to input
formats in general. i want it to apply to just this one specific input
dataset i need to read. which leaves me with no options currently. i could
go add yet another input parameter to all the methods
(SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
etc.). but that seems ineffective.

why can we not expose a Map[String, String] or some other generic way to
manipulate settings for hadoop input/output formats? it would require
adding one more parameter to all methods to deal with hadoop input/output
formats, but after that its done. one parameter to rule them all

then i could do:
val x = sc.textFile("/some/path", formatSettings =
Map("mapred.min.split.size" -> "12345"))

or
rdd.saveAsTextFile("/some/path, formatSettings =
Map(mapred.output.compress" -> "true", "mapred.output.compression.codec" ->
"somecodec"))

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Re: hadoop input/output format advanced control

Fwd: hadoop input/output format advanced control

14 matches

Site Navigation

Mail list logo

Footer information