Re: hadoop input/output format advanced control

2015-03-25 Thread Koert Kuipers
my personal preference would be something like a Map[String, String] that only reflects the changes you want to make to the Configuration for the given input/output format (so system-wide defaults continue to come from sc.hadoopConfiguration), similar to what cascading/scalding did, but am

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Yeah I agree that might have been nicer, but I think for consistency with the input APIs maybe we should do the same thing. We can also give an example of how to clone sc.hadoopConfiguration and then set some new values: val conf = sc.hadoopConfiguration.clone() .set(k1, v1) .set(k2, v2)

Re: hadoop input/output format advanced control

2015-03-25 Thread Sandy Ryza
Regarding Patrick's question, you can just do new Configuration(oldConf) to get a cloned Configuration object and add any new properties to it. -Sandy On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote: Hi Nick, I don't remember the exact details of these scenarios, but
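Sandy's copy-constructor suggestion can be sketched as below (requires hadoop-common on the classpath; the property name and values are illustrative, not prescribed by the thread):

```scala
import org.apache.hadoop.conf.Configuration

// Copy-constructor clone: the new Configuration starts with all of
// oldConf's properties, and later set() calls do not touch oldConf.
val oldConf = new Configuration()
oldConf.set("shared.key", "shared-value")

val conf = new Configuration(oldConf)
// Override only what this particular job needs (illustrative key/value).
conf.set("mapreduce.input.fileinputformat.split.minsize", "134217728")

// The original configuration is left untouched.
assert(oldConf.get("mapreduce.input.fileinputformat.split.minsize") == null)
assert(conf.get("shared.key") == "shared-value")
```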

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Great - that's even easier. Maybe we could have a simple example in the doc. On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Regarding Patrick's question, you can just do new Configuration(oldConf) to get a cloned Configuration object and add any new properties to it.

Re: hadoop input/output format advanced control

2015-03-25 Thread Aaron Davidson
Should we mention that you should synchronize on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK to avoid a possible race condition in cloning Hadoop Configuration objects prior to Hadoop 2.7.0? :) On Wed, Mar 25, 2015 at 7:16 PM, Patrick Wendell pwend...@gmail.com wrote: Great - that's even easier.
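Aaron's caveat refers to the fact that, before Hadoop 2.7.0, Configuration's copy constructor read shared static state without synchronization, so concurrent clones could race. In Spark the guarding lock lives in the `private[spark]` HadoopRDD companion object, so user code cannot normally reference it; the sketch below shows the pattern with a stand-in lock object:

```scala
import org.apache.hadoop.conf.Configuration

// Stand-in for HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK, which is
// private[spark] and not reachable from user code.
object ConfCloneLock

def cloneConf(oldConf: Configuration): Configuration =
  // All code paths that clone a Configuration must agree on a single
  // lock for the synchronization to actually prevent the race.
  ConfCloneLock.synchronized {
    new Configuration(oldConf)
  }
```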

Re: hadoop input/output format advanced control

2015-03-24 Thread Koert Kuipers
i would like to use objectFile with some tweaks to the hadoop conf. currently there is no way to do that, except recreating objectFile myself. and some of the code objectFile uses i have no access to, since it's private to spark. On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell pwend...@gmail.com

Re: hadoop input/output format advanced control

2015-03-24 Thread Imran Rashid
I think this would be a great addition, I totally agree that you need to be able to set these at a finer context than just the SparkContext. Just to play devil's advocate, though -- the alternative is for you to just subclass HadoopRDD yourself, or make a totally new RDD, and then you could expose

Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
Imran, on your point about reading multiple files together in a partition, is it not simpler to copy the Hadoop conf and set per-RDD settings for min split to control the input size per partition, together with something like CombineFileInputFormat? On Tue, Mar 24, 2015 at 5:28 PM,

Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell
Yeah - to Nick's point, I think the way to do this is to pass in a custom conf when you create a Hadoop RDD (that's AFAIK why the conf field is there). Is there anything you can't do with that feature? On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Imran, on
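The per-RDD conf parameter Patrick mentions is the `conf` argument on SparkContext's Hadoop input methods. A minimal sketch, assuming a hypothetical HDFS path and an illustrative split-size override:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Clone rather than mutate sc.hadoopConfiguration, so the override
// applies only to this RDD and not to every job on the context.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("mapreduce.input.fileinputformat.split.minsize", "134217728")

// newAPIHadoopFile accepts the per-RDD Configuration as its last argument.
val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/input",        // hypothetical path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)
```

This covers the input side; as Koert notes upthread, convenience wrappers like objectFile do not currently expose such a hook.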