My personal preference would be something like a Map[String, String] that
only reflects the changes you want to make to the Configuration for the given
input/output format (so system-wide defaults continue to come from
sc.hadoopConfiguration), similar to what Cascading/Scalding did, but am
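For concreteness, a minimal sketch of that overlay idea (the helper name
and signature are hypothetical, not an existing Spark API):

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SparkContext

    // Hypothetical helper: apply per-format overrides on top of the
    // system-wide defaults without mutating sc.hadoopConfiguration itself.
    def confWithOverrides(sc: SparkContext,
                          overrides: Map[String, String]): Configuration = {
      val conf = new Configuration(sc.hadoopConfiguration) // copy defaults
      overrides.foreach { case (k, v) => conf.set(k, v) }
      conf
    }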
Yeah, I agree that might have been nicer, but for consistency with the
input APIs maybe we should do the same thing. We can also give an
example of how to clone sc.hadoopConfiguration and then set
some new values:
// Configuration.set returns Unit, so clone via the copy constructor
// and call set separately:
val conf = new org.apache.hadoop.conf.Configuration(sc.hadoopConfiguration)
conf.set(k1, v1)
conf.set(k2, v2)
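The cloned conf can then be handed to one of the input APIs that accept a
Configuration, e.g. (path and types chosen just for illustration):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // The per-RDD conf rides along with this one input;
    // sc.hadoopConfiguration itself stays untouched.
    val lines = sc.newAPIHadoopFile(
      "/data/input",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf)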
Regarding Patrick's question, you can just do new Configuration(oldConf)
to get a cloned Configuration object and add any new properties to it.
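A quick illustration that the copy really is independent (the property name
here is arbitrary):

    import org.apache.hadoop.conf.Configuration

    val cloned = new Configuration(sc.hadoopConfiguration) // copies properties
    cloned.set("io.file.buffer.size", "65536")
    // sc.hadoopConfiguration still returns its old value for this key;
    // mutations on the clone do not leak back into the shared defaults.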
-Sandy
On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote:
Hi Nick,
I don't remember the exact details of these scenarios, but
Great - that's even easier. Maybe we could have a simple example in the doc.
On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Regarding Patrick's question, you can just do new Configuration(oldConf)
to get a cloned Configuration object and add any new properties to it.
Should we mention that you should synchronize
on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK to avoid a possible race
condition in cloning Hadoop Configuration objects prior to Hadoop 2.7.0? :)
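(That lock is Spark-internal, so code outside Spark can approximate the same
guard with its own lock object; a rough sketch:)

    import org.apache.hadoop.conf.Configuration

    // Before Hadoop 2.7.0, the Configuration constructor touched shared
    // static state, so concurrent new Configuration(conf) calls could race
    // (see HADOOP-10456). Cloning under a single lock avoids that.
    object ConfCloneLock

    def safeClone(conf: Configuration): Configuration =
      ConfCloneLock.synchronized {
        new Configuration(conf)
      }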
On Wed, Mar 25, 2015 at 7:16 PM, Patrick Wendell pwend...@gmail.com wrote:
Great - that's even easier.
I would like to use objectFile with some tweaks to the Hadoop conf.
Currently there is no way to do that, except by recreating objectFile myself,
and some of the code objectFile uses I have no access to, since it's private
to Spark.
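To make that concrete: objectFile is essentially a sequence file of
Java-serialized Array[T] chunks, so a variant taking a custom conf can be
sketched on top of the public hadoopRDD API (the helper is hypothetical and
mirrors, rather than reuses, Spark's private code):

    import java.io.{ByteArrayInputStream, ObjectInputStream}
    import scala.reflect.ClassTag
    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf,
      SequenceFileInputFormat}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical objectFile that accepts a caller-supplied JobConf.
    def objectFileWithConf[T: ClassTag](sc: SparkContext, path: String,
                                        conf: JobConf): RDD[T] = {
      FileInputFormat.setInputPaths(conf, path)
      sc.hadoopRDD(conf,
          classOf[SequenceFileInputFormat[NullWritable, BytesWritable]],
          classOf[NullWritable], classOf[BytesWritable])
        .flatMap { case (_, bytes) =>
          // Each record is one Java-serialized Array[T]; respect getLength,
          // since BytesWritable's backing buffer may be padded.
          val in = new ObjectInputStream(
            new ByteArrayInputStream(bytes.getBytes, 0, bytes.getLength))
          in.readObject().asInstanceOf[Array[T]]
        }
    }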
On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell pwend...@gmail.com wrote:
I think this would be a great addition; I totally agree that you need to be
able to set these at a finer granularity than just the SparkContext.
Just to play devil's advocate, though -- the alternative is for you to just
subclass HadoopRDD yourself, or make a totally new RDD, and then you could
expose
Imran, on your point about reading multiple files together in a partition, is
it not simpler to copy the Hadoop conf and set per-RDD min-split settings to
control the input size per partition, together with something like
CombineFileInputFormat?
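A sketch of that combination (path and size are made up): clone the conf, cap
the combined split size, and read with the new-API CombineTextInputFormat so
many small files pack into fewer, size-bounded partitions:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    // Upper bound on bytes packed into one combined split (128 MB here);
    // CombineFileInputFormat reads this key when computing splits.
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
      128L * 1024 * 1024)

    val combined = sc.newAPIHadoopFile(
      "/data/many-small-files",
      classOf[CombineTextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf)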
On Tue, Mar 24, 2015 at 5:28 PM,
Yeah - to Nick's point, I think the way to do this is to pass in a
custom conf when you create a Hadoop RDD (that's AFAIK why the conf
field is there). Is there anything you can't do with that feature?
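That pattern looks roughly like this (input format and path chosen just for
illustration):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    // Per-RDD conf: start from the system-wide defaults, override locally.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("mapreduce.input.fileinputformat.split.minsize",
      (64L * 1024 * 1024).toString)
    FileInputFormat.setInputPaths(jobConf, "/data/input")

    val rdd = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])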
On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
nick.pentre...@gmail.com wrote:
Imran, on