Should we mention that you should synchronize on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK to avoid a possible race condition in cloning Hadoop Configuration objects prior to Hadoop 2.7.0? :)
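For reference, a rough sketch of that pattern: Configuration's constructor is not thread-safe before Hadoop 2.7.0 (HADOOP-10456), so every clone has to be made under one shared lock. Spark synchronizes on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK internally, but if I recall correctly that lock lives in a private[spark] object, so code outside Spark would need its own lock object (ConfLock below is just a placeholder name):

    import org.apache.hadoop.conf.Configuration

    // shared lock: every thread that clones a Configuration must synchronize
    // on the same object, since the constructor mutates shared static state
    object ConfLock

    def cloneConf(oldConf: Configuration): Configuration =
      ConfLock.synchronized {
        new Configuration(oldConf) // copy constructor; set per-RDD overrides on the clone
      }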
On Wed, Mar 25, 2015 at 7:16 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> Great - that's even easier. Maybe we could have a simple example in the doc.
>
> On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>> Regarding Patrick's question, you can just do "new Configuration(oldConf)"
>> to get a cloned Configuration object and add any new properties to it.
>>
>> -Sandy
>>
>> On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid <iras...@cloudera.com> wrote:
>>> Hi Nick,
>>>
>>> I don't remember the exact details of these scenarios, but I think the
>>> user wanted a lot more control over how the files got grouped into
>>> partitions, to group the files together by some arbitrary function. I
>>> didn't think that was possible w/ CombineFileInputFormat, but maybe
>>> there is a way?
>>>
>>> thanks
>>>
>>> On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>> Imran, on your point about reading multiple files together in a
>>>> partition, is it not simpler to copy the Hadoop conf and set per-RDD
>>>> settings for min split size to control the input size per partition,
>>>> together with something like CombineFileInputFormat?
>>>>
>>>> On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid <iras...@cloudera.com> wrote:
>>>>> I think this would be a great addition; I totally agree that you need
>>>>> to be able to set these at a finer context than just the SparkContext.
>>>>>
>>>>> Just to play devil's advocate, though -- the alternative is for you to
>>>>> just subclass HadoopRDD yourself, or make a totally new RDD, and then
>>>>> you could expose whatever you need. Why is this solution better? IMO
>>>>> the criteria are:
>>>>> (a) common operations
>>>>> (b) error-prone / difficult to implement
>>>>> (c) non-obvious, but important for performance
>>>>>
>>>>> I think this case fits (a) & (c), so I think it's still worthwhile.
>>>>> But it's also worth asking whether or not it's too difficult for a
>>>>> user to extend HadoopRDD right now. There have been several cases in
>>>>> the past week where we've suggested that a user should read from HDFS
>>>>> themselves (e.g., to read multiple files together in one partition) --
>>>>> with*out* reusing the code in HadoopRDD, though they would lose things
>>>>> like the metric tracking & preferred locations you get from HadoopRDD.
>>>>> Does HadoopRDD need some refactoring to make that easier to do? Or do
>>>>> we just need a good example?
>>>>>
>>>>> Imran
>>>>>
>>>>> (sorry for hijacking your thread, Koert)
>>>>>
>>>>> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>> see email below. reynold suggested i send it to dev instead of user
>>>>>>
>>>>>> ---------- Forwarded message ----------
>>>>>> From: Koert Kuipers <ko...@tresata.com>
>>>>>> Date: Mon, Mar 23, 2015 at 4:36 PM
>>>>>> Subject: hadoop input/output format advanced control
>>>>>> To: "u...@spark.apache.org" <u...@spark.apache.org>
>>>>>>
>>>>>> currently it's pretty hard to control the Hadoop Input/Output formats
>>>>>> used in Spark. the convention seems to be to add extra parameters to
>>>>>> all methods and then somewhere deep inside the code (for example in
>>>>>> PairRDDFunctions.saveAsHadoopFile) all these parameters get
>>>>>> translated into settings on the Hadoop Configuration object.
>>>>>>
>>>>>> for example for compression i see "codec: Option[Class[_ <:
>>>>>> CompressionCodec]] = None" added to a bunch of methods.
>>>>>>
>>>>>> how scalable is this solution really?
>>>>>>
>>>>>> for example i need to read from a hadoop dataset and i don't want the
>>>>>> input (part) files to get split up. the way to do this is to set
>>>>>> "mapred.min.split.size". now i don't want to set this at the level of
>>>>>> the SparkContext (which can be done), since i don't want it to apply
>>>>>> to input formats in general. i want it to apply to just this one
>>>>>> specific input dataset i need to read. which leaves me with no
>>>>>> options currently. i could go add yet another input parameter to all
>>>>>> the methods (SparkContext.textFile, SparkContext.hadoopFile,
>>>>>> SparkContext.objectFile, etc.). but that seems ineffective.
>>>>>>
>>>>>> why can we not expose a Map[String, String] or some other generic way
>>>>>> to manipulate settings for hadoop input/output formats? it would
>>>>>> require adding one more parameter to all methods that deal with
>>>>>> hadoop input/output formats, but after that it's done. one parameter
>>>>>> to rule them all....
>>>>>>
>>>>>> then i could do:
>>>>>> val x = sc.textFile("/some/path", formatSettings =
>>>>>>   Map("mapred.min.split.size" -> "12345"))
>>>>>>
>>>>>> or
>>>>>> rdd.saveAsTextFile("/some/path", formatSettings =
>>>>>>   Map("mapred.output.compress" -> "true",
>>>>>>       "mapred.output.compression.codec" -> "somecodec"))
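For anyone landing on this thread later, here is a minimal sketch of the workaround Sandy describes: clone the context-wide Hadoop conf, override settings on the clone so they apply to just this one input, and drop down to sc.hadoopRDD. The path and split size are illustrative, and `sc` is assumed to be an existing SparkContext:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    // clone the SparkContext-wide conf (per the note above, do this under a
    // shared lock on Hadoop < 2.7.0) so the override applies only to this RDD
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("mapred.min.split.size", "12345")

    val jobConf = new JobConf(conf) // the old-API read path takes a JobConf
    FileInputFormat.setInputPaths(jobConf, "/some/path")

    val lines = sc.hadoopRDD(
      jobConf,
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map(_._2.toString) // keep only the line text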