Yeah - to Nick's point, I think the way to do this is to pass in a custom conf when you create a Hadoop RDD (that's AFAIK why the conf field is there). Is there anything you can't do with that feature?
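For example, something along these lines should work (untested sketch, assuming a live SparkContext "sc" and the old mapred API):

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

  // Per-RDD copy of the Hadoop conf: the override applies only to this
  // one dataset, not to the SparkContext as a whole.
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("mapred.min.split.size", "12345")
  FileInputFormat.setInputPaths(jobConf, "/some/path")

  val lines = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text]).map(_._2.toString)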
On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Imran, on your point about reading multiple files together in one
> partition: is it not simpler to copy the Hadoop conf and set per-RDD
> settings for the minimum split size to control the input size per
> partition, together with something like CombineFileInputFormat?
>
> On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid <iras...@cloudera.com> wrote:
>
>> I think this would be a great addition. I totally agree that you need
>> to be able to set these at a finer granularity than just the
>> SparkContext.
>>
>> Just to play devil's advocate, though -- the alternative is for you to
>> subclass HadoopRDD yourself, or write a totally new RDD, and then you
>> could expose whatever you need. Why is this solution better? IMO the
>> criteria are:
>> (a) common operations
>> (b) error-prone / difficult to implement
>> (c) non-obvious, but important for performance
>>
>> I think this case fits (a) & (c), so I think it's still worthwhile. But
>> it's also worth asking whether it's too difficult for a user to extend
>> HadoopRDD right now. There have been several cases in the past week
>> where we've suggested that a user read from HDFS themselves (e.g., to
>> read multiple files together in one partition) *without* reusing the
>> code in HadoopRDD, though that means losing things like the metric
>> tracking and preferred locations you get from HadoopRDD. Does HadoopRDD
>> need some refactoring to make that easier to do? Or do we just need a
>> good example?
>>
>> Imran
>>
>> (sorry for hijacking your thread, Koert)
>>
>> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> > See email below. Reynold suggested I send it to dev instead of user.
>> >
>> > ---------- Forwarded message ----------
>> > From: Koert Kuipers <ko...@tresata.com>
>> > Date: Mon, Mar 23, 2015 at 4:36 PM
>> > Subject: hadoop input/output format advanced control
>> > To: "u...@spark.apache.org" <u...@spark.apache.org>
>> >
>> > Currently it's pretty hard to control the Hadoop input/output formats
>> > used in Spark. The convention seems to be to add extra parameters to
>> > all the relevant methods, and then somewhere deep inside the code (for
>> > example in PairRDDFunctions.saveAsHadoopFile) all these parameters get
>> > translated into settings on the Hadoop Configuration object.
>> >
>> > For example, for compression I see "codec: Option[Class[_ <:
>> > CompressionCodec]] = None" added to a bunch of methods.
>> >
>> > How scalable is this solution, really?
>> >
>> > For example, I need to read from a Hadoop dataset and I don't want the
>> > input (part) files to get split up. The way to do this is to set
>> > "mapred.min.split.size". Now, I don't want to set this at the level of
>> > the SparkContext (which can be done), since I don't want it to apply
>> > to input formats in general; I want it to apply to just this one
>> > specific input dataset I need to read. That leaves me with no options
>> > currently. I could go add yet another input parameter to all the
>> > methods (SparkContext.textFile, SparkContext.hadoopFile,
>> > SparkContext.objectFile, etc.), but that seems ineffective.
>> >
>> > Why can we not expose a Map[String, String] or some other generic way
>> > to manipulate settings for Hadoop input/output formats? It would
>> > require adding one more parameter to all the methods that deal with
>> > Hadoop input/output formats, but after that it's done. One parameter
>> > to rule them all....
>> >
>> > Then I could do:
>> >
>> > val x = sc.textFile("/some/path", formatSettings =
>> >   Map("mapred.min.split.size" -> "12345"))
>> >
>> > or:
>> >
>> > rdd.saveAsTextFile("/some/path", formatSettings =
>> >   Map("mapred.output.compress" -> "true",
>> >       "mapred.output.compression.codec" -> "somecodec"))
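FWIW, Koert's proposed formatSettings can be approximated on the user side today. Below is a hypothetical sketch (textFileWithSettings is not existing Spark API) that applies a generic settings map to a per-RDD copy of the Hadoop conf:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Hypothetical helper: apply a Map[String, String] of format settings
  // to a copy of the Hadoop conf, leaving sc.hadoopConfiguration untouched.
  def textFileWithSettings(sc: SparkContext, path: String,
      formatSettings: Map[String, String]): RDD[String] = {
    val conf = new Configuration(sc.hadoopConfiguration)
    formatSettings.foreach { case (k, v) => conf.set(k, v) }
    sc.newAPIHadoopFile(path, classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)
  }

  // Usage, mirroring the example above:
  // val x = textFileWithSettings(sc, "/some/path",
  //   Map("mapred.min.split.size" -> "12345"))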