I think you should also be able to just provide an input format that never splits the input data. This has come up on the list before, but I couldn't find the thread.*
I think this should work, but I can't try it out at the moment. Can you please try it and let us know if it works?

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// A TextInputFormat that reports every file as non-splittable,
// so each input file ends up in exactly one partition.
class TextFormatNoSplits extends TextInputFormat {
  override def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

def textFileNoSplits(sc: SparkContext, path: String): RDD[String] = {
  // note: this is just a copy of sc.textFile, with a different InputFormat class
  sc.hadoopFile(path, classOf[TextFormatNoSplits], classOf[LongWritable], classOf[Text])
    .map(pair => pair._2.toString)
    .setName(path)
}

* yes, I realize the irony given the recent discussion about mailing list vs. stackoverflow ...

On Thu, Jan 22, 2015 at 11:01 AM, Sean Owen <so...@cloudera.com> wrote:
> Yes, that second argument is what I was referring to, but yes it's a
> *minimum*, oops, right. OK, you will want to coalesce then, indeed.
>
> On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV)
> <ningjun.w...@lexisnexis.com> wrote:
> > > If you know that this number is too high you can request a number of
> > > partitions when you read it.
> >
> > How to do that? Can you give a code snippet? I want to read it into 8
> > partitions, so I do
> >
> > val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
> >
> > However rdd2 contains thousands of partitions instead of 8 partitions
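
For what it's worth, a rough usage sketch (untested; it assumes a SparkContext named sc, the TextFormatNoSplits / textFileNoSplits definitions above, and a hypothetical directory file:///tmp/mydir):

// untested sketch: one partition per input file, since isSplitable returns false
val rdd = textFileNoSplits(sc, "file:///tmp/mydir")
println(rdd.partitions.length)  // should equal the number of input files

The obvious trade-off is that a single huge file also becomes a single partition, so this only helps when the individual files are reasonably sized.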
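
On the quoted question: since the second argument to sc.objectFile is only a *minimum* number of partitions, the way to end up with exactly 8 is to coalesce after reading, as Sean suggested. A minimal sketch (untested; assumes MLlib's LabeledPoint and the same path as in the quoted message):

import org.apache.spark.mllib.regression.LabeledPoint

// minPartitions is a lower bound, so this read may still yield many partitions
val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
// coalesce without a shuffle merges them down to at most 8 partitions
val rdd8 = rdd2.coalesce(8)
println(rdd8.partitions.length)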