I think you should also just be able to provide an input format that never
splits the input data. This has come up before on the list, but I couldn't
find it.*
I think this should work, but I can't try it out at the moment. Can you
please try it and let us know if it works?
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class TextFormatNoSplits extends TextInputFormat {
  override def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

def textFileNoSplits(sc: SparkContext, path: String): RDD[String] = {
  // note this is just a copy of sc.textFile, with a different InputFormatClass
  sc.hadoopFile(path, classOf[TextFormatNoSplits], classOf[LongWritable],
    classOf[Text]).map(pair => pair._2.toString).setName(path)
}
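Again, untested, but usage would look something like this (the path here is
just a placeholder); with splitting disabled, each input file should come back
as exactly one partition:

```scala
// Untested sketch: read with the no-split format and check the partition count.
val rdd = textFileNoSplits(sc, "file:///tmp/mydir")
// With isSplitable = false, expect one partition per input file.
println(rdd.partitions.length)
```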
* yes I realize the irony given the recent discussion about mailing list
vs. stackoverflow ...
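In case it helps, the coalesce Sean suggests below would look roughly like
this (untested; the path is a placeholder, and note the second argument to
objectFile is only a *minimum* partition count):

```scala
// Untested sketch: read the object file, then coalesce down to 8 partitions.
// coalesce(8) merges existing partitions without a shuffle.
val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir")
val rdd8 = rdd2.coalesce(8)
```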
On Thu, Jan 22, 2015 at 11:01 AM, Sean Owen <[email protected]> wrote:
> Yes, that second argument is what I was referring to, but yes it's a
> *minimum*, oops, right. OK, you will want to coalesce then, indeed.
>
> On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV)
> <[email protected]> wrote:
> > If you know that this number is too high you can request a number of
> > partitions when you read it.
> >
> >
> >
> > How to do that? Can you give a code snippet? I want to read it into 8
> > partitions, so I do
> >
> >
> >
> > val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
> >
> > However rdd2 contains thousands of partitions instead of 8 partitions
> >
>