I think you should also just be able to provide an input format that never
splits the input data.  This has come up on the list before, but I couldn't
find the old thread.*

I think this should work, but I can't try it out at the moment.  Can you
please try it and let us know if it works?

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class TextFormatNoSplits extends TextInputFormat {
  override def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

def textFileNoSplits(sc: SparkContext, path: String): RDD[String] = {
  // note: this is just a copy of sc.textFile, with a different InputFormat class
  sc.hadoopFile(path, classOf[TextFormatNoSplits], classOf[LongWritable],
    classOf[Text]).map(pair => pair._2.toString).setName(path)
}

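Also, on the original objectFile question: since the second argument is only
a *minimum* number of partitions, you need to coalesce after the read to
actually get down to 8, as Sean said.  A rough sketch (also untested, and
assuming the data is small enough that 8 partitions makes sense):

import org.apache.spark.mllib.regression.LabeledPoint

val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir")
  .coalesce(8)  // merges the existing partitions down to 8; no shuffle by default
println(rdd2.partitions.length)  // should now print 8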

* Yes, I realize the irony, given the recent discussion about mailing list
vs. Stack Overflow ...

On Thu, Jan 22, 2015 at 11:01 AM, Sean Owen <so...@cloudera.com> wrote:

> Yes, that second argument is what I was referring to, but yes it's a
> *minimum*, oops, right. OK, you will want to coalesce then, indeed.
>
> On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV)
> <ningjun.w...@lexisnexis.com> wrote:
> > If you know that this number is too high you can request a number of
> > partitions when you read it.
> >
> >
> >
> > How to do that? Can you give a code snippet? I want to read it into 8
> > partitions, so I do
> >
> >
> >
> > val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
> >
> > However, rdd2 contains thousands of partitions instead of 8.
> >
>