This looks reasonable to me.  As you've done, the important thing is just
to make isSplitable return false.
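
For the write step you describe, I'd expect something roughly like the
sketch below (untested; "writeRecord" is a placeholder for your non-Hadoop
serialization, the output path is made up, and K/V stand in for your
key/value types as in your snippet):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val rdd = sc.hadoopFile(outPathPhase1,
        classOf[NonSplittingSequenceFileInputFormat[K, V]],
        classOf[K], classOf[V], 1)

    rdd.mapPartitionsWithIndex { (idx, iter) =>
      // One output file per input partition, named by partition index.
      val fs = FileSystem.get(new Configuration())
      val out = fs.create(new Path(f"/tmp/repacked/part-$idx%05d"))
      try iter.foreach { case (k, v) => writeRecord(out, k, v) }
      finally out.close()
      Iterator.empty
    }.count()  // the transformation is lazy, so force it to run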

This has a bit in common with the sc.wholeTextFiles method.  It sounds
like you want something much simpler than what that does, but you might
find it a useful source of inspiration.
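For example, it gives you one record per file (the path here is made up):

    // (fileName, fileContents) pairs, one per file under the directory
    val perFile = sc.wholeTextFiles("hdfs:///some/input/dir")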

On Sun, Jan 4, 2015 at 4:30 PM, Michael Albert <
m_albert...@yahoo.com.invalid> wrote:

> Greetings!
>
> I would like to know whether the code below will read the data one
> partition at a time, and whether I am reinventing the wheel.
>
> If I may explain: upstream code has managed (I hope) to save an RDD such
> that each partition file (e.g., part-r-00000, part-r-00001) contains
> exactly the data subset that I would like to repackage into a file of a
> non-Hadoop format.  So what I want to do is something like
> "mapPartitionsWithIndex" on this data (which is stored in sequence files,
> SNAPPY compressed).  However, if I simply open the data set with
> "sequenceFile()", the data is re-partitioned and I lose the partitioning
> I want.  My intention is that, in the closure passed to
> "mapPartitionsWithIndex", I'll open an HDFS file and write the data from
> the partition in my desired format, one output file per input partition.
>
> The code below seems to work, I think.  Have I missed something bad?
> Thanks!
>
> -Mike
>
>     // Force one Spark partition per input file by disabling splits.
>     // (This uses the old "mapred" API; the "mapreduce" API's
>     // SequenceFileInputFormat has the analogous override
>     // isSplitable(context: JobContext, path: Path).)
>     class NonSplittingSequenceFileInputFormat[K, V]
>         extends org.apache.hadoop.mapred.SequenceFileInputFormat[K, V] {
>
>         override def isSplitable(
>             fs: org.apache.hadoop.fs.FileSystem,
>             filename: org.apache.hadoop.fs.Path): Boolean = false
>     }
>
>
> sc.hadoopFile(outPathPhase1,
>     classOf[NonSplittingSequenceFileInputFormat[K, V]],
>     classOf[K],
>     classOf[V],
>     1)  // minPartitions = 1
>