To do that, first key the RDD by customerId, and then use saveAsHadoopFile like this:
We can call saveAsHadoopFile(location, classOf[KeyClass], classOf[ValueClass], classOf[PartitionOutputFormat]), where PartitionOutputFormat extends MultipleTextOutputFormat. A sample is below:

  class PartitionOutputFormat extends MultipleTextOutputFormat[Any, Any] {
    override def generateActualKey(key: Any, value: Any): Any = {
      // Add logic here if you want to derive the written key from the key and value
      // (return NullWritable.get() to drop the key from the output lines).
    }
    override def generateFileNameForKeyValue(key: Any, value: Any, basePath: String): String = {
      // Add logic to generate the file name from the key and value. Generally we
      // take basePath and append the key to it, so each key gets its own file.
    }
  }

On Wed, Sep 7, 2016 at 10:58 AM, Vikash Kumar <vikashsp...@gmail.com> wrote:
> I need to split an RDD[key, Iterable[Value]] to save each key into a
> different file.
>
> e.g. I have records like: customerId, name, age, sex
>
> 111,abc,34,M
> 122, xyz,32,F
> 111,def,31,F
> 122.trp,30,F
> 133,jkl,35,M
>
> I need to write 3 different files based on customerId
> file1:
> 111,abc,34,M
> 111,def,31,F
>
> file2:
> 122, xyz,32,F
> 122.trp,30,F
>
> file3:
> 133,jkl,35,M
>
> How can I achieve this in Spark Scala code?
>
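Putting the pieces together, a minimal end-to-end sketch might look like the following. The input/output paths, the class name PartitionByCustomerFormat, and the assumption that customerId is the first CSV column are all illustrative, not from the original post; this also uses the old mapred API that MultipleTextOutputFormat belongs to.

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Routes each record into a file named after its key (the customerId).
class PartitionByCustomerFormat extends MultipleTextOutputFormat[Any, Any] {
  // Drop the key from the written lines; only the value (the full record) is emitted.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // The returned string is the file name (relative to the output directory),
  // so every distinct customerId ends up in its own file.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
}

object SplitByCustomer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("split-by-customer"))

    // Key each line by customerId, assumed to be the first comma-separated field.
    val keyed = sc.textFile("/input/customers.csv")
      .map(line => (line.split(",")(0).trim, line))

    keyed.saveAsHadoopFile(
      "/output/by-customer",
      classOf[String],
      classOf[String],
      classOf[PartitionByCustomerFormat])

    sc.stop()
  }
}
```

With the sample data this would produce files named 111, 122, and 133 under /output/by-customer, each holding only that customer's records. Note that the key appears once per record inside the line itself, since we keep the whole line as the value.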