Hello! Can both methods be compared in terms of performance? I tried the pull request and it felt slow compared to manual mapping.
Cheers, Jonathan

On Mon, Jul 27, 2015, 8:51 PM Reynold Xin <r...@databricks.com> wrote:
> There is this pull request: https://github.com/apache/spark/pull/5713
>
> We mean to merge it for 1.5. Maybe you can help review it too?
>
> On Mon, Jul 27, 2015 at 11:23 AM, Vyacheslav Baranov <slavik.bara...@gmail.com> wrote:
>> Hi all,
>>
>> For now it's possible to convert an RDD of a case class to a DataFrame:
>>
>> case class Person(name: String, age: Int)
>>
>> val people: RDD[Person] = ...
>> val df = sqlContext.createDataFrame(people)
>>
>> but the backward conversion is not possible with the existing API, so
>> currently the code looks like this (example from the documentation):
>>
>> teenagers.map(t => "Name: " + t.getAs[String]("name"))
>>
>> whereas it would be much more convenient to use an RDD of the case class:
>>
>> teenagers.rdd[Person].map("Name: " + _.name)
>>
>> I've implemented a proof-of-concept library that converts a DataFrame
>> to a typed RDD using the "Pimp my library" pattern. It adds some type
>> safety (the conversion fails before the distributed operation runs if
>> some fields have incompatible types), and it's much more convenient
>> when working with nested rows, for example:
>>
>> case class Room(number: Int, visitors: Seq[Person])
>>
>> roomsDf.explode[Seq[Row], Person]("visitors", "visitor")(_.map(rowToPerson))
>>
>> Would the community be interested in having this functionality in core?
>>
>> Regards,
>> Vyacheslav
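For anyone unfamiliar with the "Pimp my library" pattern mentioned above, here is a minimal, Spark-free sketch of the idea: an implicit class enriches an existing type with a typed conversion method that fails fast on incompatible fields. `Row` is modelled as a plain `Map` and the "DataFrame" as a `Seq` purely for illustration; the method name `toTyped` is hypothetical, not part of the proposed API.

```scala
case class Person(name: String, age: Int)

object TypedConversionSketch {
  // Stand-in for Spark's Row: a column-name -> value mapping.
  type Row = Map[String, Any]

  // The enrichment: any Seq[Row] gains a typed conversion method.
  implicit class TypedOps(val rows: Seq[Row]) extends AnyVal {
    // Pattern matches each field, so a missing or mistyped column
    // throws immediately rather than deep inside a transformation.
    def toTyped: Seq[Person] = rows.map { r =>
      Person(
        r("name") match { case s: String => s },
        r("age") match { case i: Int => i }
      )
    }
  }

  def main(args: Array[String]): Unit = {
    val df: Seq[Row] = Seq(
      Map("name" -> "Alice", "age" -> 14),
      Map("name" -> "Bob", "age" -> 17)
    )

    // Usage mirrors the teenagers example from the thread.
    val teenagers = df.toTyped.filter(p => p.age >= 13 && p.age <= 19)
    println(teenagers.map("Name: " + _.name))
  }
}
```

The same enrichment style is how the proof of concept would hang an `rdd[Person]`-like method off `DataFrame` without touching the Spark source.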