If it is a small collection of Row objects on the driver, you can just use
sc.parallelize to create an RDD[Row].
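
For example, a rough sketch (assuming sc is the active SparkContext, as in
spark-shell; the field values and schema are just placeholders):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Small collection of Rows already sitting on the driver.
val localRows: Seq[Row] = Seq(Row("alice", 1), Row("bob", 2))

// parallelize ships the local collection to the cluster as an RDD[Row],
// which is what buildScan needs to return.
val rowRdd: RDD[Row] = sc.parallelize(localRows)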


On Tue, Jan 13, 2015 at 7:56 AM, Malith Dhanushka <mmali...@gmail.com>
wrote:

> Hi Reynold,
>
> Thanks for the response. I am just wondering: let's say we have a set of
> Row objects. Isn't there a straightforward way of creating an RDD[Row] out
> of them without writing a custom RDD?
>
> i.e., a utility method
>
> Thanks
> Malith
>
> On Tue, Jan 13, 2015 at 2:29 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> It depends on what the other side is doing. You can create your own RDD
>> implementation by subclassing RDD, or it may be enough to use
>> sc.parallelize(1 to n, n).mapPartitionsWithIndex( /* code to read the data
>> and return an iterator */ ), where n is the number of partitions.
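>>
>> For instance, a minimal sketch of that second approach (the
>> fetchRowsForPartition helper is hypothetical, standing in for whatever
>> reads one partition's worth of data from the underlying source):
>>
>> import org.apache.spark.rdd.RDD
>> import org.apache.spark.sql.Row
>>
>> val n = 4  // number of partitions
>>
>> // Hypothetical per-partition reader; replace with real source access.
>> // It must return an Iterator[Row].
>> def fetchRowsForPartition(index: Int): Iterator[Row] =
>>   Iterator(Row(s"partition-$index"))
>>
>> // One dummy element per partition; each task then reads its own slice.
>> val rowRdd: RDD[Row] = sc.parallelize(1 to n, n).mapPartitionsWithIndex {
>>   (index, _) => fetchRowsForPartition(index)
>> }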
>>
>> On Tue, Jan 13, 2015 at 12:51 AM, Niranda Perera <
>> niranda.per...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have a custom data sources API, which connects to various data sources
>>> and exposes them through a common API. We are now trying to implement the
>>> Spark data sources API released in 1.2.0 to connect those sources to Spark
>>> for analytics.
>>>
>>> Looking at the sources API, we figured out that we should extend a scan
>>> class (TableScan, etc.). While doing so, we would have to implement the
>>> 'schema' and 'buildScan' methods.
>>>
>>> Say we can infer the schema of the underlying data and extract the data as
>>> Row elements. Is there any way we could create the RDD[Row] (needed by the
>>> buildScan method) from these Row elements?
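>>>
>>> (For illustration, roughly the shape described above; the relation name,
>>> schema, and placeholder rows are made up, and the imports assume the 1.2
>>> layout where the type classes are exposed under org.apache.spark.sql.)
>>>
>>> import org.apache.spark.rdd.RDD
>>> import org.apache.spark.sql._
>>> import org.apache.spark.sql.sources.TableScan
>>>
>>> case class CustomRelation(sqlContext: SQLContext) extends TableScan {
>>>   // Schema inferred from the underlying data source (placeholder).
>>>   override def schema: StructType =
>>>     StructType(Seq(StructField("value", StringType, nullable = true)))
>>>
>>>   // This is where an RDD[Row] has to be produced from the Row elements.
>>>   override def buildScan(): RDD[Row] =
>>>     sqlContext.sparkContext.parallelize(Seq(Row("a"), Row("b")))
>>> }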
>>>
>>> Cheers
>>> --
>>> Niranda
>>>
>>
>>
>
>
