My data is in S3 and is indexed in DynamoDB. For example, if I want to load
data for a given time range, I first need to query DynamoDB for the S3 file
keys covering that range and then load those files in Spark. The files are
not always under the same S3 path prefix, so
sc.textFile("s3://directory_path/") won't
work. I am looking for pointers on how to implement something analogous to
HadoopRDD or JdbcRDD, but in Java.
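
To make that concrete, here is a rough sketch of the driver-side lookup I
have in mind, assuming the AWS SDK for Java with key condition expressions.
The table name ("file-index") and the attribute names ("shard", "eventTime",
"s3key") are made up for illustration:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.model.AttributeValue;
    import com.amazonaws.services.dynamodbv2.model.QueryRequest;
    import com.amazonaws.services.dynamodbv2.model.QueryResult;

    public class DynamoFileIndex {
        // Returns the S3 keys indexed for [from, to] under one hash key ("shard").
        public static List<String> s3KeysForRange(String shard, long from, long to) {
            AmazonDynamoDBClient dynamo = new AmazonDynamoDBClient(); // default credential chain

            Map<String, AttributeValue> values = new HashMap<String, AttributeValue>();
            values.put(":s", new AttributeValue().withS(shard));
            values.put(":from", new AttributeValue().withN(Long.toString(from)));
            values.put(":to", new AttributeValue().withN(Long.toString(to)));

            QueryRequest request = new QueryRequest()
                .withTableName("file-index")
                .withKeyConditionExpression("shard = :s AND eventTime BETWEEN :from AND :to")
                .withExpressionAttributeValues(values);

            List<String> keys = new ArrayList<String>();
            QueryResult result;
            do {
                result = dynamo.query(request);
                for (Map<String, AttributeValue> item : result.getItems()) {
                    keys.add(item.get("s3key").getS());
                }
                // Query responses are capped at 1 MB, so page until done.
                request.setExclusiveStartKey(result.getLastEvaluatedKey());
            } while (result.getLastEvaluatedKey() != null);
            return keys;
        }
    }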

I am looking to do something similar to what they have done here:
https://github.com/lagerspetz/TimeSeriesSpark/blob/master/src/spark/timeseries/dynamodb/DynamoDbRDD.scala.
That RDD reads its data directly from DynamoDB; my custom RDD would instead
query DynamoDB for the S3 file keys and then load the files from S3.
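
In case it helps, this is the shape I imagine the Java version taking: one
partition per S3 key, with compute() streaming that object's lines on the
executor. This is only a sketch against the Spark 1.x RDD API (with Java 8
and the AWS SDK for Java); the class names are made up, and error handling
and stream cleanup are omitted:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import org.apache.spark.Dependency;
    import org.apache.spark.Partition;
    import org.apache.spark.SparkContext;
    import org.apache.spark.TaskContext;
    import org.apache.spark.rdd.RDD;

    import com.amazonaws.services.s3.AmazonS3Client;

    import scala.collection.Iterator;
    import scala.collection.JavaConverters;
    import scala.reflect.ClassTag$;

    public class DynamoIndexedS3RDD extends RDD<String> {

        // One partition per S3 object returned by the Dynamo index query.
        static class S3KeyPartition implements Partition {
            final int idx;
            final String bucket;
            final String key;

            S3KeyPartition(int idx, String bucket, String key) {
                this.idx = idx;
                this.bucket = bucket;
                this.key = key;
            }

            @Override
            public int index() { return idx; }
        }

        private final String bucket;
        private final List<String> s3Keys; // looked up from DynamoDB on the driver

        public DynamoIndexedS3RDD(SparkContext sc, String bucket, List<String> s3Keys) {
            // No parent RDDs, so the dependency Seq is empty; the ClassTag is the
            // evidence parameter the Scala compiler would normally supply.
            super(sc,
                  scala.collection.immutable.List$.MODULE$.<Dependency<?>>empty(),
                  ClassTag$.MODULE$.<String>apply(String.class));
            this.bucket = bucket;
            this.s3Keys = s3Keys;
        }

        @Override
        public Partition[] getPartitions() {
            Partition[] parts = new Partition[s3Keys.size()];
            for (int i = 0; i < s3Keys.size(); i++) {
                parts[i] = new S3KeyPartition(i, bucket, s3Keys.get(i));
            }
            return parts;
        }

        @Override
        public Iterator<String> compute(Partition split, TaskContext context) {
            S3KeyPartition p = (S3KeyPartition) split;
            // Runs on the executor: stream the object and expose its lines.
            // NOTE: closing the stream (e.g. from a task completion callback)
            // is omitted in this sketch.
            AmazonS3Client s3 = new AmazonS3Client(); // default credential chain
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                s3.getObject(p.bucket, p.key).getObjectContent(),
                StandardCharsets.UTF_8));
            return JavaConverters.asScalaIteratorConverter(
                reader.lines().iterator()).asScala();
        }
    }

On the driver I would then query Dynamo for the keys and do something like
new DynamoIndexedS3RDD(sc, bucket, keys).toJavaRDD() to get back into the
Java API. Does extending RDD from Java like this look reasonable?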

On Mon, May 25, 2015 at 8:19 PM, Alex Robbins <alexander.j.robb...@gmail.com> wrote:

> If a Hadoop InputFormat already exists for your data source, you can load
> it from there. Otherwise, maybe you can dump your data source out as text
> and load it from there. Without more detail on what your data source is,
> it'll be hard for anyone to help.
>
> On Mon, May 25, 2015 at 5:00 PM, swaranga <sarma.swara...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I have a custom data source and I want to load the data into Spark to
>> perform some computations. For this I see that I might need to implement a
>> new RDD for my data source.
>>
>> I am a complete Scala noob and I am hoping that I can implement the RDD in
>> Java only. I looked around the internet and could not find any resources.
>> Any pointers?
>>
>>
>>
>


-- 
Sent from my Lumia thumb-typed with errors.
