Re: Implementing custom RDD in Java

2015-05-26 Thread Alex Robbins
I know it isn't exactly what you are asking for, but you could solve it
like this:

The driver program queries Dynamo for the S3 file keys, then you sc.textFile
each key and .union the results together to make your RDD.

You could wrap that up in a function and it wouldn't be too painful to
reuse. I don't personally know about creating custom RDDs in Java.
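
Something like this (an untested sketch; it assumes the AWS SDK v1 Dynamo
client, and the table name, key attribute, and bucket are placeholders for
whatever your index actually looks like):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DynamoIndexedLoader {
    public static JavaRDD<String> load(JavaSparkContext sc, AmazonDynamoDBClient dynamo) {
        // Driver side: fetch the S3 keys from the Dynamo index.
        // (A Scan for brevity; a Query against your time-range index is cheaper.)
        List<String> keys = new ArrayList<>();
        for (Map<String, AttributeValue> item :
                dynamo.scan(new ScanRequest().withTableName("file-index")).getItems()) {
            keys.add(item.get("s3key").getS());
        }
        // textFile each key and union them all into a single RDD.
        JavaRDD<String> rdd = sc.emptyRDD();
        for (String key : keys) {
            rdd = rdd.union(sc.textFile("s3n://my-bucket/" + key));
        }
        return rdd;
    }
}

If you have a lot of keys, note that sc.textFile also accepts a
comma-separated list of paths, which avoids building up a long chain of
unions.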

On Mon, May 25, 2015 at 10:37 PM, Swaranga Sarma sarma.swara...@gmail.com
wrote:

 My data is in S3 and is indexed in Dynamo. For example, if I want to load
 data for a given time range, I first need to query Dynamo for the S3 file
 keys for that range and then load those files in Spark. The files may not
 always share the same S3 path prefix, hence
 sc.textFile(s3://directory_path/) won't work. I am looking for pointers on
 how to implement something analogous to HadoopRDD or JdbcRDD, but in Java.

 I am looking to do something similar to what they have done here:
 https://github.com/lagerspetz/TimeSeriesSpark/blob/master/src/spark/timeseries/dynamodb/DynamoDbRDD.scala.
 That one reads data from Dynamo; my custom RDD would query DynamoDB for the
 S3 file keys and then load the files from S3.

 On Mon, May 25, 2015 at 8:19 PM, Alex Robbins 
 alexander.j.robb...@gmail.com wrote:

 If a Hadoop InputFormat already exists for your data source, you can load
 it from there. Otherwise, maybe you can dump your data source out as text
 and load it from there. Without more detail on what your data source is,
 it'll be hard for anyone to help.
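
 As an illustration of that first route, a minimal Java sketch of loading
 through an existing Hadoop InputFormat (TextInputFormat stands in here;
 your source's InputFormat and key/value classes would replace it):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaSparkContext;

 public class InputFormatLoad {
     public static JavaPairRDD<LongWritable, Text> load(JavaSparkContext sc, String path) {
         // Keys are byte offsets and values are lines, as TextInputFormat defines.
         return sc.newAPIHadoopFile(path, TextInputFormat.class,
                 LongWritable.class, Text.class, new Configuration());
     }
 }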

 On Mon, May 25, 2015 at 5:00 PM, swaranga sarma.swara...@gmail.com
 wrote:

 Hello,

 I have a custom data source and I want to load the data into Spark to
 perform some computations. For this I see that I might need to implement a
 new RDD for my data source.

 I am a complete Scala noob and I am hoping that I can implement the RDD in
 Java only. I looked around the internet and could not find any resources.
 Any pointers?




 --
 Sent from my Lumia thumb-typed with errors.



Implementing custom RDD in Java

2015-05-25 Thread Swaranga Sarma
Hello,

I have a custom data source and I want to load the data into Spark to
perform some computations. For this I see that I might need to implement a
new RDD for my data source.

I am a complete Scala noob and I am hoping that I can implement the RDD in
Java only. I looked around the internet and could not find any resources.
Any pointers?

-- 
Sent from my Lumia thumb-typed with errors.


Re: Implementing custom RDD in Java

2015-05-25 Thread Swaranga Sarma
My data is in S3 and is indexed in Dynamo. For example, if I want to load
data for a given time range, I first need to query Dynamo for the S3 file
keys for that range and then load those files in Spark. The files may not
always share the same S3 path prefix, hence
sc.textFile(s3://directory_path/) won't work. I am looking for pointers on
how to implement something analogous to HadoopRDD or JdbcRDD, but in Java.

I am looking to do something similar to what they have done here:
https://github.com/lagerspetz/TimeSeriesSpark/blob/master/src/spark/timeseries/dynamodb/DynamoDbRDD.scala.
That one reads data from Dynamo; my custom RDD would query DynamoDB for the
S3 file keys and then load the files from S3.
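
One way to get JdbcRDD-like behaviour without subclassing RDD from Java
(which drags in Scala ClassTags and collections) is to parallelize the keys
and read each object on the executors. A rough sketch, assuming Spark 1.x's
Java API and the AWS SDK v1, with a placeholder bucket name:

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class S3KeyedRDD {
    public static JavaRDD<String> lines(JavaSparkContext sc, List<String> s3Keys) {
        // One partition per key, analogous to JdbcRDD's one partition per id range.
        return sc.parallelize(s3Keys, s3Keys.size()).mapPartitions(keys -> {
            // Runs on the executor, so create the S3 client here, not on the driver.
            AmazonS3Client s3 = new AmazonS3Client();
            List<String> out = new ArrayList<>();
            while (keys.hasNext()) {
                S3Object obj = s3.getObject("my-bucket", keys.next());
                try (BufferedReader r = new BufferedReader(
                        new InputStreamReader(obj.getObjectContent()))) {
                    for (String line = r.readLine(); line != null; line = r.readLine()) {
                        out.add(line);
                    }
                }
            }
            return out; // Spark 1.x FlatMapFunction returns an Iterable
        });
    }
}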

On Mon, May 25, 2015 at 8:19 PM, Alex Robbins alexander.j.robb...@gmail.com
 wrote:

 If a Hadoop InputFormat already exists for your data source, you can load
 it from there. Otherwise, maybe you can dump your data source out as text
 and load it from there. Without more detail on what your data source is,
 it'll be hard for anyone to help.

 On Mon, May 25, 2015 at 5:00 PM, swaranga sarma.swara...@gmail.com
 wrote:

 Hello,

 I have a custom data source and I want to load the data into Spark to
 perform some computations. For this I see that I might need to implement a
 new RDD for my data source.

 I am a complete Scala noob and I am hoping that I can implement the RDD in
 Java only. I looked around the internet and could not find any resources.
 Any pointers?




-- 
Sent from my Lumia thumb-typed with errors.