Re: Implementing custom RDD in Java
I know it isn't exactly what you are asking for, but you could solve it like this: have the driver program query Dynamo for the S3 file keys, call sc.textFile on each of the file keys, and .union them all together to make your RDD. You could wrap that up in a function and it wouldn't be too painful to reuse. I don't personally know about creating custom RDDs in Java.

On Mon, May 25, 2015 at 10:37 PM, Swaranga Sarma sarma.swara...@gmail.com wrote:

> My data is in S3 and is indexed in Dynamo. For example, if I want to load data for a given time range, I first need to query Dynamo for the S3 file keys for that range and then load them in Spark. The files may not always share the same S3 path prefix, so sc.textFile("s3://directory_path/") won't work. I am looking for pointers on how to implement something analogous to HadoopRDD or JdbcRDD, but in Java. I am looking to do something similar to what they have done here: https://github.com/lagerspetz/TimeSeriesSpark/blob/master/src/spark/timeseries/dynamodb/DynamoDbRDD.scala. That one reads its data from Dynamo; my custom RDD would query DynamoDB for the S3 file keys and then load them from S3.

On Mon, May 25, 2015 at 8:19 PM, Alex Robbins alexander.j.robb...@gmail.com wrote:

> If a Hadoop InputFormat already exists for your data source, you can load it from there. Otherwise, maybe you can dump your data source out as text and load it from there. Without more detail on what your data source is, it'll be hard for anyone to help.

On Mon, May 25, 2015 at 5:00 PM, swaranga sarma.swara...@gmail.com wrote:

> Hello, I have a custom data source and I want to load the data into Spark to perform some computations. For this, I see that I might need to implement a new RDD for my data source. I am a complete Scala noob, and I am hoping that I can implement the RDD in Java only. I looked around the internet and could not find any resources. Any pointers?
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Implementing-custom-RDD-in-Java-tp23026.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
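[Editor's note: the driver-side approach Alex describes above can be sketched as follows. This is a hedged sketch, not code from the thread: the bucket name, the key format, and the idea of doing the Dynamo lookup before building any RDDs are illustrative assumptions, and the Spark calls are shown in comments so the helper stands alone.]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the driver-side approach: query Dynamo for S3 object keys first,
// then turn each key into a full path that sc.textFile() understands, and
// union the per-file RDDs together. The real key lookup would use the AWS
// SDK's DynamoDB client; it is omitted here.
public class S3PathUnion {

    // Turn bare S3 object keys into full s3n:// paths. The "s3n" scheme and
    // the single-bucket layout are assumptions for illustration.
    public static List<String> buildS3Paths(String bucket, List<String> keys) {
        List<String> paths = new ArrayList<String>();
        for (String key : keys) {
            paths.add("s3n://" + bucket + "/" + key);
        }
        return paths;
    }

    // With a JavaSparkContext in hand, the union itself is a simple fold:
    //
    //   JavaRDD<String> result = null;
    //   for (String path : buildS3Paths(bucket, keys)) {
    //       JavaRDD<String> part = sc.textFile(path);
    //       result = (result == null) ? part : result.union(part);
    //   }
    //
    // Note that textFile also accepts a comma-separated list of paths (it is
    // backed by Hadoop's FileInputFormat), which avoids the explicit loop for
    // a moderate number of files.

    public static void main(String[] args) {
        List<String> keys = new ArrayList<String>();
        keys.add("2015/05/25/part-0000.gz");
        System.out.println(buildS3Paths("my-bucket", keys));
    }
}
```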
Implementing custom RDD in Java
Hello, I have a custom data source and I want to load the data into Spark to perform some computations. For this, I see that I might need to implement a new RDD for my data source. I am a complete Scala noob, and I am hoping that I can implement the RDD in Java only. I looked around the internet and could not find any resources. Any pointers?
Re: Implementing custom RDD in Java
My data is in S3 and is indexed in Dynamo. For example, if I want to load data for a given time range, I first need to query Dynamo for the S3 file keys for that range and then load them in Spark. The files may not always share the same S3 path prefix, so sc.textFile("s3://directory_path/") won't work. I am looking for pointers on how to implement something analogous to HadoopRDD or JdbcRDD, but in Java. I am looking to do something similar to what they have done here: https://github.com/lagerspetz/TimeSeriesSpark/blob/master/src/spark/timeseries/dynamodb/DynamoDbRDD.scala. That one reads its data from Dynamo; my custom RDD would query DynamoDB for the S3 file keys and then load them from S3.

On Mon, May 25, 2015 at 8:19 PM, Alex Robbins alexander.j.robb...@gmail.com wrote:

> If a Hadoop InputFormat already exists for your data source, you can load it from there. Otherwise, maybe you can dump your data source out as text and load it from there. Without more detail on what your data source is, it'll be hard for anyone to help.

On Mon, May 25, 2015 at 5:00 PM, swaranga sarma.swara...@gmail.com wrote:

> Hello, I have a custom data source and I want to load the data into Spark to perform some computations. For this, I see that I might need to implement a new RDD for my data source. I am a complete Scala noob, and I am hoping that I can implement the RDD in Java only. I looked around the internet and could not find any resources. Any pointers?
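[Editor's note: for readers who want the HadoopRDD/JdbcRDD-style route asked about above, extending org.apache.spark.rdd.RDD directly from Java looks roughly like the sketch below. It targets the Spark 1.x / Scala 2.10–2.11 era of this thread, requires Spark on the classpath, and is not a tested implementation: the class name S3KeyRDD, the readLines helper, and the one-partition-per-key layout are all illustrative assumptions.]

```java
import java.util.Collections;
import java.util.List;

import org.apache.spark.Dependency;
import org.apache.spark.Partition;
import org.apache.spark.SparkContext;
import org.apache.spark.TaskContext;
import org.apache.spark.rdd.RDD;

import scala.collection.JavaConversions;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

// Sketch: an RDD written in Java whose partitions are S3 file keys. The
// DynamoDB query is assumed to run on the driver before the RDD is built,
// producing the list of keys passed to the constructor.
public class S3KeyRDD extends RDD<String> {

    // Scala's RDD constructor takes an implicit ClassTag; from Java we
    // construct one explicitly.
    private static final ClassTag<String> STRING_TAG =
            ClassTag$.MODULE$.apply(String.class);

    private final List<String> s3Keys;

    public S3KeyRDD(SparkContext sc, List<String> s3Keys) {
        // An input RDD has no parent dependencies, so pass an empty Seq.
        super(sc,
              JavaConversions.asScalaBuffer(
                      Collections.<Dependency<?>>emptyList()),
              STRING_TAG);
        this.s3Keys = s3Keys;
    }

    // One partition per S3 key; a real implementation might group small
    // files into fewer partitions.
    static class KeyPartition implements Partition {
        private final int index;
        final String key;

        KeyPartition(int index, String key) {
            this.index = index;
            this.key = key;
        }

        @Override
        public int index() {
            return index;
        }
    }

    @Override
    public Partition[] getPartitions() {
        Partition[] parts = new Partition[s3Keys.size()];
        for (int i = 0; i < s3Keys.size(); i++) {
            parts[i] = new KeyPartition(i, s3Keys.get(i));
        }
        return parts;
    }

    @Override
    public scala.collection.Iterator<String> compute(Partition split,
                                                     TaskContext context) {
        KeyPartition p = (KeyPartition) split;
        // Hypothetical helper: a real version would stream lines from the
        // S3 object (e.g. via the AWS SDK's AmazonS3#getObject).
        java.util.Iterator<String> lines = readLines(p.key);
        return JavaConversions.asScalaIterator(lines);
    }

    private java.util.Iterator<String> readLines(String key) {
        // Placeholder so the sketch stands alone.
        return Collections.<String>emptyList().iterator();
    }
}
```

To use the result with the Java API, the Scala RDD can be wrapped, e.g. with JavaRDD.fromRDD(new S3KeyRDD(sc, keys), STRING_TAG). For many workloads, though, the textFile-plus-union approach earlier in the thread achieves the same end with much less machinery.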