Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
I am not looking for Spark SQL specifically. My use case is that I need to save an RDD as a Parquet file in HDFS at the end of a batch, then load it back and convert it into an RDD in the next batch. The RDD has a String and a Long as the key/value pairs.

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
How do I convert a Parquet file that is saved in HDFS to an RDD after reading the file from HDFS?

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
Hi, we are using Avro with compression (Snappy). As soon as you have enough partitions, saving won't be a problem, IMHO. In general HDFS is pretty fast; S3 less so. The issue with storing data is that you will lose your partitioner at load time (even though the RDD has it). There is a PR that
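A minimal sketch of the Avro-with-Snappy approach described above, applied to the (String, Long) pair RDD from the original question. This uses the standard avro-mapred/avro-mapreduce classes; the helper name and output path are illustrative, not from the thread, and the exact output-format type parameters may need adjusting for your Avro version.

```scala
import org.apache.avro.Schema
import org.apache.avro.mapred.{AvroKey, AvroValue}
import org.apache.avro.mapreduce.{AvroJob, AvroKeyValueOutputFormat}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

// Hypothetical helper: persist an RDD[(String, Long)] as Snappy-compressed
// Avro key/value container files in HDFS.
def saveAsSnappyAvro(rdd: RDD[(String, Long)], path: String): Unit = {
  val job = Job.getInstance(rdd.context.hadoopConfiguration)
  AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING))
  AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.LONG))
  // Enable Snappy block compression for the Avro container files.
  job.getConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
  job.getConfiguration.set("avro.output.codec", "snappy")

  rdd
    .map { case (k, v) => (new AvroKey(k), new AvroValue(java.lang.Long.valueOf(v))) }
    .saveAsNewAPIHadoopFile(
      path,
      classOf[AvroKey[String]],
      classOf[AvroValue[java.lang.Long]],
      classOf[AvroKeyValueOutputFormat[String, java.lang.Long]],
      job.getConfiguration)
}
```

As noted above, the partitioner is not preserved in the files; if downstream joins depend on it, you would need to re-partition after loading.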

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
OK. I found the following code that does that: def readParquetRDD[T <% SpecificRecord](sc: SparkContext, parquetFile: String)(implicit tag: ClassTag[T]): RDD[T] = { val jobConf = new JobConf(sc.hadoopConfiguration) ParquetInputFormat.setReadSupportClass(jobConf, classOf[AvroReadSupport[T]])
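The snippet above is cut off mid-function. A plausible completion, under the assumption that it finishes by calling newAPIHadoopFile and discarding the Void keys (the standard pattern for ParquetInputFormat with AvroReadSupport in Spark 1.x; package names match the pre-org.apache parquet-mr releases current in 2015):

```scala
import org.apache.avro.specific.SpecificRecord
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import parquet.avro.AvroReadSupport
import parquet.hadoop.ParquetInputFormat

import scala.reflect.ClassTag

def readParquetRDD[T <% SpecificRecord](sc: SparkContext, parquetFile: String)
                                       (implicit tag: ClassTag[T]): RDD[T] = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  ParquetInputFormat.setReadSupportClass(jobConf, classOf[AvroReadSupport[T]])
  // ParquetInputFormat yields (Void, T) pairs; keep only the records.
  sc.newAPIHadoopFile(
    parquetFile,
    classOf[ParquetInputFormat[T]],
    classOf[Void],
    tag.runtimeClass.asInstanceOf[Class[T]],
    jobConf
  ).map(_._2)
}
```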

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
Java/Scala? I think everything you need is in the DataFrames tutorial. E.g. if you have a DataFrame and are working from Java: toJavaRDD().

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
No, Scala. Suppose I read the Parquet file as shown in the following. How would that be converted to an RDD to use in my Spark batch? I use core Spark; I don't use Spark SQL. ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[AminoAcid]]) val file =
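A sketch of how the truncated snippet might continue using only core Spark (no Spark SQL): newAPIHadoopFile returns an RDD of (Void, record) pairs, and mapping away the keys yields the desired RDD. AminoAcid is the Avro class named in the message; the path is illustrative.

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import parquet.avro.AvroReadSupport
import parquet.hadoop.ParquetInputFormat

val job = Job.getInstance(sc.hadoopConfiguration)
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[AminoAcid]])
val file = sc.newAPIHadoopFile(
  "hdfs:///tmp/aminoacids.parquet",            // path is illustrative
  classOf[ParquetInputFormat[AminoAcid]],
  classOf[Void],
  classOf[AminoAcid],
  job.getConfiguration)
val records: RDD[AminoAcid] = file.map(_._2)   // drop the Void keys
```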

Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-04 Thread swetha
Hi, What is an efficient approach to save an RDD as a file in HDFS and retrieve it back? I was deciding between Avro, Parquet, and SequenceFileFormat. We currently use SequenceFileFormat for one of our use cases. Any example of how to store and retrieve an RDD in an Avro and Parquet file
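For reference, the SequenceFile round trip mentioned above is the simplest of the three options for a (String, Long) pair RDD, since Spark core handles the Writable conversions implicitly. A minimal sketch, assuming sc is a live SparkContext and the HDFS path (illustrative here) is writable:

```scala
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Long)] = sc.parallelize(Seq(("a", 1L), ("b", 2L)))
pairs.saveAsSequenceFile("hdfs:///tmp/pairs")        // Writable conversion is implicit
val restored: RDD[(String, Long)] =
  sc.sequenceFile[String, Long]("hdfs:///tmp/pairs") // read back in the next batch
```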

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-04 Thread Stefano Baghino
What scenario would you like to optimize for? If you have something more specific about your use case, the mailing list can surely provide you with some very good advice. If you just want to save an RDD as Avro, you can use a module from Databricks (the README on GitHub
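A sketch of the Databricks module route suggested above (the spark-avro package, artifact com.databricks:spark-avro). It does go through Spark SQL, which the original poster wanted to avoid, but a plain RDD is recoverable by mapping over the resulting DataFrame. Paths and column names here are illustrative.

```scala
import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Write the (String, Long) pairs out as Avro via a DataFrame...
sc.parallelize(Seq(("a", 1L), ("b", 2L)))
  .toDF("key", "value")
  .write.avro("hdfs:///tmp/avro-out")                // path is illustrative

// ...and read them back, converting rows back to an RDD of pairs.
val restored = sqlContext.read.avro("hdfs:///tmp/avro-out")
  .map(r => (r.getString(0), r.getLong(1)))          // RDD[(String, Long)]
```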