I have a similar case: I have an RDD[(List[Any], List[Long])] and want to save it as a Parquet file. My understanding is that only an RDD of case classes can be converted to a SchemaRDD. Is there any way to save this RDD as a Parquet file without using Avro?
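For illustration, here is a minimal sketch of one way this might be done without case classes or Avro, by building rows against an explicit schema. It assumes a considerably newer Spark than the one discussed in this thread (the SparkSession / createDataFrame(rowRDD, schema) API), and it assumes the RDD holds pairs of (List[Any], List[Long]). Since Parquet needs concrete column types, the untyped list is stored as strings here; that choice, the column names "values" and "ids", the sample data, and the output path are all assumptions made for the example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructField, StructType}

object ListsToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lists-to-parquet")
      .master("local[*]")
      .getOrCreate()

    // Stand-in data with the shape described in the question (assumed to be pairs).
    val rdd: RDD[(List[Any], List[Long])] = spark.sparkContext.parallelize(Seq(
      (List[Any]("a", 1, 2.5), List(1L, 2L)),
      (List[Any]("b", 2, 3.5), List(3L))
    ))

    // Explicit schema instead of a case class. Column names are hypothetical.
    // The List[Any] column is stored as an array of strings (an assumption),
    // because Parquet columns need a single concrete element type.
    val schema = StructType(Seq(
      StructField("values", ArrayType(StringType, containsNull = true), nullable = false),
      StructField("ids", ArrayType(LongType, containsNull = false), nullable = false)
    ))

    // Convert each pair into a Row that matches the schema above.
    val rows: RDD[Row] = rdd.map { case (values, ids) =>
      Row(values.map(v => Option(v).map(_.toString).orNull), ids)
    }

    // Hypothetical output path; "overwrite" replaces any earlier run's output.
    spark.createDataFrame(rows, schema)
      .write.mode("overwrite")
      .parquet("/tmp/lists.parquet")

    spark.stop()
  }
}
```

If the elements of the List[Any] actually have known, fixed types, it would probably be better to give each one its own typed column rather than flattening everything to strings.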

Thanks in advance
Anita

On 18 June 2014 05:03, Michael Armbrust <mich...@databricks.com> wrote:

> If you convert the data to a SchemaRDD you can save it as Parquet:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet
>
>
> On Tue, Jun 17, 2014 at 11:47 PM, Padmanabhan, Mahesh (contractor) <
> mahesh.padmanab...@twc-contractor.com> wrote:
>
>> Thanks Krishna. Seems like you have to use Avro and then convert that
>> to Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll
>> look into this some more.
>>
>> Thanks,
>> Mahesh
>>
>> From: Krishna Sankar <ksanka...@gmail.com>
>> Reply-To: "user@spark.apache.org" <user@spark.apache.org>
>> Date: Tuesday, June 17, 2014 at 2:41 PM
>> To: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: Spark streaming RDDs to Parquet records
>>
>> Mahesh,
>>
>> - One direction could be : create a parquet schema, convert & save
>> the records to hdfs.
>> - This might help
>>
>> https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
>>
>> Cheers
>> <k/>
>>
>>
>> On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc <
>> mahesh.padmanab...@twc-contractor.com> wrote:
>>
>>> Hello,
>>>
>>> Is there an easy way to convert RDDs within a DStream into Parquet
>>> records?
>>> Here is some incomplete pseudo code:
>>>
>>> // Create streaming context
>>> val ssc = new StreamingContext(...)
>>>
>>> // Obtain a DStream of events
>>> val ds = KafkaUtils.createStream(...)
>>>
>>> // Get Spark context to get to the SQL context
>>> val sc = ds.context.sparkContext
>>>
>>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>>
>>> // For each RDD
>>> ds.foreachRDD((rdd: RDD[Array[Byte]]) => {
>>>
>>> // What do I do next?
>>> })
>>>
>>> Thanks,
>>> Mahesh
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
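
For the original question in this thread (what to do inside foreachRDD), the following is a rough sketch of the "convert each RDD to a schema-backed dataset and save it as Parquet" idea. It is written against a much newer Spark API than the one available when this thread was posted (SparkSession, toDF, DataFrameWriter), so treat it as an illustration rather than what the SQLContext of that time offered. The Event case class, the assumption that each Array[Byte] is UTF-8 text, the queue-backed stream standing in for KafkaUtils.createStream(...), and the output path are all hypothetical.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical record type; the real event layout is not given in the thread.
case class Event(payload: String)

object StreamToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stream-to-parquet")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._ // brings toDF() into scope for RDDs of case classes

    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // Stand-in for the Kafka stream: a queue holding one small RDD of byte arrays.
    val queue = mutable.Queue[RDD[Array[Byte]]](
      spark.sparkContext.parallelize(
        Seq("e1", "e2").map(_.getBytes(StandardCharsets.UTF_8)))
    )
    val ds = ssc.queueStream(queue)

    ds.foreachRDD { (rdd: RDD[Array[Byte]]) =>
      // Skip empty micro-batches so no empty Parquet output is written.
      if (!rdd.isEmpty()) {
        rdd.map(bytes => Event(new String(bytes, StandardCharsets.UTF_8))) // assumes UTF-8 text
          .toDF()
          .write.mode("append")           // each batch adds new files under the same directory
          .parquet("/tmp/events.parquet") // hypothetical output path
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // run briefly for the example, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

Appending every micro-batch to one directory tends to produce many small Parquet files, so a real job would likely need larger batch intervals or a periodic compaction step.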