Re: Spark streaming RDDs to Parquet records

Anita Tailor Thu, 19 Jun 2014 11:53:57 -0700

I have a similar case: I have an RDD[List[Any], List[Long]] and want to
save it as a Parquet file.
My understanding is that only an RDD of case classes can be converted to a
SchemaRDD. Is there any way I can save this RDD as a Parquet file without
using Avro?


Thanks in advance
Anita


On 18 June 2014 05:03, Michael Armbrust <mich...@databricks.com> wrote:

> If you convert the data to a SchemaRDD you can save it as Parquet:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet
>
>
> On Tue, Jun 17, 2014 at 11:47 PM, Padmanabhan, Mahesh (contractor) <
> mahesh.padmanab...@twc-contractor.com> wrote:
>
>> Thanks, Krishna. It seems like you have to use Avro and then convert that
>> to Parquet. I was hoping to convert RDDs to Parquet files directly. I’ll
>> look into this some more.
>>
>>  Thanks,
>> Mahesh
>>
>>   From: Krishna Sankar <ksanka...@gmail.com>
>> Reply-To: "user@spark.apache.org" <user@spark.apache.org>
>> Date: Tuesday, June 17, 2014 at 2:41 PM
>> To: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: Spark streaming RDDs to Parquet records
>>
>>  Mahesh,
>>
>>    - One direction could be: create a Parquet schema, then convert and
>>    save the records to HDFS.
>>    - This might help:
>>    https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
>>
>>  Cheers
>> <k/>
>>
>>
>> On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc <
>> mahesh.padmanab...@twc-contractor.com> wrote:
>>
>>> Hello,
>>>
>>> Is there an easy way to convert RDDs within a DStream into Parquet
>>> records?
>>> Here is some incomplete pseudo code:
>>>
>>> // Create streaming context
>>> val ssc = new StreamingContext(...)
>>>
>>> // Obtain a DStream of events
>>> val ds = KafkaUtils.createStream(...)
>>>
>>> // Get Spark context to get to the SQL context
>>> val sc = ds.context.sparkContext
>>>
>>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>>
>>> // For each RDD
>>> ds.foreachRDD((rdd: RDD[Array[Byte]]) => {
>>>
>>>     // What do I do next?
>>> })
>>>
>>> Thanks,
>>> Mahesh
>>>
>>>
>>
>>
>>
>
>
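
For reference, the SchemaRDD route suggested in the quoted reply above might look roughly like the sketch below, written against the Spark 1.0-era API (createSchemaRDD and saveAsParquetFile). The Event case class, the "id,payload" message layout, the parse helper, and the baseDir output location are all assumptions made for illustration; none of them come from the thread.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type: the fields are assumptions, not taken from the thread.
// It is defined at the top level so Spark SQL can infer its schema by reflection.
case class Event(id: Long, payload: String)

object ParquetSink {
  // Hypothetical decoder: assumes each message is UTF-8 text of the form "id,payload".
  // A message without a comma fails with a MatchError; real code would handle that.
  def parse(bytes: Array[Byte]): Event = {
    val Array(id, payload) = new String(bytes, "UTF-8").split(",", 2)
    Event(id.trim.toLong, payload)
  }

  // Writes each batch of the stream to its own Parquet directory under baseDir.
  def saveAsParquet(ds: DStream[Array[Byte]], sqlContext: SQLContext, baseDir: String): Unit = {
    // Brings the implicit RDD[case class] => SchemaRDD conversion into scope.
    import sqlContext.createSchemaRDD

    ds.foreachRDD { (rdd: RDD[Array[Byte]], time: Time) =>
      val events: RDD[Event] = rdd.map(parse)
      // The implicit conversion turns `events` into a SchemaRDD here.
      events.saveAsParquetFile(s"$baseDir/batch-${time.milliseconds}")
    }
  }
}
```

With this approach an RDD of lists would first have to be mapped to a case class with fixed, typed fields, because the schema is inferred from the case class by reflection. Later Spark releases replaced SchemaRDD with DataFrame, so the calls above apply only to the releases current at the time of this thread.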
