You can load the text with sc.textFile() into an RDD[String], then use .map() to convert it into an RDD[Row]. At that point you are ready to apply a schema with sqlContext.createDataFrame(rddOfRow, structType).
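For concreteness, here is a rough sketch of the whole flow as it would look in the spark-shell (so sc and sqlContext are already in scope). The input path, the comma delimiter, and the id/name columns are made up for illustration; adjust the split and the schema to match your data. The last line uses df.write.parquet(...), which with the default data source settings is equivalent to the save("/path") call mentioned below.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

// bz2 input is decompressed transparently by textFile
val lines = sc.textFile("/data/input.txt.bz2")

// turn each line into a Row that matches the schema defined below
val rowRDD = lines
  .map(_.split(","))
  .map(fields => Row(fields(0).toInt, fields(1)))

// the schema (StructType) describing the columns
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// apply the schema, then write the result out as Parquet
val df = sqlContext.createDataFrame(rowRDD, schema)
df.write.parquet("/data/output_parquet")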
The StructType API doc below includes an example of how to define the schema you combine with the RDD[Row] to create a DataFrame:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType

Once you have the DataFrame, save it with dataframe.save("/path") to create the Parquet files.

Reference for SQLContext / createDataFrame:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext

> On Jul 24, 2016, at 5:34 PM, janardhan shetty <janardhan...@gmail.com> wrote:
>
> We have data in Bz2 compression format. Any links in Spark to convert into
> Parquet, and also performance benchmarks and study materials?