You can load the text with sc.textFile() into an RDD[String], then use .map() to convert it into an RDD[Row]. At that point you are ready to apply a schema with sqlContext.createDataFrame(rddOfRow, structType).
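For concreteness, here is a rough sketch of the whole flow as it would look in the spark-shell (so sc and sqlContext are already in scope). The input path, the comma delimiter, and the id/name columns are made up for illustration; adjust the split and the schema to match your data. The last line uses df.write.parquet(...), which with the default data source settings is equivalent to the save("/path") call mentioned below.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

// bz2 input is decompressed transparently by textFile
val lines = sc.textFile("/data/input.txt.bz2")

// turn each line into a Row that matches the schema defined below
val rowRDD = lines
  .map(_.split(","))
  .map(fields => Row(fields(0).toInt, fields(1)))

// the schema (StructType) describing the columns
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// apply the schema, then write the result out as Parquet
val df = sqlContext.createDataFrame(rowRDD, schema)
df.write.parquet("/data/output_parquet")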
The StructType API doc below includes an example of how to define the schema you combine with the RDD[Row] to create a DataFrame:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType

Once you have the DataFrame, save it with dataframe.save("/path") to create the Parquet files.

Reference for SQLContext / createDataFrame:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext

> On Jul 24, 2016, at 5:34 PM, janardhan shetty <janardhan...@gmail.com> wrote:
>
> We have data in Bz2 compression format. Any links in Spark to convert into
> Parquet, and also performance benchmarks and study materials?