Andrew,

On 2.0 I tried:

    val inputR = sc.textFile(file)
    val inputS = inputR.map(x => x.split("`"))
    val inputDF = inputS.toDF()
    inputDF.write.format("parquet").save("result.parquet")

The result part files end with *.snappy.parquet. Is that expected?
On Sun, Jul 24, 2016 at 8:00 PM, Andrew Ehrlich <and...@aehrlich.com> wrote:
> You can load the text with sc.textFile() to an RDD[String], then use
> .map() to convert it into an RDD[Row]. At this point you are ready to
> apply a schema. Use sqlContext.createDataFrame(rddOfRow, structType).
>
> Here is an example of how to define the StructType (schema) that you will
> combine with the RDD[Row] to create a DataFrame:
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType
>
> Once you have the DataFrame, save it to Parquet with
> dataframe.save("/path") to create a Parquet file.
>
> Reference for SQLContext / createDataFrame:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
>
>> On Jul 24, 2016, at 5:34 PM, janardhan shetty <janardhan...@gmail.com> wrote:
>>
>> We have data in Bz2 compression format. Are there any links on converting it to
>> Parquet in Spark, and also performance benchmarks and case-study materials?
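
Putting the thread together, here is a minimal sketch of the RDD[Row] + StructType route Andrew describes, assuming a spark-shell session where sc and sqlContext are predefined, and assuming backtick-delimited text with two string fields; the input/output paths and column names are hypothetical:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // sc.textFile() picks a decompression codec from the file extension,
    // so .bz2 input needs no extra configuration.
    val lines = sc.textFile("/data/input.bz2")

    // One Row per line, assuming two backtick-separated string fields.
    val rows = lines.map(_.split("`")).map(a => Row(a(0), a(1)))

    // The schema to combine with the RDD[Row].
    val schema = StructType(Seq(
      StructField("col1", StringType, nullable = true),
      StructField("col2", StringType, nullable = true)))

    val df = sqlContext.createDataFrame(rows, schema)

    // Write as Parquet; the part files get a .snappy.parquet suffix
    // because snappy is the default codec.
    df.write.parquet("/data/output.parquet")

Note that bz2 is a splittable compression format, so a large input file still parallelizes across tasks.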
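
As for the *.snappy.parquet suffix: snappy is the default Parquet compression codec in Spark 2.0, so yes, that is expected. A quick sketch of overriding it, assuming a spark-shell session where spark is the SparkSession (the output path is hypothetical):

    // Per-write override of the Parquet codec:
    inputDF.write.option("compression", "gzip").parquet("result_gzip.parquet")

    // Or session-wide via the SQL config:
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")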