Re: Bzip2 to Parquet format

2016-07-25 Thread Takeshi Yamamuro
Hi, this is the expected behaviour. The default compression for Parquet is `snappy`. See: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L215 // maropu
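If a different codec is wanted, that default can be overridden before writing. A minimal sketch against the 2.0 API (the gzip choice and output path are only examples; inputDF is the DataFrame from the message below):

    // Sketch: override the default Parquet codec (snappy) via the SQL config linked above.
    // Accepted values include "uncompressed", "snappy", "gzip", "lzo".
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")   // sqlContext.setConf(...) on 1.x
    inputDF.write.parquet("result_gzip.parquet")                    // part files should then end in .gz.parquet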

Re: Bzip2 to Parquet format

2016-07-25 Thread janardhan shetty
Andrew, I'm on 2.0. I tried:

val inputR = sc.textFile(file)
val inputS = inputR.map(x => x.split("`"))
val inputDF = inputS.toDF()
inputDF.write.format("parquet").save("result.parquet")

The resulting part files end with .snappy.parquet; is that expected?
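For reference, 2.0's DataFrameWriter also accepts a per-write compression option, so the codec can be named on the save itself rather than globally; a sketch:

    // Sketch: same write, but choosing the codec explicitly instead of relying on the default.
    inputDF.write.option("compression", "gzip").parquet("result.parquet")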

Re: Bzip2 to Parquet format

2016-07-24 Thread Andrew Ehrlich
You can load the text with sc.textFile() into an RDD[String], then use .map() to convert it into an RDD[Row]. At this point you are ready to apply a schema: use sqlContext.createDataFrame(rddOfRow, structType). Here is an example of how to define the StructType (schema) that you will combine with the RDD[Row].
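The archived example is cut off here; a minimal sketch of the flow described above, assuming a spark-shell sqlContext and hypothetical two-column records split on the backtick delimiter used elsewhere in the thread:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructField, StructType, StringType}

    // Hypothetical schema: two string columns; adjust names and types to the real layout.
    val schema = StructType(Seq(
      StructField("col1", StringType, nullable = true),
      StructField("col2", StringType, nullable = true)
    ))

    // sc.textFile decompresses .bz2 input transparently via the Hadoop codecs.
    val rowRDD = sc.textFile("/path/to/input.bz2")
      .map(_.split("`"))
      .map(fields => Row(fields(0), fields(1)))   // one Row per line, matching the schema

    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.write.parquet("/path/to/output.parquet")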

Bzip2 to Parquet format

2016-07-24 Thread janardhan shetty
We have data in Bz2 compressed format. Are there any pointers in Spark for converting it to Parquet, and also any performance benchmarks or case-study materials?