[jira] [Comment Edited] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

swetha k (JIRA) Tue, 01 Dec 2015 13:51:49 -0800

    [ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034650#comment-15034650 ]


swetha k edited comment on SPARK-11620 at 12/1/15 9:50 PM:
-----------------------------------------------------------

[~hyukjin.kwon]

The code below saves the Parquet files from my hourly batch to HDFS; it is
based on the GitHub example linked at the end. The WARNING message I get is
the one shown in the previous comments. Any idea why this is happening?

        // Imports assumed from the pre-Apache parquet.* package namespace
        // that appears in the issue title.
        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.hadoop.mapreduce.Job
        import parquet.avro.{AvroParquetOutputFormat, AvroWriteSupport}
        import parquet.hadoop.ParquetOutputFormat

        val job = Job.getInstance()
        val filePath = "path"
        val metricsPath: Path = new Path(filePath)
        val fs: FileSystem = FileSystem.get(job.getConfiguration)

        // Delete the output path if it already exists
        if (fs.exists(metricsPath)) {
          fs.delete(metricsPath, true)
        }

        // Configure the ParquetOutputFormat to use Avro as the serialization format
        ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
        // You need to pass the schema to AvroParquet when you are writing objects
        // but not when you are reading them. The schema is saved in the Parquet
        // file for future readers to use.
        AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

        // Create a PairRDD with all keys set to null and wrap each Metrics
        // in serializable objects
        val metricsToBeSaved = metrics.map(metricRecord =>
          (null, new SerializableMetrics(
            new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))

        // coalesce returns a new RDD, so keep the result and save from it
        val coalescedMetrics = metricsToBeSaved.coalesce(1500)
        // Save the RDD to a Parquet file in our temporary output directory
        coalescedMetrics.saveAsNewAPIHadoopFile(filePath, classOf[Void],
          classOf[Metrics], classOf[ParquetOutputFormat[Metrics]],
          job.getConfiguration)


https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
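
One detail that may be worth checking in code like the above: RDD
transformations such as coalesce return a new RDD and leave the RDD they are
called on unchanged, so a reduced partition count only reaches the output if
the save is called on the returned RDD. The following is a minimal sketch, not
taken from the job above; the object name CoalesceSketch, the local[2] master
and the sample data are assumptions made purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-alone example; none of these names come from the job above.
object CoalesceSketch {
  // Returns (partition count of the original RDD after a discarded coalesce,
  //          partition count of the RDD that coalesce returned).
  def demo(): (Int, Int) = {
    val conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val records = sc.parallelize(1 to 1000, 8)
      records.coalesce(2)              // return value dropped: `records` is unchanged
      val fewer = records.coalesce(2)  // keep the returned RDD and write from it
      (records.partitions.length, fewer.partitions.length)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (original, coalesced) = demo()
    println(s"original RDD: $original partitions, coalesced RDD: $coalesced partitions")
  }
}
```

Whether the number of output files is related to the WARNING is a separate
question; the sketch only shows which RDD carries the reduced partition count.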


> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11620
>                 URL: https://issues.apache.org/jira/browse/SPARK-11620
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: swetha k
>




