[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034650#comment-15034650 ]
swetha k edited comment on SPARK-11620 at 12/1/15 9:50 PM:
-----------------------------------------------------------

[~hyukjin.kwon] I have the following code, which saves the Parquet files from my hourly batch to HDFS. It is based on the GitHub example linked at the end. The WARNING message I get is the one shown in the previous comments. Any idea why this is happening?

    val job = Job.getInstance()

    var filePath = "path"

    val metricsPath: Path = new Path(filePath)

    //Check if inputFile exists
    val fs: FileSystem = FileSystem.get(job.getConfiguration)

    if (fs.exists(metricsPath)) {
      fs.delete(metricsPath, true)
    }

    // Configure the ParquetOutputFormat to use Avro as the serialization format
    ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
    // You need to pass the schema to AvroParquet when you are writing objects but not when you
    // are reading them. The schema is saved in Parquet file for future readers to use.
    AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

    // Create a PairRDD with all keys set to null and wrap each Metrics in serializable objects
    val metricsToBeSaved = metrics.map(metricRecord => (null, new SerializableMetrics(new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))));

    metricsToBeSaved.coalesce(1500)
    // Save the RDD to a Parquet file in our temporary output directory
    metricsToBeSaved.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
      classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)

https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11620
>                 URL: https://issues.apache.org/jira/browse/SPARK-11620
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: swetha k
>