Hi,

I am running a Spark batch-processing job via spark-submit, and my code
snippet is below. It basically converts a JSON RDD to Parquet and stores
it in an HDFS location.

The problem I am facing is that when multiple jobs are triggered in
parallel, each job appears to complete successfully (as I can see in the
Spark web UI), yet not every run produces a Parquet file at the HDFS
path. For example, if 5 jobs run in parallel, only 3 Parquet files get
created.

Is this a data-loss scenario, or am I missing something here? Please
help me with this.

Here, tableName is unique, with a timestamp appended to it.


val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Infer a schema from the JSON RDD (results is an RDD[String] of JSON).
val jsonRdd = sqlContext.jsonRDD(results)

// Load the existing Parquet data and register it as a temp table.
val parquetTable = sqlContext.parquetFile(parquetFilePath)

parquetTable.registerTempTable(tableName)

// Append the JSON-derived rows into the Parquet-backed table.
jsonRdd.insertInto(tableName)
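
Would writing each job's output straight to its own path avoid the
collision? A minimal sketch of what I mean, assuming the same Spark 1.x
API as above (the per-job output path below is just an illustration):

// Each job writes to a directory derived from its unique, timestamped
// tableName, so no two parallel jobs touch the same Parquet files.
val jsonRdd = sqlContext.jsonRDD(results)
jsonRdd.saveAsParquetFile(parquetFilePath + "/" + tableName)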


Regards,

  Vasu C
