I am not sure if this is the easiest way to solve your problem, but you can
connect to the Hive metastore (through Derby) and find the HDFS path from
there.
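
For example (a minimal sketch, untested against your setup): in the
standard metastore schema, each table's HDFS path lives in SDS.LOCATION,
joined to TBLS on SD_ID, so a small JDBC query against the embedded Derby
database can print it. The metastore_db path and the object name below are
assumptions; adjust them for your deployment. Note that embedded Derby
accepts only one connection at a time, so run this while the metastore is
not otherwise in use.

import java.sql.DriverManager

// Assumed local Derby metastore at "metastore_db"; adjust the JDBC URL
// for your deployment. TBLS and SDS are standard Hive metastore tables;
// SDS.LOCATION holds the table's HDFS path.
object FindTableLocation {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.derby.jdbc.EmbeddedDriver") // explicit load for older JVMs
    val conn = DriverManager.getConnection("jdbc:derby:metastore_db")
    try {
      val stmt = conn.prepareStatement(
        "SELECT s.LOCATION FROM TBLS t " +
        "JOIN SDS s ON t.SD_ID = s.SD_ID WHERE t.TBL_NAME = ?")
      stmt.setString(1, args(0)) // table name to look up
      val rs = stmt.executeQuery()
      while (rs.next()) println(rs.getString("LOCATION"))
    } finally {
      conn.close()
    }
  }
}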

On Wed, Feb 18, 2015 at 9:31 AM, Vasu C <vasuc.bigd...@gmail.com> wrote:

> Hi,
>
> I am running a Spark batch processing job using the spark-submit
> command, and below is my code snippet: basically, it converts a JSON RDD
> to Parquet and stores it in an HDFS location.
>
> The problem I am facing is that when multiple jobs are triggered in
> parallel, even though each job executes properly (as I can see in the
> Spark web UI), no Parquet file is created in the HDFS path for some of
> them. If 5 jobs are executed in parallel, only 3 Parquet files get
> created.
>
> Is this a data loss scenario, or am I missing something here? Please
> help me with this.
>
> Here tableName is unique, with a timestamp appended to it.
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> // Convert the JSON RDD to a SchemaRDD
> val jsonRdd = sqlContext.jsonRDD(results)
>
> // Load the existing Parquet file and register it as a temp table
> val parquetTable = sqlContext.parquetFile(parquetFilePath)
> parquetTable.registerTempTable(tableName)
>
> // Append the JSON data to the Parquet-backed table
> jsonRdd.insertInto(tableName)
>
>
> Regards,
>
>   Vasu C
>



-- 


*Arush Kharbanda* || Technical Team Lead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
