Re: JsonRDD to parquet -- data loss

2015-02-18 Thread Michael Armbrust
Concurrent inserts into the same table are not supported.  I can try to
make this clearer in the documentation.
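
In the meantime, a common workaround is to give every job its own output
directory instead of inserting into a shared Parquet table. A minimal
sketch, assuming results, parquetFilePath and tableName are the same
values as in your snippet (the per-job subdirectory is just an
illustration):

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val jsonRdd = sqlContext.jsonRDD(results)
  // Write to a directory unique to this job rather than inserting into a
  // table that other jobs may be writing to at the same time.
  val jobOutputPath = parquetFilePath + "/" + tableName  // hypothetical layout
  jsonRdd.saveAsParquetFile(jobOutputPath)

Because each call to saveAsParquetFile targets its own directory, parallel
jobs no longer race on the same files.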

On Tue, Feb 17, 2015 at 8:01 PM, Vasu C vasuc.bigd...@gmail.com wrote:

 Hi,

 I am running a Spark batch-processing job using the spark-submit command;
 my code snippet is below. Basically it converts a JsonRDD to Parquet and
 stores it in an HDFS location.

 The problem I am facing is that if multiple jobs are triggered in parallel,
 even though each job executes properly (as I can see in the Spark web UI),
 no Parquet file is created in the HDFS path. If 5 jobs are executed in
 parallel, only 3 Parquet files get created.

 Is this a data-loss scenario, or am I missing something here? Please help
 me with this.

 Here tableName is unique, with a timestamp appended to it.


 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val jsonRdd = sqlContext.jsonRDD(results)
 val parquetTable = sqlContext.parquetFile(parquetFilePath)
 parquetTable.registerTempTable(tableName)
 jsonRdd.insertInto(tableName)


 Regards,

   Vasu C



Re: JsonRDD to parquet -- data loss

2015-02-17 Thread Arush Kharbanda
I am not sure if this is the easiest way to solve your problem, but you can
connect to the Hive metastore (through Derby) and find the HDFS path from
there.
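
A rough sketch of that lookup against an embedded Derby metastore,
assuming the default metastore_db directory in the working directory of
the process that owns the tables (TBLS and SDS are the standard metastore
schema; the exact query is just an illustration):

  import java.sql.DriverManager

  // Embedded Derby permits only one JVM at a time, so run this while
  // nothing else has the metastore open.
  Class.forName("org.apache.derby.jdbc.EmbeddedDriver")
  val conn = DriverManager.getConnection("jdbc:derby:metastore_db")
  val stmt = conn.createStatement()
  // TBLS holds the table names; SDS holds each table's storage
  // descriptor, including its HDFS LOCATION.
  val rs = stmt.executeQuery(
    "SELECT t.TBL_NAME, s.LOCATION FROM TBLS t JOIN SDS s ON t.SD_ID = s.SD_ID")
  while (rs.next()) {
    println(rs.getString("TBL_NAME") + " -> " + rs.getString("LOCATION"))
  }
  conn.close()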

On Wed, Feb 18, 2015 at 9:31 AM, Vasu C vasuc.bigd...@gmail.com wrote:

 Hi,

 I am running a Spark batch-processing job using the spark-submit command;
 my code snippet is below. Basically it converts a JsonRDD to Parquet and
 stores it in an HDFS location.

 The problem I am facing is that if multiple jobs are triggered in parallel,
 even though each job executes properly (as I can see in the Spark web UI),
 no Parquet file is created in the HDFS path. If 5 jobs are executed in
 parallel, only 3 Parquet files get created.

 Is this a data-loss scenario, or am I missing something here? Please help
 me with this.

 Here tableName is unique, with a timestamp appended to it.


 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val jsonRdd = sqlContext.jsonRDD(results)
 val parquetTable = sqlContext.parquetFile(parquetFilePath)
 parquetTable.registerTempTable(tableName)
 jsonRdd.insertInto(tableName)


 Regards,

   Vasu C




-- 


Arush Kharbanda || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com