Folks-
I am very new to Spark and Spark SQL.

Here is what I am doing in my application.
Can you please validate and let me know if there is a better way?

1.  Parsing ingested XML files with nested structures into individual
datasets.
Created a custom input format to split the XML so that each node becomes a
record in my RDD.
Used the Scala XML libraries to parse the XML.
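A simplified sketch of this step (the input format class, input path, and
field names below are placeholders for my actual ones):

  import scala.xml.XML
  import org.apache.hadoop.io.{LongWritable, Text}

  // each record from the custom input format is one complete XML node as text
  val rawXml = sc.newAPIHadoopFile(
    "hdfs:///data/in/*.xml",             // placeholder input path
    classOf[MyXmlInputFormat],           // placeholder for my custom input format
    classOf[LongWritable],
    classOf[Text])

  val records = rawXml.map { case (_, text) =>
    val node = XML.loadString(text.toString)
    // placeholder field extraction
    ((node \ "id").text, (node \ "payload").text)
  }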

2.  Creating a partition key for each record and using it to generate a
key-value pair RDD.
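Roughly (the key-derivation logic here is just a placeholder):

  // derive the partition key from each record and key the RDD by it
  val keyed = records.map { case (id, payload) =>
    val partKey = id.take(8)             // placeholder derivation of the key
    (partKey, (id, payload))
  }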

3.  Saving the distinct partition keys into an array, iterating through it,
creating the Hive partition directories, and issuing the Hive add-partition
command ("ALTER TABLE x ADD PARTITION (...)") via the Spark HiveContext.
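Roughly, with placeholder table and column names:

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)

  // collect the distinct partition keys to the driver and register each partition
  val partKeys = keyed.keys.distinct().collect()
  partKeys.foreach { k =>
    hiveContext.sql(
      s"ALTER TABLE mydb.x ADD IF NOT EXISTS PARTITION (part_key='$k')")
  }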

4.  Saving the RDDs using MultipleOutputs, so that each generated file has a
custom filename that includes the partition key.
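What I am doing here is roughly equivalent to this sketch (shown with the
MultipleTextOutputFormat variant of the same idea; paths and naming are
placeholders):

  import org.apache.hadoop.io.NullWritable
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  // write each record to a file whose name starts with its partition key
  class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
      key.toString + "-" + name            // e.g. <partKey>-part-00000
    override def generateActualKey(key: Any, value: Any): Any =
      NullWritable.get()                   // keep only the value in the file body
  }

  keyed.mapValues(_.toString)
    .saveAsHadoopFile("hdfs:///data/staging",
      classOf[String], classOf[String], classOf[KeyBasedOutput])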

5.  Programmatically moving the generated files into the Hive partition
directories, using the custom filenames to match each file to its partition.
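Roughly (paths are illustrative; the real destination matches the table's
partition layout created in step 3):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)
  partKeys.foreach { k =>
    // files were written as <partKey>-part-NNNNN by the output format above
    fs.globStatus(new Path(s"hdfs:///data/staging/$k-*")).foreach { st =>
      val dst = new Path(s"hdfs:///warehouse/x/part_key=$k/${st.getPath.getName}")
      fs.rename(st.getPath, dst)
    }
  }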

6.  Building a merged dataset that is denormalized for performance: a Spark
program takes the Hive query as a parameter and executes it with the Spark
SQL HiveContext.
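In essence:

  // the denormalizing HQL is passed to the job as an argument
  val mergeHql = args(0)
  // e.g. an INSERT OVERWRITE TABLE ... SELECT ... joining the per-entity tables
  hiveContext.sql(mergeHql)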

7.  I need to build additional datasets and plan to use the approach
detailed in #6.

Can you please critique this and let me know if there is a better way to
build this data pipeline with Spark?
We don't plan to use Hive on MR at all; the tables exist purely so we can
use HQL with Spark SQL.

Thanks.
Vajra