Folks-
I am very new to Spark and Spark-SQL.
Here is what I am doing in my application.
Can you please validate and let me know if there is a better way?
1. Parsing ingested XML files with nested structures into individual
datasets:
- Created a custom input format to split the XML so that each node becomes a
record in my RDD.
- Used the Scala XML libraries with Spark to parse the XML.
2. Creating a partition key for each record and using it to generate a
key-value pair RDD.
3. Saving the distinct partition keys into an array, iterating through it,
creating the Hive partition directories, and issuing the Hive add-partition
command ("ALTER TABLE x ADD PARTITION (...)") via the Spark HiveContext.
4. Saving the RDDs leveraging MultipleOutputs, so each generated file has a
custom filename that includes its partition key.
5. Programmatically moving the generated files into the Hive partitions based
on those custom filenames.
6. Building a merged dataset that is denormalized for performance: a Spark
program takes the Hive query as a parameter and executes it with the Spark
SQL HiveContext.
7. I need to build additional datasets, and I plan to use the approach
detailed in #6.
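To make the parsing and keying steps concrete, here is a minimal sketch of what I mean. The paths, the <date> element, and the key derivation are just placeholders, not my actual schema; assume the custom input format has already split the files so each RDD element is one complete XML node as a string.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.xml.XML

val sc = new SparkContext(new SparkConf().setAppName("xml-ingest"))

// Each element is one whole XML record, courtesy of the custom input format.
val rawRecords = sc.textFile("hdfs:///data/ingested-xml")

// Parse each record with scala.xml and derive a partition key from it,
// producing the key-value pair RDD. The "date" element is a hypothetical
// field; the real key would come from whichever fields drive partitioning.
val keyed = rawRecords.map { s =>
  val node = XML.loadString(s)
  val partitionKey = (node \ "date").text.take(7) // e.g. year-month prefix
  (partitionKey, s)
}
```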
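For the Hive add-partition step, this is roughly what I am doing; the table name, partition column, and key values here are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("add-partitions"))
val hiveContext = new HiveContext(sc)

// Distinct partition keys collected to the driver from the keyed RDD
// (hypothetical values shown here for the sketch).
val partitionKeys = Seq("2015-08", "2015-09")

// Register each key as a Hive partition via the HiveContext.
partitionKeys.foreach { k =>
  hiveContext.sql(s"ALTER TABLE x ADD IF NOT EXISTS PARTITION (pkey='$k')")
}
```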
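And for the merged-dataset step, the Spark program receives the HQL as a parameter and runs it through the same HiveContext; the query below is a made-up example, not the real one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("build-merged"))
val hiveContext = new HiveContext(sc)

// In the real program this query arrives as a parameter.
val mergeQuery =
  """INSERT OVERWRITE TABLE merged
    |SELECT a.*, b.extra FROM x a JOIN y b ON a.id = b.id""".stripMargin

// Executed by Spark SQL; no Hive-on-MR is involved, the plan runs as
// Spark jobs even though the tables are Hive tables.
hiveContext.sql(mergeQuery)
```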
Can you please critique this and let me know if there is a better way to
build this data pipeline using Spark?
We don't plan to use Hive on MR at all; the tables exist purely so we can
leverage HQL with Spark SQL.
Thanks.
Vajra