Folks - I am very new to Spark and Spark SQL. Here is what I am doing in my application. Can you please validate it and let me know if there is a better way?
1. Parsing XML files with nested structures into individual datasets: I created a custom input format that splits the XML so each node becomes a record in my RDD, and used Spark Scala libraries to parse the XML.
2. Creating a partition key for each record and using it to generate a key-value pair RDD.
3. Saving the distinct partition keys into an array, iterating through it to create the Hive partition directories, and issuing the Hive add-partition command ("ALTER TABLE x ADD PARTITION (...)") via the Spark HiveContext.
4. Saving the RDDs with MultipleOutputs, so each generated file gets a custom filename that includes its partition key.
5. Programmatically moving the generated files into the Hive partitions, based on those custom filenames.
6. Building a merged dataset that is denormalized for performance: a Spark program takes a Hive query as a parameter and executes it through the Spark SQL HiveContext.
7. I need to build additional datasets, and plan to use the approach detailed in #6.

Can you please critique this and let me know if there is a better way, using Spark, to build this data pipeline? We don't plan to use Hive on MR at all; the tables exist purely so we can leverage HQL with Spark SQL.

Thanks,
Vajra
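To make steps 2 and 3 concrete, here is a minimal Scala sketch of the per-record partition-key derivation and the ADD PARTITION DDL described above. The table name (events), partition column (dt), and the date-prefix key scheme are my assumptions, not your actual schema; in the real job each generated statement would be passed to HiveContext.sql(...).

```scala
// Hypothetical sketch of steps 2-3. Names (events, dt, eventDate format)
// are illustrative assumptions, not taken from the actual pipeline.
object PartitionDdl {
  // Step 2: derive a partition key from a record field,
  // e.g. "2016-03-15" -> "2016-03" (year-month partitioning assumed).
  def partitionKeyOf(eventDate: String): String = eventDate.take(7)

  // Step 3: one ADD PARTITION statement per distinct key.
  // IF NOT EXISTS keeps the command idempotent across re-runs.
  def addPartitionSql(table: String, key: String): String =
    s"ALTER TABLE $table ADD IF NOT EXISTS PARTITION (dt='$key')"
}

// In the Spark job this would look roughly like:
//   val keyed = records.keyBy(r => PartitionDdl.partitionKeyOf(r.eventDate))
//   keyed.keys.distinct().collect().foreach { k =>
//     hiveContext.sql(PartitionDdl.addPartitionSql("events", k))
//   }
```

The collect() of distinct keys mirrors the "save distinct partition keys into an array and iterate" step; it is safe only while the number of distinct partitions stays small.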