Hi, I have a Spark Scala program created and compiled with Maven. It works fine. It basically does the following:
1. Reads an XML file from an HDFS location
2. Creates a DataFrame on top of what it reads
3. Creates a new DataFrame with some columns renamed, etc.
4. Creates a new DataFrame for the rejected rows (incorrect value for a column)
5. Puts the rejected data into the Hive exception table
6. Puts the valid rows into the main Hive table
7. Nullifies the invalid rows by setting the invalid column to NULL and puts those rows into the main Hive table as well

These steps are currently performed in one method. Ideally I want to break this down as follows:

1. A method that reads the XML file, creates a DataFrame, and creates a new DataFrame on top of the previous one
2. A method that creates a DataFrame on top of the rejected rows using a tmp table
3. A method that puts the invalid rows into the exception table using a tmp table
4. A method that puts the correct rows into the main table, again using a tmp table

I was wondering whether this is the correct approach?

Thanks,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
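For reference, the four-method split described above might be sketched roughly as follows. This is only an illustration: the object and method names, table names, XML row tag, and the validity rule are all assumptions, not taken from the original program.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Sketch of the proposed decomposition; all names here are illustrative.
object XmlLoadJob {

  // 1. Read the XML from HDFS and apply the column renames in one place.
  def readAndTransform(spark: SparkSession, path: String): DataFrame = {
    val raw = spark.read
      .format("com.databricks.spark.xml") // spark-xml package, assumed on the classpath
      .option("rowTag", "record")         // assumed row tag
      .load(path)
    raw.withColumnRenamed("old_name", "new_name") // assumed rename
  }

  // 2. Build a DataFrame of rejected rows via a tmp table, as proposed.
  def rejectedRows(df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("tmp_all_rows")
    df.sparkSession.sql(
      "SELECT * FROM tmp_all_rows WHERE some_col NOT RLIKE '^[0-9]+$'") // assumed validity rule
  }

  // 3. Append the rejected rows to the Hive exception table.
  def writeExceptions(rejected: DataFrame): Unit =
    rejected.write.mode("append").insertInto("mydb.exception_table")

  // 4. Nullify the invalid column on the bad rows, union them with the
  //    valid rows, and append everything to the main Hive table.
  def writeMain(df: DataFrame, rejected: DataFrame): Unit = {
    val valid    = df.except(rejected)
    val repaired = rejected.withColumn("some_col", lit(null).cast("string"))
    valid.union(repaired).write.mode("append").insertInto("mydb.main_table")
  }
}
```

One design note on this shape: returning and passing DataFrames between the methods (as above) keeps each method independently testable, while the tmp-table registration stays an internal detail of the method that needs it.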