Hi,

I have a Spark Scala program created and compiled with Maven. It works
fine. It basically does the following (a rough sketch follows the list):


   1. Reads an XML file from an HDFS location
   2. Creates a DF on top of what it reads
   3. Creates a new DF with some columns renamed, etc.
   4. Creates a new DF for the rejected rows (incorrect value for a column)
   5. Puts the rejected data into the Hive exception table
   6. Puts the valid rows into the Hive main table
   7. Nullifies the invalid rows by setting the invalid column to NULL and
   puts those rows into the main Hive table
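
It looks something like this at the moment (only a sketch; the rowTag, the
HDFS path, the column names, the filter condition and the table names below
are placeholders, not the real ones, and the XML is read with the Databricks
spark-xml package):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

object XmlLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("XmlLoad")
      .enableHiveSupport()
      .getOrCreate()

    // 1-3: read the XML file from HDFS and rename columns
    val rawDF = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")           // placeholder rowTag
      .load("hdfs:///data/in/myfile.xml")   // placeholder path

    val renamedDF = rawDF.withColumnRenamed("old_col", "new_col")

    // 4: rows with an incorrect value in one column are rejected
    val validDF    = renamedDF.filter(col("amount") >= 0)
    val rejectedDF = renamedDF.filter(col("amount") < 0)

    // 5: rejected rows go to the Hive exception table
    rejectedDF.write.mode("append").insertInto("mydb.exception_table")

    // 6: valid rows go to the main Hive table
    validDF.write.mode("append").insertInto("mydb.main_table")

    // 7: invalid rows with the bad column set to NULL also go to the main table
    rejectedDF.withColumn("amount", lit(null).cast("double"))
      .write.mode("append").insertInto("mydb.main_table")
  }
}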

These are currently performed in one method. Ideally I want to break this
down as follows (see the sketch after the list):


   1. A method that reads the XML file and creates the DF, plus the new DF
   on top of the previous DF
   2. A method that creates a DF of the rejected rows using a tmp table
   3. A method that puts the invalid rows into the exception table using the
   tmp table
   4. A method that puts the correct rows into the main table, again using
   the tmp table
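
Something along these lines (again only a sketch; the tmp table name, the
column names and the Hive table names are placeholders):

import org.apache.spark.sql.{DataFrame, SparkSession}

object XmlLoadRefactored {

  // 1: read the XML, create the DF and the renamed DF, register a tmp table
  def readAndTransform(spark: SparkSession, path: String): DataFrame = {
    val rawDF = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")            // placeholder rowTag
      .load(path)
    val df = rawDF.withColumnRenamed("old_col", "new_col")
    df.createOrReplaceTempView("tmp_rows")   // tmp table used by the other methods
    df
  }

  // 2: DF of the rejected rows, built from the tmp table
  def rejectedRows(spark: SparkSession): DataFrame =
    spark.sql("SELECT * FROM tmp_rows WHERE amount < 0")

  // 3: invalid rows into the exception table
  def loadExceptionTable(rejected: DataFrame): Unit =
    rejected.write.mode("append").insertInto("mydb.exception_table")

  // 4: correct rows into the main table, again via the tmp table
  def loadMainTable(spark: SparkSession): Unit =
    spark.sql("SELECT * FROM tmp_rows WHERE amount >= 0")
      .write.mode("append").insertInto("mydb.main_table")
}

The idea is that the first method registers the tmp table once, and the
other three methods just query it and write to Hive.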

I was wondering if this is the correct approach?

Thanks,


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
