I neglected to include the rationale: the assumption is this will be a repeatedly needed process thus a reusable method were helpful. The predicate/input rules that are supported will need to be flexible enough to support the range of input data domains and use cases . For my workflows the predicates are typically sql's.
Am Sa., 2. Mai 2020 um 06:13 Uhr schrieb Stephen Boesch <java...@gmail.com>: > Hi Mich! > I think you can combine the good/rejected into one method that > internally: > > - Create good/rejected df's given an input df and input > rules/predicates to apply to the df. > - Create a third df containing the good rows and the rejected rows > with the bad columns nulled out > - Append/insert the two dfs into their respective hive good/exception > tables > - return value can be a tuple of the (goodDf,exceptionsDf,combinedDf) > or maybe just the (combinedDf,exceptionsDf) > > > Am Sa., 2. Mai 2020 um 06:00 Uhr schrieb Mich Talebzadeh < > mich.talebza...@gmail.com>: > >> >> Hi, >> >> I have a Spark Scala program created and compiled with Maven. It works >> fine. It basically does the following: >> >> >> 1. Reads an xml file from HDFS location >> 2. Creates a DF on top of what it reads >> 3. Creates a new DF with some columns renamed etc >> 4. Creates a new DF for rejected rows (incorrect value for a column) >> 5. Puts rejected data into Hive exception table >> 6. Puts valid rows into Hive main table >> 7. Nullifies the invalid rows by setting the invalid column to NULL >> and puts the rows into the main Hive table >> >> These are currently performed in one method. Ideally I want to break this >> down as follows: >> >> >> 1. A method to read the XML file and creates DF and a new DF on top >> of previous DF >> 2. A method to create a DF on top of rejected rows using t >> 3. A method to put invalid rows into the exception table using tmp >> table >> 4. A method to put the correct rows into the main table again using >> tmp table >> >> I was wondering if this is correct approach? >> >> Thanks, >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable for any monetary damages >> arising from such loss, damage or destruction. >> >> >> >