Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
I neglected to include the rationale: the assumption is this will be a
repeatedly needed process thus a reusable method were helpful.  The
predicate/input rules that are supported will need to be flexible enough to
support the range of input data domains and use cases .  For my workflows
the predicates are typically sql's.

Am Sa., 2. Mai 2020 um 06:13 Uhr schrieb Stephen Boesch :

> Hi Mich!
>I think you can combine the good/rejected into one method that
> internally:
>
>- Create good/rejected df's given an input df and input
>rules/predicates to apply to the df.
>- Create a third df containing the good rows and the rejected rows
>with the bad columns nulled out
>- Append/insert the two dfs into their respective hive good/exception
>tables
>- return value can be a tuple of the (goodDf,exceptionsDf,combinedDf)
>or maybe just the (combinedDf,exceptionsDf)
>
>
> Am Sa., 2. Mai 2020 um 06:00 Uhr schrieb Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>>
>> Hi,
>>
>> I have a Spark Scala program created and compiled with Maven. It works
>> fine. It basically does the following:
>>
>>
>>1. Reads an xml file from HDFS location
>>2. Creates a DF on top of what it reads
>>3. Creates a new DF with some columns renamed etc
>>4. Creates a new DF for rejected rows (incorrect value for a column)
>>5. Puts rejected data into Hive exception table
>>6. Puts valid rows into Hive main table
>>7. Nullifies the invalid rows by setting the invalid column to NULL
>>and puts the rows into the main Hive table
>>
>> These are currently performed in one method. Ideally I want to break this
>> down as follows:
>>
>>
>>1. A method to read the XML file and creates DF and a new DF on top
>>of previous DF
>>2. A method to create a DF on top of rejected rows using t
>>3. A method to put invalid rows into the exception table using tmp
>>table
>>4. A method to put the correct rows into the main table again using
>>tmp table
>>
>> I was wondering if this is correct approach?
>>
>> Thanks,
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
Hi Mich!
   I think you can combine the good/rejected into one method that
internally:

   - Create good/rejected df's given an input df and input rules/predicates
   to apply to the df.
   - Create a third df containing the good rows and the rejected rows with
   the bad columns nulled out
   - Append/insert the two dfs into their respective hive good/exception
   tables
   - return value can be a tuple of the (goodDf,exceptionsDf,combinedDf)
   or maybe just the (combinedDf,exceptionsDf)


Am Sa., 2. Mai 2020 um 06:00 Uhr schrieb Mich Talebzadeh <
mich.talebza...@gmail.com>:

>
> Hi,
>
> I have a Spark Scala program created and compiled with Maven. It works
> fine. It basically does the following:
>
>
>1. Reads an xml file from HDFS location
>2. Creates a DF on top of what it reads
>3. Creates a new DF with some columns renamed etc
>4. Creates a new DF for rejected rows (incorrect value for a column)
>5. Puts rejected data into Hive exception table
>6. Puts valid rows into Hive main table
>7. Nullifies the invalid rows by setting the invalid column to NULL
>and puts the rows into the main Hive table
>
> These are currently performed in one method. Ideally I want to break this
> down as follows:
>
>
>1. A method to read the XML file and creates DF and a new DF on top of
>previous DF
>2. A method to create a DF on top of rejected rows using t
>3. A method to put invalid rows into the exception table using tmp
>table
>4. A method to put the correct rows into the main table again using
>tmp table
>
> I was wondering if this is correct approach?
>
> Thanks,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>