Re: Modularising Spark/Scala program

Stephen Boesch Sat, 02 May 2020 06:17:21 -0700

I neglected to include the rationale: the assumption is this will be a
repeatedly needed process thus a reusable method were helpful.  The
predicate/input rules that are supported will need to be flexible enough to
support the range of input data domains and use cases .  For my workflows
the predicates are typically sql's.


Am Sa., 2. Mai 2020 um 06:13 Uhr schrieb Stephen Boesch <java...@gmail.com>:

> Hi Mich!
>    I think you can combine the good/rejected into one method that
> internally:
>
>    - Create good/rejected df's given an input df and input
>    rules/predicates to apply to the df.
>    - Create a third df containing the good rows and the rejected rows
>    with the bad columns nulled out
>    - Append/insert the two dfs into their respective hive good/exception
>    tables
>    - return value can be a tuple of the (goodDf,exceptionsDf,combinedDf)
>    or maybe just the (combinedDf,exceptionsDf)
>
>
> Am Sa., 2. Mai 2020 um 06:00 Uhr schrieb Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>>
>> Hi,
>>
>> I have a Spark Scala program created and compiled with Maven. It works
>> fine. It basically does the following:
>>
>>
>>    1. Reads an xml file from HDFS location
>>    2. Creates a DF on top of what it reads
>>    3. Creates a new DF with some columns renamed etc
>>    4. Creates a new DF for rejected rows (incorrect value for a column)
>>    5. Puts rejected data into Hive exception table
>>    6. Puts valid rows into Hive main table
>>    7. Nullifies the invalid rows by setting the invalid column to NULL
>>    and puts the rows into the main Hive table
>>
>> These are currently performed in one method. Ideally I want to break this
>> down as follows:
>>
>>
>>    1. A method to read the XML file and creates DF and a new DF on top
>>    of previous DF
>>    2. A method to create a DF on top of rejected rows using t
>>    3. A method to put invalid rows into the exception table using tmp
>>    table
>>    4. A method to put the correct rows into the main table again using
>>    tmp table
>>
>> I was wondering if this is correct approach?
>>
>> Thanks,
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>

Re: Modularising Spark/Scala program

Reply via email to