Please click on unnamed text/html  link for better view

On Tue, Aug 29, 2017 at 8:11 PM purna pradeep <purna2prad...@gmail.com>
wrote:

>
> ---------- Forwarded message ---------
> From: Mamillapalli, Purna Pradeep <
> purnapradeep.mamillapa...@capitalone.com>
> Date: Tue, Aug 29, 2017 at 8:08 PM
> Subject: Spark question
> To: purna pradeep <purna2prad...@gmail.com>
>
> Below is the input Dataframe(In real this is a very large Dataframe)
>
>
>
> EmployeeID
>
> INCOME
>
> INCOME AGE TS
>
> JoinDate
>
> Dept
>
> 101
>
> 19000
>
> 4/20/17
>
> 4/20/13
>
> ES
>
> 101
>
> 10000
>
> 10/3/15
>
> 4/20/13
>
> OS
>
> 102
>
> 13000
>
> 5/9/17
>
> 4/20/12
>
> DS
>
> 102
>
> 12000
>
> 5/8/17
>
> 4/20/12
>
> CS
>
> 103
>
> 10000
>
> 5/9/17
>
> 4/20/10
>
> EQ
>
> 103
>
> 9000
>
> 5/8/15
>
> 4/20/10
>
> MD
>
> Get the latest income of an employee which has  Income_age ts <10 months
>
> Expected output Dataframe
>
> EmployeeID
>
> INCOME
>
> INCOME AGE TS
>
> JoinDate
>
> Dept
>
> 101
>
> 19000
>
> 4/20/17
>
> 4/20/13
>
> ES
>
> 102
>
> 13000
>
> 5/9/17
>
> 4/20/12
>
> DS
>
> 103
>
> 10000
>
> 5/9/17
>
> 4/20/10
>
> EQ
>
>
>





Below is what im planning to implement
>
>
>
> case class employee (*EmployeeID*: Int, *INCOME*: Int, INCOMEAGE: Int,
> *JOINDATE*: Int,DEPT:String)
>
>
>
> *val *empSchema = *new *StructType().add(*"EmployeeID"*,*"Int"*).add(
> *"INCOME"*, *"Int"*).add(*"INCOMEAGE"*,*"Date"*) . add(*"JOINDATE"*,
> *"Date"*). add(*"DEPT"*,*"String"*)
>
>
>
> *//Reading from the File **import *sparkSession.implicits._
>
> *val *readEmpFile = sparkSession.read
>   .option(*"sep"*, *","*)
>   .schema(empSchema)
>   .csv(INPUT_DIRECTORY)
>
>
> *//Create employee DataFrame **val *custDf = readEmpFile.as[employee]
>
>
> *//Adding Salary Column **val *groupByDf = custDf.groupByKey(a => a.*
> EmployeeID*)
>
>
> *val *k = groupByDf.mapGroups((key,value) => performETL(value))
>
>
>
>
>
> *def *performETL(empData: Iterator[employee]) : new employee  = {
>
>   *val *empList = empData.toList
>
>
> *//calculate income has Logic to figureout latest income for an account
> and returns latest income   val *income = calculateIncome(empList)
>
>
>   *for *(i <- empList) {
>
>       *val *row = i
>
> *return new *employee(row.EmployeeID, row.INCOMEAGE , income)
>   }
>   *return  "Done"*
>
>
>
> }
>
>
>
> Is this a better approach or even the right approach to implement the
> same.If not please suggest a better way to implement the same?
>
>
>
> ------------------------------
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>

Reply via email to