Well, if you still want to use window functions for your logic, then you need
to derive a new column, like "catalog", and use it as part of your grouping
logic.


Maybe you can use a regex to derive this new column. The implementation
depends on your data in "transactiondescription", and a regex gives you
the most flexible way to handle it.
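
For example, a minimal sketch of what I mean (the regex pattern and the derived
"catalog" column are only illustrative; you would need to adapt the pattern to
your actual descriptions, and the $ column syntax assumes the usual spark-shell
implicits, as in the rest of this thread):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// derive a "catalog" column from the description; the pattern below is just a
// placeholder that grabs the leading word after an optional SACAT prefix
val withCatalog = ll_18740868
  .withColumn("catalog",
    regexp_extract($"transactiondescription", "^\\s*(SACAT\\s+)?([A-Z']+)", 2))

// then partition/group on the derived column
val wSpec = Window.partitionBy("catalog")

withCatalog
  .where($"transactiontype" === "DEB")
  .select($"catalog", sum($"debitamount").over(wSpec).as("Spent"))
  .distinct
  .orderBy($"Spent")
  .show(false)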


This is really not a Spark question, but a question of how you implement your 
logic based on the data you are given.


Yong


________________________________
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: Tuesday, August 2, 2016 10:00 AM
To: user @spark
Subject: Extracting key word from a textual column

Hi,

Need some ideas.

Summary:

I am working on a tool to slice and dice the amount of money I have spent so 
far (meaning the whole data sample) on a given retailer, so I have a better idea 
of where I am wasting the money.

Approach

I downloaded my bank statements for a given account in CSV format, from 
inception till the end of July, then read the data and stored it in an ORC table.
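
As a rough sketch, the load step looks something like this (assuming Spark 2.x 
with the built-in csv source; the file path is illustrative):

val stmts = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("statements.csv")   // path is illustrative

stmts.write.format("orc").saveAsTable("accounts.ll_18740868")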

I am interested in all the bills that I paid using a debit card (transactiontype = 
"DEB") that come out of the account directly. Transactiontype is a three-character 
lookup code that I downloaded as well.

scala> ll_18740868.printSchema
root
 |-- transactiondate: date (nullable = true)
 |-- transactiontype: string (nullable = true)
 |-- sortcode: string (nullable = true)
 |-- accountnumber: string (nullable = true)
 |-- transactiondescription: string (nullable = true)
 |-- debitamount: double (nullable = true)
 |-- creditamount: double (nullable = true)
 |-- balance: double (nullable = true)

The important fields are transactiondate, transactiontype, 
transactiondescription and debitamount

So using analytics/windowing I can do all sorts of things. For example, this 
one gives me the last time I spent money on retailer XYZ and the amount:

SELECT *
FROM (
      select transactiondate, transactiondescription, debitamount
      , rank() over (order by transactiondate desc) AS rank
      from accounts.ll_18740868 where transactiondescription like '%XYZ%'
     ) tmp
where rank <= 1

And its equivalent using windowing in FP:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))

ll_18740868
  .filter(col("transactiondescription").contains("XYZ"))
  .select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank"))
  .filter($"rank" === 1)
  .show


+---------------+----------------------+----+
|transactiondate|transactiondescription|rank|
+---------------+----------------------+----+
|     2015-12-15|  XYZ LTD CD 4636 |   1|
+---------------+----------------------+----+

So far so good. But if I want to find the total I have spent on each retailer, 
then it gets trickier, as a retailer appears in the transactiondescription 
column like this:

ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
+----------------------+
|transactiondescription|
+----------------------+
|  SAINSBURYS SMKT C...|
|  SACAT SAINSBURYS ...|
|  SAINSBURY'S SMKT ...|
|  SAINSBURYS S/MKT ...|
|  SACAT SAINSBURYS ...|
+----------------------+

If I look at them I know they all belong to SAINSBURYS (the food retailer). I have 
done some crude grouping and it works, somehow:

// define a UDF here to handle the substring
val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }

var cutoff = "CD"  // currently used in the statement

// partition on everything before the "CD" marker in the description
val wSpec2 = Window.partitionBy(
  SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1))

ll_18740868
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .select(
    SubstrUDF($"transactiondescription", lit(0),
      instr($"transactiondescription", cutoff) - 1).as("Retailer"),
    sum($"debitamount").over(wSpec2).as("Spent"))
  .distinct
  .orderBy($"Spent")
  .collect
  .foreach(println)
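
(For reference, a minimal sketch of the same per-retailer total written with a 
plain groupBy/agg instead of a window plus distinct; it still relies on the same 
crude CD-based prefix as the key, so it is only an illustration, not a solution 
to the keyword-extraction problem:)

ll_18740868
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .groupBy(SubstrUDF($"transactiondescription", lit(0),
    instr($"transactiondescription", cutoff) - 1).as("Retailer"))
  .agg(sum($"debitamount").as("Spent"))
  .orderBy($"Spent")
  .show(false)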

However, I really need to extract the "keyword" retailer name from the 
transactiondescription column, and I need some ideas about the best way of doing 
it. Is this possible in Spark?


Thanks

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.

