There’s a months_between function you could use, as well:

df.filter(months_between(current_date(), col("Payment date")) > 6).show
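
Note that "Payment date" is read in as a string in dd/MM/yyyy format, so it needs converting to a proper date before months_between can compare it with current_date(). A minimal sketch of one way to do that on Spark 1.6 (the unix_timestamp/from_unixtime/to_date chain is my assumption, not the only option):

import org.apache.spark.sql.functions._

// "Payment date" is a dd/MM/yyyy string; parse it to epoch seconds,
// render it as a standard timestamp string, then cast to a date
val paymentDate = to_date(from_unixtime(unix_timestamp(col("Payment date"), "dd/MM/yyyy")))

// keep rows whose payment date lies more than 6 months before today
df.filter(months_between(current_date(), paymentDate) > 6).show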

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Monday, March 21, 2016 at 5:53 PM
To: "user @spark" <user@spark.apache.org>
Subject: Work out date column in CSV more than 6 months old (datediff or something)

Hi,

For test purposes I am reading in a simple CSV file as follows:

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]

For this work I am interested in rows where the "Payment date" column is more than 6 months old as of today.

Data is stored in the following format for that column

scala> df.select("Payment date").take(2)
res40: Array[org.apache.spark.sql.Row] = Array([10/02/2014], [17/02/2014])

stored as 'dd/MM/yyyy'

I get the current date as follows:

scala> val today = sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy')").collect.apply(0).getString(0)
today: String = 21/03/2016


So I want to filter the CSV file:

scala>  df.filter(col("Payment date") < lit(today)).show(2)
+--------------+------------+---------+-----+---------+
|Invoice Number|Payment date|      Net|  VAT|    Total|
+--------------+------------+---------+-----+---------+
|           360|  10/02/2014|£2,500.00|£0.00|£2,500.00|
|           361|  17/02/2014|£2,500.00|£0.00|£2,500.00|
+--------------+------------+---------+-----+---------+


However, I want to use the datediff() function here, not just "< today"! (A plain string comparison on 'dd/MM/yyyy' values is lexicographic in any case, so it does not order dates correctly in general.)
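
Something like this sketch is what I have in mind, assuming the dd/MM/yyyy string is first converted to a date, and taking 183 days as roughly 6 months since datediff() counts whole days:

import org.apache.spark.sql.functions._

// parse the dd/MM/yyyy string into a proper date column
val paymentDate = to_date(from_unixtime(unix_timestamp(col("Payment date"), "dd/MM/yyyy")))

// datediff(end, start) returns days; ~183 days stands in for 6 months
df.filter(datediff(current_date(), paymentDate) > 183).show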


Obviously one can store the file as a table and use SQL on it. However, I want to see if there are other ways using the functional programming (DataFrame) API.

Thanks

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com/

