On 22 March 2016 at 10:57, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Thanks Silvio.
>
> The problem I have is that somehow the string comparison does not work.
>
> Case in point:
>
> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
>
> val current_date = sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy') ").collect.apply(0).getString(0)
>
> df.filter(lit(current_date) < col("Payment date")).select(lit(current_date).alias("current_date"), col("Payment date").alias("PaymentDate")).show(5)
>

This is doing a string comparison, not a date comparison (assuming "Payment date" is of type String). E.g.:

scala> "22/03/2016" < "24/02/2015"
res4: Boolean = true

scala> "22/03/2016" < "04/02/2015"
res5: Boolean = false

This is the correct result for a string comparison, but it is not the comparison you want. I think you need to convert "Payment date" with to_date and compare against that, e.g. something like:

df.filter(current_date() < to_date(col("Payment date")))
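One caveat: since "Payment date" is a dd/MM/yyyy string, I don't think to_date will parse it on its own (it expects yyyy-MM-dd), so you would probably need to go through unix_timestamp with an explicit format first. A rough, untested sketch of what I mean (the payment_dt column name is just for illustration):

import org.apache.spark.sql.functions._

// Parse the dd/MM/yyyy string into a proper date column first
val parsed = df.withColumn("payment_dt",
  to_date(from_unixtime(unix_timestamp(col("Payment date"), "dd/MM/yyyy"))))

// Rows whose payment date is strictly before today
parsed.filter(col("payment_dt") < current_date()).show(5)

// Or, along the lines of Silvio's suggestion, rows more than 6 months old
parsed.filter(months_between(current_date(), col("payment_dt")) > 6).show(5)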
Regards,

James

> It selects all the rows that are less than today's date (they are old).
>
> +------------+-----------+
> |current_date|PaymentDate|
> +------------+-----------+
> |  22/03/2016| 24/02/2014|
> |  22/03/2016| 24/03/2014|
> |  22/03/2016| 31/03/2015|
> |  22/03/2016| 28/04/2014|
> |  22/03/2016| 26/05/2014|
> +------------+-----------+
>
> I don't know why this comparison is failing. Maybe it is comparing the first two leftmost characters?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 22 March 2016 at 00:26, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>
>> There's a months_between function you could use, as well:
>>
>> df.filter(months_between(current_date, $"Payment Date") > 6).show
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date: Monday, March 21, 2016 at 5:53 PM
>> To: "user @spark" <user@spark.apache.org>
>> Subject: Work out date column in CSV more than 6 months old (datediff or something)
>>
>> Hi,
>>
>> For test purposes I am reading in a simple csv file as follows:
>>
>> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
>> df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]
>>
>> For this work I am interested in rows where the "Payment date" column is more than 6 months old, counted from today.
>>
>> Data is stored in the following format for that column:
>>
>> scala> df.select("Payment date").take(2)
>> res40: Array[org.apache.spark.sql.Row] = Array([10/02/2014], [17/02/2014])
>>
>> i.e. stored as 'dd/MM/yyyy'.
>>
>> The current date I get as:
>>
>> scala> val today = sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy') ").collect.apply(0).getString(0)
>> today: String = 21/03/2016
>>
>> So I want to filter the csv file:
>>
>> scala> df.filter(col("Payment date") < lit(today)).show(2)
>> +--------------+------------+---------+-----+---------+
>> |Invoice Number|Payment date|      Net|  VAT|    Total|
>> +--------------+------------+---------+-----+---------+
>> |           360|  10/02/2014|?2,500.00|?0.00|?2,500.00|
>> |           361|  17/02/2014|?2,500.00|?0.00|?2,500.00|
>> +--------------+------------+---------+-----+---------+
>>
>> However, I want to use the datediff() function here, not just < today!
>>
>> Obviously one can store the file as a table and use SQL on it. However, I want to see if there are other ways using fp.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com