On 22 March 2016 at 10:57, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Thanks Silvio.
>
> The problem I have is that somehow the string comparison does not work.
>
> Case in point:
>
> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
>
> val current_date = sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy') ").collect.apply(0).getString(0)
>
> df.filter(lit(current_date) < col("Payment date")).select(lit(current_date).alias("current_date"), col("Payment date").alias("PaymentDate")).show(5)
>

This is doing a string comparison, not a date comparison (assuming "Payment date" is of type String). E.g.:

scala> "22/03/2016" < "24/02/2015"
res4: Boolean = true

scala> "22/03/2016" < "04/02/2015"
res5: Boolean = false

This is the correct result for a string comparison, but it is not the comparison you want. I think you need to convert "Payment date" with to_date and compare against that, e.g. something like:

df.filter(current_date() < to_date(col("Payment date")))
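One caveat: since "Payment date" is a dd/MM/yyyy string, I don't think to_date will parse it on its own (it expects yyyy-MM-dd), so you would probably need to go through unix_timestamp with an explicit format first. A rough, untested sketch of what I mean (the payment_dt column name is just for illustration):

import org.apache.spark.sql.functions._

// Parse the dd/MM/yyyy string into a proper date column first
val parsed = df.withColumn("payment_dt",
  to_date(from_unixtime(unix_timestamp(col("Payment date"), "dd/MM/yyyy"))))

// Rows whose payment date is strictly before today
parsed.filter(col("payment_dt") < current_date()).show(5)

// Or, along the lines of Silvio's suggestion, rows more than 6 months old
parsed.filter(months_between(current_date(), col("payment_dt")) > 6).show(5)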
Regards,

James

> It selects all the rows that are less than today's date (they are old).
>
> +------------+-----------+
> |current_date|PaymentDate|
> +------------+-----------+
> |  22/03/2016| 24/02/2014|
> |  22/03/2016| 24/03/2014|
> |  22/03/2016| 31/03/2015|
> |  22/03/2016| 28/04/2014|
> |  22/03/2016| 26/05/2014|
> +------------+-----------+
>
> I don't know why this comparison is failing. Maybe it is comparing the first two leftmost characters?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 22 March 2016 at 00:26, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>
>> There's a months_between function you could use, as well:
>>
>> df.filter(months_between(current_date, $"Payment Date") > 6).show
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date: Monday, March 21, 2016 at 5:53 PM
>> To: "user @spark" <user@spark.apache.org>
>> Subject: Work out date column in CSV more than 6 months old (datediff or something)
>>
>> Hi,
>>
>> For test purposes I am reading in a simple csv file as follows:
>>
>> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
>> df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]
>>
>> For this work I am interested in rows where the "Payment date" column is more than 6 months old, counted from today.
>>
>> Data is stored in the following format for that column:
>>
>> scala> df.select("Payment date").take(2)
>> res40: Array[org.apache.spark.sql.Row] = Array([10/02/2014], [17/02/2014])
>>
>> i.e. stored as 'dd/MM/yyyy'.
>>
>> The current date I get as:
>>
>> scala> val today = sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy') ").collect.apply(0).getString(0)
>> today: String = 21/03/2016
>>
>> So I want to filter the csv file:
>>
>> scala> df.filter(col("Payment date") < lit(today)).show(2)
>> +--------------+------------+---------+-----+---------+
>> |Invoice Number|Payment date|      Net|  VAT|    Total|
>> +--------------+------------+---------+-----+---------+
>> |           360|  10/02/2014|?2,500.00|?0.00|?2,500.00|
>> |           361|  17/02/2014|?2,500.00|?0.00|?2,500.00|
>> +--------------+------------+---------+-----+---------+
>>
>> However, I want to use the datediff() function here, not just < today!
>>
>> Obviously one can store the file as a table and use SQL on it. However, I want to see if there are other ways using fp.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com