Apologies, I sent that too early by accident. The actual message is below.
========================================================
A dataframe has two date columns (DateType) and I would like to add another column holding the difference, in days, between these two dates. A snippet of the dataframe is below:

new_df.show(5)

+-----------+----------+--------------+
|      PATID|   SVCDATE|next_diag_date|
+-----------+----------+--------------+
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-27|
+-----------+----------+--------------+

Here is what I have tried so far:

-> new_df.withColumn('SVCDATE2',
       (new_df.next_diag_date - new_df.SVCDATE)).show()
   Error: DateType does not support numeric operations

-> new_df.withColumn('SVCDATE2',
       (new_df.next_diag_date - new_df.SVCDATE).days).show()
   Error: Can't extract value from (next_diag_date#927 - SVCDATE#377)

However, this simple Python code works fine alongside PySpark:

from datetime import date
d0 = date(2008, 8, 18)
d1 = date(2008, 9, 26)
delta = d0 - d1
print (d0 - d1).days  # -39

Any suggestions would be appreciated!

Also, is there a way to add a new column to a dataframe without using a column expression, the way pandas or R allow (e.g. df$new_col = 'new col value')?

Thanks,
Dhaval

On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel <dhaval1...@gmail.com> wrote:
> new_df.withColumn('SVCDATE2',
>     (new_df.next_diag_date - new_df.SVCDATE).days).show()
>
> +-----------+----------+--------------+
> |      PATID|   SVCDATE|next_diag_date|
> +-----------+----------+--------------+
> |12345655545|2012-02-13|    2012-02-13|
> |12345655545|2012-02-13|    2012-02-13|
> |12345655545|2012-02-13|    2012-02-27|
> +-----------+----------+--------------+
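For reference, one likely route here (a sketch, not tested against this exact dataframe): newer Spark releases expose a built-in `datediff(end, start)` in `pyspark.sql.functions` that returns the day difference between two date columns, and `lit(...)` covers the pandas/R-style constant column. The Spark calls are shown as comments since they need a live SparkSession; the pure-Python part below just verifies the expected day arithmetic on the sample rows above:

```python
from datetime import date

# Assumed Spark equivalents (pyspark.sql.functions):
#   import pyspark.sql.functions as F
#   # day difference between the two date columns:
#   new_df = new_df.withColumn(
#       'SVCDATE2', F.datediff(new_df.next_diag_date, new_df.SVCDATE))
#   # constant-valued column (the pandas/R-style question):
#   new_df = new_df.withColumn('new_col', F.lit('new col value'))

# Same day arithmetic, computed in plain Python on the sample rows:
rows = [
    (date(2012, 2, 13), date(2012, 2, 13)),
    (date(2012, 2, 13), date(2012, 2, 13)),
    (date(2012, 2, 13), date(2012, 2, 27)),
]
diffs = [(nxt - svc).days for svc, nxt in rows]
print(diffs)  # [0, 0, 14]
```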