More of an update on this question... I am using Spark 1.4.1. I was just reading the documentation for Spark 1.5 (still in development) and I think there will be a new function, *datediff*, that will solve the issue. So please let me know if there is any work-around until Spark 1.5 is out :).
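In case it helps in the meantime, one possible work-around on 1.4 is to compute the day difference in plain Python and wrap it as a UDF. A rough sketch (assumes both columns are DateType; the UDF wiring is shown but untested against the dataframe below):

```python
from datetime import date


def day_diff(start, end):
    """Return the number of whole days from start to end.

    Both arguments are datetime.date objects (what a DateType
    column yields inside a UDF); returns None if either is NULL.
    """
    if start is None or end is None:
        return None
    return (end - start).days


# Example: 2012-02-13 to 2012-02-27 is 14 days.
print(day_diff(date(2012, 2, 13), date(2012, 2, 27)))

# Wiring it into the dataframe would look roughly like this
# (standard pyspark.sql.functions.udf pattern, not tested here):
#
# from pyspark.sql.functions import udf
# from pyspark.sql.types import IntegerType
#
# day_diff_udf = udf(day_diff, IntegerType())
# new_df = new_df.withColumn(
#     'SVCDATE2', day_diff_udf(new_df.SVCDATE, new_df.next_diag_date))
```

The direct column subtraction fails because DateType has no numeric operations in 1.4; pushing the arithmetic into a Python UDF sidesteps that at the cost of serializing rows through Python.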
pyspark.sql.functions.datediff(*end*, *start*) [source]
<http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/_modules/pyspark/sql/functions.html#datediff>
<http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/pyspark.sql.html#pyspark.sql.functions.datediff>

    Returns the number of days from start to end.

    >>> df = sqlContext.createDataFrame([('2015-04-08', '2015-05-10')], ['d1', 'd2'])
    >>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
    [Row(diff=32)]

    New in version 1.5.

On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel <dhaval1...@gmail.com> wrote:

> Apologies, sent too early accidentally. Actual message is below.
> ========================================================
>
> A dataframe has 2 date columns (datetime type) and I would like to add
> another column that would hold the difference between these two dates.
> A dataframe snippet is below.
>
> new_df.show(5)
> +-----------+----------+--------------+
> |      PATID|   SVCDATE|next_diag_date|
> +-----------+----------+--------------+
> |12345655545|2012-02-13|    2012-02-13|
> |12345655545|2012-02-13|    2012-02-13|
> |12345655545|2012-02-13|    2012-02-27|
> +-----------+----------+--------------+
>
> Here is what I have tried so far:
>
> -> new_df.withColumn('SVCDATE2',
>        (new_df.next_diag_date - new_df.SVCDATE)).show()
>    Error: DateType does not support numeric operations
>
> -> new_df.withColumn('SVCDATE2',
>        (new_df.next_diag_date - new_df.SVCDATE).days).show()
>    Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);
>
> However, this simple Python code works fine with PySpark:
>
> from datetime import date
> d0 = date(2008, 8, 18)
> d1 = date(2008, 9, 26)
> delta = d0 - d1
> print (d0 - d1).days
> # -39
>
> Any suggestions would be appreciated! Also, is there a way to add a new
> column to a dataframe without using a column expression (e.g. as in
> pandas or R: df$new_col = 'new col value')?
>
> Thanks,
> Dhaval
>
> On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel <dhaval1...@gmail.com> wrote:
>
>> new_df.withColumn('SVCDATE2',
>>     (new_df.next_diag_date - new_df.SVCDATE).days).show()
>>
>> +-----------+----------+--------------+
>> |      PATID|   SVCDATE|next_diag_date|
>> +-----------+----------+--------------+
>> |12345655545|2012-02-13|    2012-02-13|
>> |12345655545|2012-02-13|    2012-02-13|
>> |12345655545|2012-02-13|    2012-02-27|
>> +-----------+----------+--------------+