Hi, hope this will help you:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import java.sql.Timestamp
import org.joda.time.{DateTime, Days}

val df = sc.parallelize(Array((date1, date2))).toDF("day1", "day2")

// UDF computing the number of days between two timestamps via Joda-Time
val dateDiff = udf[Long, Timestamp, Timestamp]((value1, value2) =>
  Days.daysBetween(new DateTime(value2.getTime), new DateTime(value1.getTime)).getDays)

df.withColumn("diff", dateDiff(df("day2"), df("day1"))).show()

Or you can write a SQL query using HiveQL's datediff function:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

On Thu, Aug 20, 2015 at 4:57 PM, Dhaval Patel <dhaval1...@gmail.com> wrote:

> More update on this question......I am using Spark 1.4.1.
>
> I was just reading the documentation of Spark 1.5 (still in development)
> and I think there will be a new function *datediff* that will solve the
> issue. So please let me know if there is any work-around until Spark 1.5
> is out :).
>
> pyspark.sql.functions.datediff(*end*, *start*)[source]
> <http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/_modules/pyspark/sql/functions.html#datediff>
> <http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/pyspark.sql.html#pyspark.sql.functions.datediff>
>
> Returns the number of days from start to end.
>
> >>> df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])
> >>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
> [Row(diff=32)]
>
> New in version 1.5.
>
> On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel <dhaval1...@gmail.com> wrote:
>
>> Apologies, sent too early accidentally. Actual message is below
>> ========================================================
>>
>> A dataframe has 2 date columns (datetime type) and I would like to add
>> another column that would have the difference between these two dates.
>> Dataframe snippet is below.
>>
>> new_df.show(5)
>> +-----------+----------+--------------+
>> |      PATID|   SVCDATE|next_diag_date|
>> +-----------+----------+--------------+
>> |12345655545|2012-02-13|    2012-02-13|
>> |12345655545|2012-02-13|    2012-02-13|
>> |12345655545|2012-02-13|    2012-02-27|
>> +-----------+----------+--------------+
>>
>> Here is what I have tried so far:
>>
>> -> new_df.withColumn('SVCDATE2', (new_df.next_diag_date - new_df.SVCDATE)).show()
>> Error: DateType does not support numeric operations
>>
>> -> new_df.withColumn('SVCDATE2', (new_df.next_diag_date - new_df.SVCDATE).days).show()
>> Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);
>>
>> However this simple Python code works fine with PySpark:
>>
>> from datetime import date
>> d0 = date(2008, 8, 18)
>> d1 = date(2008, 9, 26)
>> delta = d0 - d1
>> print (d0 - d1).days
>> # -39
>>
>> Any suggestions would be appreciated! Also, is there a way to add a new
>> column to a dataframe without using a column expression (e.g. like in
>> pandas or R: df$new_col = 'new col value')?
>>
>> Thanks,
>> Dhaval
>>
>> On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel <dhaval1...@gmail.com> wrote:
>>
>>> new_df.withColumn('SVCDATE2', (new_df.next_diag_date - new_df.SVCDATE).days).show()
>>>
>>> +-----------+----------+--------------+
>>> |      PATID|   SVCDATE|next_diag_date|
>>> +-----------+----------+--------------+
>>> |12345655545|2012-02-13|    2012-02-13|
>>> |12345655545|2012-02-13|    2012-02-13|
>>> |12345655545|2012-02-13|    2012-02-27|
>>> +-----------+----------+--------------+
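For anyone stuck on 1.4.x until datediff lands in 1.5, here is a minimal sketch of one work-around: a plain-Python helper that computes days from start to end (the helper name days_between is my own, not a Spark API), which could then be wrapped as a PySpark UDF. Only the commented-out lines touch Spark itself; the rest runs anywhere.

```python
from datetime import date

def days_between(end, start):
    # Days from start to end, matching the semantics of the
    # upcoming pyspark.sql.functions.datediff(end, start).
    return (end - start).days

# Sketch of wiring it into a 1.4.x DataFrame as a UDF (assumes the
# new_df / SVCDATE / next_diag_date names from the question):
#
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import IntegerType
#   date_diff = udf(days_between, IntegerType())
#   new_df.withColumn('diff',
#                     date_diff(new_df.next_diag_date, new_df.SVCDATE)).show()

print(days_between(date(2015, 5, 10), date(2015, 4, 8)))   # 32, as in the 1.5 docs example
```

The argument order (end first, start second) mirrors the 1.5 datediff signature quoted above, so switching over later should be a drop-in change.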