In my first attempt, I actually tried using case classes and then putting them 
into a Dataset. Scala doesn't seem to have a native date/time data type, so I 
still wound up doing some sort of conversion when I put the data into the 
Dataset, because I had to define the column as a string. Is that right? Is it 
not possible to create a case class with a field of type date or timestamp?
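
For illustration, this is roughly the kind of conversion I mean (the case 
class, column names, and file path here are made up):

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

// Hypothetical case class with a timestamp field
case class Trip(tripId: Int, startTime: Timestamp, startStation: String)

val spark = SparkSession.builder.appName("example").getOrCreate()
import spark.implicits._

// The CSV columns come in as strings, so I had to cast them before
// the rows would line up with the Timestamp field in the case class.
val trips = spark.read
  .option("header", "true")
  .csv("data/trips.csv")
  .select(
    $"tripId".cast("int").as("tripId"),
    $"startTime".cast("timestamp").as("startTime"),
    $"startStation"
  )
  .as[Trip]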

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData


From: Nicholas Hakobian [mailto:nicholas.hakob...@rallyhealth.com]
Sent: Tuesday, October 3, 2017 1:04 PM
To: Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
Cc: user@spark.apache.org
Subject: Re: how do you deal with datetime in Spark?

I'd suggest first converting the string containing your date/time to a 
TimestampType or a DateType; the built-in functions for year, month, day, 
etc. will then work as expected. If your date is in a "standard" format, you 
can perform the conversion just by casting the column to a date or timestamp 
type. The formats it can auto-convert are listed at this link:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L270-L295
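
For example, a minimal sketch in Scala, assuming a DataFrame df with a string 
column named start_time in a standard format (the column names are made up):

import org.apache.spark.sql.functions._
// (assumes import spark.implicits._ for the $ column syntax)

// Casting the string column directly yields a TimestampType column,
// after which year/month/hour etc. behave as expected.
val withTs = df.withColumn("start_ts", $"start_time".cast("timestamp"))

withTs
  .groupBy(year($"start_ts").as("yr"), month($"start_ts").as("mo"))
  .count()
  .show()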

If casting won't work, you can convert manually by specifying a format string 
with the following built-in function:
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.unix_timestamp

The format string uses Java's SimpleDateFormat patterns, if I remember 
correctly 
(http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
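
For example (again a sketch, assuming a column start_time holding values like 
"10/03/2017 01:04 PM"):

import org.apache.spark.sql.functions._

// unix_timestamp parses the string with the given SimpleDateFormat pattern,
// returning seconds since the epoch; casting back gives a TimestampType column.
val parsed = df.withColumn(
  "start_ts",
  unix_timestamp($"start_time", "MM/dd/yyyy hh:mm a").cast("timestamp")
)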

Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Tue, Oct 3, 2017 at 10:43 AM, Adaryl Wakefield 
<adaryl.wakefi...@hotmail.com> wrote:
I gave myself a project to start actually writing Spark programs. I'm using 
Scala and Spark 2.2.0. In my project, I had to do some grouping and filtering 
by dates. It was awful and took forever. I was trying to use DataFrames and SQL 
as much as possible. I see that there are date functions in the DataFrame API, 
but trying to use them was frustrating. Even following code samples was a 
headache, because apparently the code differs depending on which version of 
Spark you are using. I was really hoping for a rich set of date functions like 
you'd find in T-SQL, but I never really found them.

Is there a best practice for dealing with dates and times in Spark? I feel like 
taking a date/time string, converting it to a date/time object, and then 
manipulating data based on the various components of the timestamp 
(hour, day, year, etc.) should be a heck of a lot easier than what I'm finding, 
so perhaps I'm just not looking in the right place.
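
What I was hoping for was something as simple as this (column and table names 
made up):

import org.apache.spark.sql.functions._

// Filter to one month and count trips per day, assuming a timestamp column start_ts.
trips
  .filter(year($"start_ts") === 2017 && month($"start_ts") === 9)
  .groupBy(dayofmonth($"start_ts").as("day"))
  .count()
  .orderBy($"day")
  .show()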

You can see my work here: 
https://github.com/BobLovesData/Apache-Spark-In-24-Hours/blob/master/src/net/massstreet/hour10/BayAreaBikeAnalysis.scala

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData


