bug: Spark should not use java.sql.Timestamp (was: sql timestamp timezone bug)
Here is a nice analysis of the issue from the Cassandra mailing list. (DataStax is to Cassandra roughly what Databricks is to Spark.) Should I file a bug?

Kind regards

Andy

http://stackoverflow.com/questions/2305973/java-util-date-vs-java-sql-date and this one

On Fri, Mar 18, 2016 at 11:35 AM Russell Spitzer <russ...@datastax.com> wrote:
> Unfortunately this is part of Spark SQL. They have based their type on
> java.sql.Timestamp (and Date), which adjusts to the client timezone when
> displaying and storing.
> See the discussion
> http://stackoverflow.com/questions/9202857/timezones-in-sql-date-vs-java-sql-date
> and the code
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93

From: Andrew Davidson <a...@santacruzintegration.com>
Date: Thursday, March 17, 2016 at 3:25 PM
To: Andrew Davidson <a...@santacruzintegration.com>, "user @spark" <user@spark.apache.org>
Subject: Re: sql timestamp timezone bug

> [message and its quoted original snipped; both appear verbatim later in this thread]
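The behavior Russell describes is easy to reproduce outside Spark. A minimal sketch in plain Python (not the actual Spark code path; the 8-hour figure assumes a client in PST):

# Show that a timestamp string with no explicit zone maps to a
# different instant depending on the zone it is parsed in -- the
# mechanism behind the mismatch discussed in this thread.
import calendar
import time

s = '2016-03-12 00:30:00'
parts = time.strptime(s, '%Y-%m-%d %H:%M:%S')

local_epoch = time.mktime(parts)    # string interpreted in the client's local zone
utc_epoch = calendar.timegm(parts)  # string interpreted as UTC

# On a PST (UTC-8) client this prints 28800.0 (8 hours); on a UTC
# machine it prints 0.0.
print(local_epoch - utc_epoch)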
Re: sql timestamp timezone bug
Hi Davies

> What's the type of `created`? TimestampType?

The 'created' column in Cassandra is a timestamp
https://docs.datastax.com/en/cql/3.0/cql/cql_reference/timestamp_type_r.html

In the Spark data frame it is a timestamp.

> If yes, when created is compared to a string, it will be cast to string,
> then compared as a string; it becomes
>
> cast(created as string) > '2016-03-12 00:30:00+0000'
>
> Could you try this
>
> sqlCtx.sql("select created, cast(created as string) from rawTable").show()

I am not sure I understand your suggestion. In my where clause the date
range is specified using string literals. I need the value of created to be
a timestamp.

# http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html
stmnt = "select \
            row_key, created, cast(created as string), count, \
            unix_timestamp(created) as unixTimeStamp, \
            unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.zz') as aedwip, \
            to_utc_timestamp(created, 'gmt') as gmt \
        from \
            rawTable \
        where \
            (created > '{0}') and (created <= '{1}') \
            and \
            (row_key = 'red' \
            or row_key = 'blue')".format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')

rawDF = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="json_timeseries", keyspace="notification")\
    .load()
rawDF.registerTempTable(tmpTableName)
rawDF = sqlCtx.sql(stmnt).cache()

The timestamps are still not UTC, they are in PST:

root
 |-- row_key: string (nullable = true)
 |-- created: timestamp (nullable = true)
 |-- created: string (nullable = true)
 |-- count: long (nullable = true)
 |-- unixTimeStamp: long (nullable = true)
 |-- aedwip: long (nullable = true)
 |-- gmt: timestamp (nullable = true)

+-------+---------------------+-------------------+-----+-------------+----------+---------------------+
|row_key|created              |created            |count|unixTimeStamp|aedwip    |gmt                  |
+-------+---------------------+-------------------+-----+-------------+----------+---------------------+
|blue   |2016-03-12 00:30:30.0|2016-03-12 00:30:30|2    |1457771430   |1457771430|2016-03-12 00:30:30.0|
|blue   |2016-03-12 00:30:45.0|2016-03-12 00:30:45|1    |1457771445   |1457771445|2016-03-12 00:30:45.0|
|blue   |2016-03-12 00:31:00.0|2016-03-12 00:31:00|1    |1457771460   |1457771460|2016-03-12 00:31:00.0|
+-------+---------------------+-------------------+-----+-------------+----------+---------------------+

Kind regards

Andy
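The output above already pins down the offset. A quick cross-check in plain Python, using only values copied from the first row of the table:

# 1457771430 is '2016-03-12 00:30:30' interpreted as PST (UTC-8),
# not as UTC -- i.e. unix_timestamp() applied the client's zone.
import calendar
import time

wallclock = time.strptime('2016-03-12 00:30:30', '%Y-%m-%d %H:%M:%S')
as_utc = calendar.timegm(wallclock)  # 1457742630

spark_value = 1457771430             # unixTimeStamp column, first row
print(spark_value - as_utc)          # 28800 seconds = 8 hours = the PST offset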
Re: sql timestamp timezone bug
For completeness: clearly Spark SQL returned a different data set.

In [4]:
rawDF.selectExpr("count(row_key) as num_samples",
                 "sum(count) as total",
                 "max(count) as max").show()

+-----------+-----+---+
|num_samples|total|max|
+-----------+-----+---+
|       2037| 3867| 67|
+-----------+-----+---+

From: Andrew Davidson <a...@santacruzintegration.com>
Date: Thursday, March 17, 2016 at 3:02 PM
To: "user @spark" <user@spark.apache.org>
Subject: sql timestamp timezone bug

> [original message quoted in full; it appears verbatim below]
sql timestamp timezone bug
I am using pyspark 1.6.0 and
datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
data.

The data is originally captured by a Spark streaming app and written to
Cassandra. The value of the timestamp comes from

Rdd.foreachRDD(new VoidFunction2<JavaRDD, Time>() {…});

I am confident the timestamp is stored correctly in Cassandra and that the
clocks on the machines in my cluster are set correctly.

I noticed that if I used Cassandra CQLSH to select a data set between two
points in time, the row count did not match the row count I got when I did
the same select in Spark using SQL. It appears that Spark SQL assumes all
timestamp strings are in the local time zone.

Here is what I expect (this is what is returned by CQLSH):

cqlsh> select
   ...     count(row_key) as num_samples, sum(count) as total, max(count) as max
   ... from
   ...     notification.json_timeseries
   ... where
   ...     row_key in ('red', 'blue')
   ...     and created > '2016-03-12 00:30:00+0000'
   ...     and created <= '2016-03-12 04:30:00+0000'
   ... allow filtering;

 num_samples | total | max
-------------+-------+-----
        3242 | 11277 |  17

Here is my pyspark select statement. Notice the 'created' column encodes
the timezone. I am running this on my local mac (in the PST timezone) and
connecting to my data center (which runs on UTC) over a VPN.

rawDF = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="json_timeseries", keyspace="notification")\
    .load()

rawDF.registerTempTable(tmpTableName)

stmnt = "select \
            row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
            unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
            to_utc_timestamp(created, 'gmt') as gmt \
        from \
            rawTable \
        where \
            (created > '{0}') and (created <= '{1}') \
            and \
            (row_key = 'red' or row_key = 'blue')".format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')

rawDF = sqlCtx.sql(stmnt).cache()

I get different values for row count, max, …

If I convert the UTC timestamp string to my local timezone, the row count
matches the count returned by cqlsh:

# pst works, matches cassandra cqlsh
# .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')

Am I doing something wrong in my pyspark code?

Kind regards

Andy
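The commented-out .format(...) line above amounts to shifting the UTC window into the client's zone by hand. A minimal sketch of that workaround (the helper name and the fixed UTC-8 offset are illustrative only; a real version would use a tz database to handle DST):

# Shift a UTC query window into the client's local zone before
# formatting it as a zone-less string literal for Spark SQL.
from datetime import datetime, timedelta

PST_OFFSET = timedelta(hours=-8)   # fixed offset; ignores DST
FMT = '%Y-%m-%d %H:%M:%S'

def utc_window_to_local(start_utc, end_utc):
    shift = lambda s: (datetime.strptime(s, FMT) + PST_OFFSET).strftime(FMT)
    return shift(start_utc), shift(end_utc)

# Prints ('2016-03-11 16:30:00', '2016-03-11 20:30:00') -- the same
# window the commented-out .format(...) line arrives at by hand.
print(utc_window_to_local('2016-03-12 00:30:00', '2016-03-12 04:30:00'))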
Re: sql timestamp timezone bug
On Thu, Mar 17, 2016 at 3:02 PM, Andy Davidson wrote:
> I am using pyspark 1.6.0 and
> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
> data
>
> [rest of the original message snipped; it is quoted in full above]
>
> stmnt = "select \
>             row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
>             unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
>             to_utc_timestamp(created, 'gmt') as gmt \
>         from \
>             rawTable \
>         where \
>             (created > '{0}') and (created <= '{1}') \
>             and \
>             (row_key = 'red' or row_key = 'blue')".format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')
>
> rawDF = sqlCtx.sql(stmnt).cache()

What's the type of `created`? TimestampType?

If yes, when created is compared to a string, it will be cast to string,
then compared as a string; it becomes

cast(created as string) > '2016-03-12 00:30:00+0000'

Could you try this?

sqlCtx.sql("select created, cast(created as string) from rawTable").show()
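A small sketch of the failure mode this implies (the zone-less rendering of cast(created as string) on a PST client is the assumption here): once both sides are strings, the comparison is lexicographic text comparison, not an instant comparison.

# `created` rendered as a zone-less local-time string vs. a '+0000'
# literal: plain text comparison, wrong answer.
row_in_pst = '2016-03-11 16:30:30'        # cast(created as string) on a PST client
literal_utc = '2016-03-12 00:30:00+0000'  # window start, written as UTC

# The row IS inside the UTC window (16:30:30 PST == 00:30:30 UTC the
# next day), but the string comparison says it is not:
print(row_in_pst > literal_utc)           # False -- the row is silently dropped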
Re: sql timestamp timezone bug
In Spark SQL, a timestamp is the number of microseconds since the epoch, so
it has nothing to do with timezone. When you compare it against a
unix_timestamp or a string, it's better to convert these into timestamps and
then compare them.

In your case, the where clause should be:

where (created > cast('{0}' as timestamp)) and (created <= cast('{1}' as timestamp))

Could you try this?

On Fri, Mar 18, 2016 at 11:10 AM, Andy Davidson wrote:
> Hi Davies
>
> [reply quoted in full; snipped here -- it appears verbatim earlier in this thread]
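A minimal sketch of this suggestion applied to the query from the original message; sqlCtx and the rawTable temp table are assumed to be the ones registered earlier in the thread:

# Compare `created` against explicit timestamp casts instead of bare
# string literals, so Spark SQL does not fall back to a string
# comparison.
stmnt = "select \
            row_key, created, count \
        from \
            rawTable \
        where \
            (created > cast('{0}' as timestamp)) \
            and (created <= cast('{1}' as timestamp)) \
            and (row_key = 'red' or row_key = 'blue')".format(
                '2016-03-12 00:30:00', '2016-03-12 04:30:00')

rawDF = sqlCtx.sql(stmnt).cache()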