Bug: Spark should not use java.sql.Timestamp (was: sql timestamp timezone bug)

2016-03-20 Thread Andy Davidson
Here is a nice analysis of the issue from the Cassandra mailing list. (Datastax
is the Databricks for Cassandra.)

Should I file a bug?

Kind regards

Andy

http://stackoverflow.com/questions/2305973/java-util-date-vs-java-sql-date
and this one 

On Fri, Mar 18, 2016 at 11:35 AM Russell Spitzer <russ...@datastax.com>
wrote:
> Unfortunately this is part of Spark SQL. They have based their type on
> java.sql.Timestamp (and Date), which adjusts to the client timezone when
> displaying and storing.
> See discussion:
> http://stackoverflow.com/questions/9202857/timezones-in-sql-date-vs-java-sql-date
> and code:
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> 
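
A minimal pyspark sketch of the behaviour Russell describes (not from the
original thread; it assumes a running SparkContext `sc` and uses pyspark's
py4j gateway to reach the same java.sql.Timestamp class):

millis = 1457740800000                   # 2016-03-12 00:00:00 UTC, in milliseconds
ts = sc._jvm.java.sql.Timestamp(millis)  # the class Spark SQL's TimestampType builds on
print(ts.toString())
# Prints '2016-03-12 00:00:00.0' on a UTC machine but '2016-03-11 16:00:00.0'
# on a PST laptop: the stored instant is identical, only the rendering (and any
# string round-trip built on it) follows the client's default timezone.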

From:  Andrew Davidson <a...@santacruzintegration.com>
Date:  Thursday, March 17, 2016 at 3:25 PM
To:  Andrew Davidson <a...@santacruzintegration.com>, "user @spark"
<user@spark.apache.org>
Subject:  Re: sql timestamp timezone bug

> 
> For completeness: clearly Spark SQL returned a different data set.
> 
> In [4]:
> rawDF.selectExpr("count(row_key) as num_samples",
> "sum(count) as total",
> "max(count) as max ").show()
> +-----------+-----+---+
> |num_samples|total|max|
> +-----------+-----+---+
> |       2037| 3867| 67|
> +-----------+-----+---+
> 
> 
> From:  Andrew Davidson <a...@santacruzintegration.com>
> Date:  Thursday, March 17, 2016 at 3:02 PM
> To:  "user @spark" <user@spark.apache.org>
> Subject:  sql timestamp timezone bug
> 
>> I am using pyspark 1.6.0 and
>> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
>> data
>> 
>> The data is originally captured by a spark streaming app and written to
>> Cassandra. The value of the timestamp comes from
>> 
>> Rdd.foreachRDD(new VoidFunction2<JavaRDD, Time>() {
>>     ... });
>> 
>> I am confident the timestamp is stored correctly in Cassandra and that
>> the clocks on the machines in my cluster are set correctly.
>> 
>> I noticed that if I use Cassandra CQLSH to select a data set between two
>> points in time, the row count does not match the row count I get when I do
>> the same select in Spark using SQL. It appears that Spark SQL assumes all
>> timestamp strings are in the local time zone.
>> 
>> 
>> Here is what I expect (this is what is returned by CQLSH):
>> cqlsh> select
>>... count(row_key) as num_samples, sum(count) as total, max(count)
>> as max
>>... from
>>... notification.json_timeseries
>>... where
>>... row_key in ('red', 'blue')
>>... and created > '2016-03-12 00:30:00+'
>>... and created <= '2016-03-12 04:30:00+'
>>... allow filtering;
>> 
>>  num_samples | total | max
>> -------------+-------+-----
>>         3242 | 11277 |  17
>> 
>> 
>> Here is my pyspark select statement. Notice that the 'created' column
>> encodes the timezone. I am running this on my local mac (in the PST
>> timezone) and connecting to my data center (which runs on UTC) over a VPN.
>> 
>> rawDF = sqlContext.read\
>> .format("org.apache.spark.sql.cassandra")\
>> .options(table="json_timeseries", keyspace="notification")\
>> .load() 
>> 
>> 
>> rawDF.registerTempTable(tmpTableName)
>> 
>> 
>> 
>> stmnt = "select \
>> row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
>> unix_timestamp(created, '-MM-dd HH:mm:ss.z') as hack, \
>> to_utc_timestamp(created, 'gmt') as gmt \
>> from \
>> rawTable \
>> where \
>> (created > '{0}') and (created <= '{1}') \
>> and \
>> (row_key = 똱ed' or row_key = 똟lue¹) \
>> )".format('2016-03-12 00:30:00+', '2016-03-12 04:30:00+')
>> 
>> rawDF = sqlCtx.sql(stmnt).cache()
>> 
>> 
>> 
>> 
>> I get different values for row count, max, etc.
>> 
>> If I convert the UTC timestamp string to my local timezone, the row count
>> matches the count returned by cqlsh.
>> 
>> # pst works, matches cassandra cqlsh
>> # .format('2016-03-11 16:30:00+', '2016-03-11 20:30:00+')
>> 
>> Am I doing something wrong in my pyspark code?
>> 
>> 
>> Kind regards
>> 
>> Andy
>> 
>> 
>> 
>> 
>> 




Re: sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
Hi Davies


> 
> What's the type of `created`? TimestampType?


The 'created' column in Cassandra is a timestamp
https://docs.datastax.com/en/cql/3.0/cql/cql_reference/timestamp_type_r.html

In the Spark data frame it is a timestamp.

> 
> If yes, when created is compared to a string, it will be cast to a
> string, then compared as a string; it becomes
> 
> cast(created as string) > '2016-03-12 00:30:00+'
> 
> Could you try this?
> 
> sqlCtx.sql("select created, cast(created as string) from rawTable").show()


I am not sure I understand your suggestion. In my where clause the date
range is specified using string literals. I need the value of created to be
a timestamp.

# http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html
stmnt = "select \
row_key, created,  cast(created as string), count,
unix_timestamp(created) as unixTimeStamp, \
unix_timestamp(created, '-MM-dd HH:mm:ss.zz') as aedwip, \
to_utc_timestamp(created, 'gmt') as gmt \
 from \
rawTable \
 where \
 (created > '{0}') and (created <= '{1}') \
 and \
 (row_key = Œred' \
or row_key = Œblue' )".format('2016-03-12
00:30:00+', '2016-03-12 04:30:00+')

rawDF = sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="json_timeseries", keyspace="notification")\
.load() 
rawDF.registerTempTable(tmpTableName)
rawDF = sqlCtx.sql(stmnt).cache()


The timestamps are still not UTC; they are in PST.

root
 |-- row_key: string (nullable = true)
 |-- created: timestamp (nullable = true)
 |-- created: string (nullable = true)
 |-- count: long (nullable = true)
 |-- unixTimeStamp: long (nullable = true)
 |-- aedwip: long (nullable = true)
 |-- gmt: timestamp (nullable = true)

+-------+---------------------+-------------------+-----+-------------+----------+---------------------+
|row_key|created              |created            |count|unixTimeStamp|aedwip    |gmt                  |
+-------+---------------------+-------------------+-----+-------------+----------+---------------------+
|blue   |2016-03-12 00:30:30.0|2016-03-12 00:30:30|2    |1457771430   |1457771430|2016-03-12 00:30:30.0|
|blue   |2016-03-12 00:30:45.0|2016-03-12 00:30:45|1    |1457771445   |1457771445|2016-03-12 00:30:45.0|
|blue   |2016-03-12 00:31:00.0|2016-03-12 00:31:00|1    |1457771460   |1457771460|2016-03-12 00:31:00.0|
...
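
One quick check (a sketch, not something suggested in the thread) is to ask the
driver JVM which default timezone it is using, since that is what the
java.sql.Timestamp-based rendering above falls back on:

# Uses the py4j gateway pyspark exposes as sc._jvm; assumes a running SparkContext.
print(sc._jvm.java.util.TimeZone.getDefault().getID())
# e.g. 'America/Los_Angeles' on the PST laptop, a UTC zone id on the cluster.
# Pinning both sides with -Duser.timezone=UTC (spark.driver.extraJavaOptions /
# spark.executor.extraJavaOptions) is one common way to make them agree.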

Kind regards

Andy




Re: sql timestamp timezone bug

2016-03-19 Thread Andy Davidson

For completeness: clearly Spark SQL returned a different data set.

In [4]:
rawDF.selectExpr("count(row_key) as num_samples",
"sum(count) as total",
"max(count) as max ").show()
+-----------+-----+---+
|num_samples|total|max|
+-----------+-----+---+
|       2037| 3867| 67|
+-----------+-----+---+


From:  Andrew Davidson <a...@santacruzintegration.com>
Date:  Thursday, March 17, 2016 at 3:02 PM
To:  "user @spark" <user@spark.apache.org>
Subject:  sql timestamp timezone bug

> I am using pyspark 1.6.0 and
> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
> data
> 
> The data is originally captured by a spark streaming app and written to
> Cassandra. The value of the timestamp comes from
> 
> Rdd.foreachRDD(new VoidFunction2<JavaRDD, Time>() {
>     ... });
> 
> I am confident the timestamp is stored correctly in Cassandra and that
> the clocks on the machines in my cluster are set correctly.
>
> I noticed that if I use Cassandra CQLSH to select a data set between two
> points in time, the row count does not match the row count I get when I do
> the same select in Spark using SQL. It appears that Spark SQL assumes all
> timestamp strings are in the local time zone.
> 
> 
> Here is what I expect (this is what is returned by CQLSH):
> cqlsh> select
>... count(row_key) as num_samples, sum(count) as total, max(count)
> as max
>... from
>... notification.json_timeseries
>... where
>... row_key in ('red', 'blue')
>... and created > '2016-03-12 00:30:00+'
>... and created <= '2016-03-12 04:30:00+'
>... allow filtering;
> 
>  num_samples | total | max
> -------------+-------+-----
>         3242 | 11277 |  17
> 
> 
> Here is my pyspark select statement. Notice that the 'created' column
> encodes the timezone. I am running this on my local mac (in the PST
> timezone) and connecting to my data center (which runs on UTC) over a VPN.
> 
> rawDF = sqlContext.read\
> .format("org.apache.spark.sql.cassandra")\
> .options(table="json_timeseries", keyspace="notification")\
> .load() 
> 
> 
> rawDF.registerTempTable(tmpTableName)
> 
> 
> 
> stmnt = "select \
> row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
> unix_timestamp(created, '-MM-dd HH:mm:ss.z') as hack, \
> to_utc_timestamp(created, 'gmt') as gmt \
> from \
> rawTable \
> where \
> (created > '{0}') and (created <= '{1}') \
> and \
> (row_key = 똱ed' or row_key = 똟lue¹) \
> )".format('2016-03-12 00:30:00+', '2016-03-12 04:30:00+')
> 
> rawDF = sqlCtx.sql(stmnt).cache()
> 
> 
> 
> 
> I get different values for row count, max, etc.
> 
> If I convert the UTC timestamp string to my local timezone, the row count
> matches the count returned by cqlsh.
> 
> # pst works, matches cassandra cqlsh
> # .format('2016-03-11 16:30:00+', '2016-03-11 20:30:00+')
> 
> Am I doing something wrong in my pyspark code?
> 
> 
> Kind regards
> 
> Andy
> 
> 
> 
> 
> 




sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
I am using pyspark 1.6.0 and
datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
data

The data is originally captured by a spark streaming app and written to
Cassandra. The value of the timestamp comes from

Rdd.foreachRDD(new VoidFunction2<JavaRDD, Time>() {
    ... });

I am confident the timestamp is stored correctly in Cassandra and that
the clocks on the machines in my cluster are set correctly.

I noticed that if I use Cassandra CQLSH to select a data set between two
points in time, the row count does not match the row count I get when I do
the same select in Spark using SQL. It appears that Spark SQL assumes all
timestamp strings are in the local time zone.
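
A quick way to see this (a sketch, assuming a running sqlContext and the
default 'yyyy-MM-dd HH:mm:ss' parse format):

sqlContext.sql("select unix_timestamp('2016-03-12 00:30:00') as epoch").show()
# On a JVM whose default timezone is PST this yields 1457771400 (the string is
# read as 00:30 PST, i.e. 08:30 UTC); on a UTC JVM it yields 1457742600.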


Here is what I expect (this is what is returned by CQLSH):
cqlsh> select
   ... count(row_key) as num_samples, sum(count) as total, max(count)
as max
   ... from
   ... notification.json_timeseries
   ... where
   ... row_key in ('red', 'blue')
   ... and created > '2016-03-12 00:30:00+'
   ... and created <= '2016-03-12 04:30:00+'
   ... allow filtering;

 num_samples | total | max
-------------+-------+-----
        3242 | 11277 |  17


Here is my pyspark select statement. Notice that the 'created' column
encodes the timezone. I am running this on my local mac (in the PST
timezone) and connecting to my data center (which runs on UTC) over a VPN.

rawDF = sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="json_timeseries", keyspace="notification")\
.load() 


rawDF.registerTempTable(tmpTableName)



stmnt = "select \
row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
unix_timestamp(created, '-MM-dd HH:mm:ss.z') as hack, \
to_utc_timestamp(created, 'gmt') as gmt \
from \
rawTable \
where \
(created > '{0}') and (created <= '{1}') \
and \
(row_key = Œred' or row_key = Œblue¹) \
)".format('2016-03-12 00:30:00+', '2016-03-12 04:30:00+')

rawDF = sqlCtx.sql(stmnt).cache()




I get different values for row count, max, etc.

If I convert the UTC timestamp string to my local timezone, the row count
matches the count returned by cqlsh.

# pst works, matches cassandra cqlsh
# .format('2016-03-11 16:30:00+', '2016-03-11 20:30:00+')
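
For anyone reproducing that shift, here is a hedged sketch (the helper name is
made up, not part of the original post) that converts the intended UTC window
into the local-time strings that made the counts match:

from datetime import datetime
import calendar, time

def utc_str_to_local_str(s, fmt='%Y-%m-%d %H:%M:%S'):
    """Hypothetical helper: re-render a UTC wall-clock string in the local zone."""
    epoch = calendar.timegm(datetime.strptime(s, fmt).timetuple())  # read as UTC
    return time.strftime(fmt, time.localtime(epoch))                # render locally

# On a PST machine '2016-03-12 00:30:00' (UTC) becomes '2016-03-11 16:30:00',
# which is exactly the window used in the working .format() call above.
print(utc_str_to_local_str('2016-03-12 00:30:00'))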

Am I doing something wrong in my pyspark code?


Kind regards

Andy






Re: sql timestamp timezone bug

2016-03-19 Thread Davies Liu
On Thu, Mar 17, 2016 at 3:02 PM, Andy Davidson
 wrote:
> I am using pyspark 1.6.0 and
> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
> data
>
> The data is originally captured by a spark streaming app and written to
> Cassandra. The value of the timestamp comes from
>
> Rdd.foreachRDD(new VoidFunction2<JavaRDD, Time>() {
>     ... });
>
> I am confident the timestamp is stored correctly in Cassandra and that
> the clocks on the machines in my cluster are set correctly.
>
> I noticed that if I use Cassandra CQLSH to select a data set between two
> points in time, the row count does not match the row count I get when I do
> the same select in Spark using SQL. It appears that Spark SQL assumes all
> timestamp strings are in the local time zone.
>
>
> Here is what I expect (this is what is returned by CQLSH):
> cqlsh> select
>... count(row_key) as num_samples, sum(count) as total, max(count)
> as max
>... from
>... notification.json_timeseries
>... where
>... row_key in ('red', 'blue')
>... and created > '2016-03-12 00:30:00+'
>... and created <= '2016-03-12 04:30:00+'
>... allow filtering;
>
>  num_samples | total | max
> -------------+-------+-----
>         3242 | 11277 |  17
>
>
> Here is my pyspark select statement. Notice that the 'created' column
> encodes the timezone. I am running this on my local mac (in the PST
> timezone) and connecting to my data center (which runs on UTC) over a VPN.
>
> rawDF = sqlContext.read\
> .format("org.apache.spark.sql.cassandra")\
> .options(table="json_timeseries", keyspace="notification")\
> .load()
>
>
> rawDF.registerTempTable(tmpTableName)
>
>
>
> stmnt = "select \
> row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
> unix_timestamp(created, '-MM-dd HH:mm:ss.z') as hack, \
> to_utc_timestamp(created, 'gmt') as gmt \
> from \
> rawTable \
> where \
> (created > '{0}') and (created <= '{1}') \
> and \
> (row_key = Œred' or row_key = Œblue¹) \
> )".format('2016-03-12 00:30:00+', '2016-03-12 04:30:00+')
>
> rawDF = sqlCtx.sql(stmnt).cache()

What's the type of `created`? TimestampType?

If yes, when created is compared to a string, it will be cast to a
string, then compared as a string; it becomes

cast(created as string) > '2016-03-12 00:30:00+'

Could you try this?

sqlCtx.sql("select created, cast(created as string) from rawTable").show()



>
>
>
> I get different values for row count, max, etc.
>
> If I convert the UTC timestamp string to my local timezone, the row count
> matches the count returned by cqlsh.
>
> # pst works, matches cassandra cqlsh
> # .format('2016-03-11 16:30:00+', '2016-03-11 20:30:00+')
>
> Am I doing something wrong in my pyspark code?
>
>
> Kind regards
>
> Andy
>
>
>
>




Re: sql timestamp timezone bug

2016-03-19 Thread Davies Liu
In Spark SQL, a timestamp is the number of microseconds since the epoch, so
it has nothing to do with the timezone.

When you compare it against a unix_timestamp or a string, it's better to
convert these into timestamps and then compare them.

In your case, the where clause should be:

where (created > cast('{0}' as timestamp))
  and (created <= cast('{1}' as timestamp))

Could you try this?
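
Applied to the query from earlier in the thread, that would look roughly like
this (a sketch using the thread's temp table and row keys; the column list is
abbreviated, and the trailing '+' from the original literals is dropped on the
assumption that cast(... as timestamp) wants a plain 'yyyy-MM-dd HH:mm:ss'
string):

# Hedged sketch: Davies' cast-to-timestamp filter dropped into Andy's query.
stmnt = "select \
    row_key, created, count \
from \
    rawTable \
where \
    (created > cast('{0}' as timestamp)) \
    and (created <= cast('{1}' as timestamp)) \
    and (row_key = 'red' or row_key = 'blue')".format(
        '2016-03-12 00:30:00', '2016-03-12 04:30:00')

rawDF = sqlCtx.sql(stmnt).cache()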

On Fri, Mar 18, 2016 at 11:10 AM, Andy Davidson
 wrote:
> Hi Davies
>
>
>
> What's the type of `created`? TimestampType?
>
>
>
> The 'created' column in Cassandra is a timestamp
> https://docs.datastax.com/en/cql/3.0/cql/cql_reference/timestamp_type_r.html
>
> In the Spark data frame it is a timestamp.
>
>
> If yes, when created is compared to a string, it will be cast to a
> string, then compared as a string; it becomes
>
> cast(created as string) > '2016-03-12 00:30:00+'
>
> Could you try this?
>
> sqlCtx.sql("select created, cast(created as string) from rawTable").show()
>
>
>
> I am not sure I understand your suggestion. In my where clause the date
> range is specified using string literals. I need the value of created to be
> a timestamp.
>
> # http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html
> stmnt = "select \
> row_key, created,  cast(created as string), count,
> unix_timestamp(created) as unixTimeStamp, \
> unix_timestamp(created, '-MM-dd HH:mm:ss.zz') as aedwip, \
> to_utc_timestamp(created, 'gmt') as gmt \
>  from \
> rawTable \
>  where \
>  (created > '{0}') and (created <= '{1}') \
>  and \
>  (row_key = ‘red' \
> or row_key = ‘blue' )".format('2016-03-12
> 00:30:00+', '2016-03-12 04:30:00+')
>
> rawDF = sqlContext.read\
> .format("org.apache.spark.sql.cassandra")\
> .options(table="json_timeseries", keyspace="notification")\
> .load()
> rawDF.registerTempTable(tmpTableName)
> rawDF = sqlCtx.sql(stmnt).cache()
>
>
> The timestamps are still not UTC; they are in PST.
>
> root
>  |-- row_key: string (nullable = true)
>  |-- created: timestamp (nullable = true)
>  |-- created: string (nullable = true)
>  |-- count: long (nullable = true)
>  |-- unixTimeStamp: long (nullable = true)
>  |-- aedwip: long (nullable = true)
>  |-- gmt: timestamp (nullable = true)
>
> +-------+---------------------+-------------------+-----+-------------+----------+---------------------+
> |row_key|created              |created            |count|unixTimeStamp|aedwip    |gmt                  |
> +-------+---------------------+-------------------+-----+-------------+----------+---------------------+
> |blue   |2016-03-12 00:30:30.0|2016-03-12 00:30:30|2    |1457771430   |1457771430|2016-03-12 00:30:30.0|
> |blue   |2016-03-12 00:30:45.0|2016-03-12 00:30:45|1    |1457771445   |1457771445|2016-03-12 00:30:45.0|
> |blue   |2016-03-12 00:31:00.0|2016-03-12 00:31:00|1    |1457771460   |1457771460|2016-03-12 00:31:00.0|
> ...
>
>
> Kind regards
>
> Andy
