For completeness: Spark SQL clearly returned a different data set for the same time window.

In [4]:
rawDF.selectExpr("count(row_key) as num_samples",
                 "sum(count) as total",
                 "max(count) as max").show()
+-----------+-----+---+
|num_samples|total|max|
+-----------+-----+---+
|       2037| 3867| 67|
+-----------+-----+---+
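
One quick way to see why: unix_timestamp() parses a bare timestamp string in
the JVM's default time zone, so the same literal maps to different epoch
values on a PST laptop and on a UTC cluster. A minimal check (a sketch,
assuming the same sqlContext that loaded rawDF):

sqlContext.range(1).selectExpr(
    "unix_timestamp('2016-03-12 00:30:00') as parsed_epoch").show()
# prints 1457742600 if the driver JVM runs on UTC, and 1457771400
# (eight hours later) if it runs on US/Pacific

A commonly suggested workaround is to start the driver and executors with
-Duser.timezone=UTC (via spark.driver.extraJavaOptions and
spark.executor.extraJavaOptions) so that both sides parse such strings as UTC.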


From:  Andrew Davidson <a...@santacruzintegration.com>
Date:  Thursday, March 17, 2016 at 3:02 PM
To:  "user @spark" <user@spark.apache.org>
Subject:  sql timestamp timezone bug

> I am using pyspark 1.6.0 and
> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
> data
> 
> The data is originally captured by a Spark Streaming app and written to
> Cassandra. The value of the timestamp comes from
> 
> Rdd.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>()
> …});
> 
> I am confident the timestamp is stored correctly in Cassandra and that
> the clocks on the machines in my cluster are set correctly.
> 
> I noticed that if I use Cassandra cqlsh to select a data set between two
> points in time, the row count does not match the row count I get when I
> run the same select in Spark SQL. It appears that Spark SQL assumes all
> timestamp strings are in the local time zone.
> 
> 
> Here is what I expect (this is what is returned by cqlsh):
> cqlsh> select
>    ...     count(row_key) as num_samples, sum(count) as total, max(count)
> as max
>    ... from
>    ...     notification.json_timeseries
>    ... where
>    ...     row_key in ('red', 'blue')
>    ...     and created > '2016-03-12 00:30:00+0000'
>    ...     and created <= '2016-03-12 04:30:00+0000'
>    ... allow filtering;
> 
>  num_samples | total | max
> -------------+-------+-----
>         3242 | 11277 |  17
> 
> 
> Here is my pyspark select statement. Notice the 'created' column encodes
> the timezone. I am running this on my local Mac (in the PST timezone) and
> connecting to my data center (which runs on UTC) over a VPN.
> 
> rawDF = sqlContext.read\
> .format("org.apache.spark.sql.cassandra")\
> .options(table="json_timeseries", keyspace="notification")\
> .load() 
> 
> 
> tmpTableName = "rawTable"
> rawDF.registerTempTable(tmpTableName)
> 
> 
> 
> stmnt = "select \
> row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
> unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
> to_utc_timestamp(created, 'gmt') as gmt \
> from \
> rawTable \
> where \
> (created > '{0}') and (created <= '{1}') \
> and \
> (row_key = 'red' or row_key = 'blue') \
> ".format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')
> 
> rawDF = sqlContext.sql(stmnt).cache()
> 
> 
> 
> 
> I get different values for row count, max, …
> 
> If I convert the UTC timestamp string to my local timezone, the row count
> matches the count returned by cqlsh.
> 
> # pst works, matches cassandra cqlsh
> # .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')
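> 
> A scripted version of that hand conversion (a sketch using only the
> standard library; it assumes Spark is parsing the bare strings in the
> driver's local time zone, which is what the mismatched counts suggest):
> 
> import calendar, time
> from datetime import datetime
> 
> def utc_str_to_local(utc_str, fmt='%Y-%m-%d %H:%M:%S'):
>     # parse the UTC part, convert to epoch seconds, then render the same
>     # instant in the driver's local time zone
>     utc_dt = datetime.strptime(utc_str[:19], fmt)
>     epoch = calendar.timegm(utc_dt.timetuple())
>     return time.strftime(fmt, time.localtime(epoch))
> 
> start = utc_str_to_local('2016-03-12 00:30:00+0000')  # '2016-03-11 16:30:00' on a PST driver
> end = utc_str_to_local('2016-03-12 04:30:00+0000')    # '2016-03-11 20:30:00'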
> 
> Am I doing something wrong in my pyspark code?
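> 
> A variant that would take time-zone parsing out of the comparison entirely
> (a sketch; it assumes `created` really is a TimestampType column from the
> connector, and relies on Spark casting a timestamp to double as seconds
> since the epoch):
> 
> import calendar
> from datetime import datetime
> 
> def to_epoch(utc_str):
>     # interpret the string as UTC and return integer epoch seconds
>     return calendar.timegm(
>         datetime.strptime(utc_str, '%Y-%m-%d %H:%M:%S').timetuple())
> 
> lo = to_epoch('2016-03-12 00:30:00')   # 1457742600
> hi = to_epoch('2016-03-12 04:30:00')
> 
> stmnt = ("select row_key, created, count from rawTable "
>          "where cast(created as double) > {0} "
>          "and cast(created as double) <= {1} "
>          "and row_key in ('red', 'blue')").format(lo, hi)
> rawDF = sqlContext.sql(stmnt).cache()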
> 
> 
> Kind regards
> 
> Andy
> 
> 
> 

