I am using pyspark 1.6.0 and datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series data.
The data is originally captured by a Spark Streaming app and written to Cassandra. The timestamp value comes from the Time argument passed to foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() { ... }). I am confident the timestamp is stored correctly in Cassandra and that the clocks on the machines in my cluster are set correctly.

I noticed that if I use CQLSH to select a data set between two points in time, the row count does not match the row count I get when I do the same select in Spark SQL. It appears Spark SQL interprets timestamp strings in the local time zone, even when they carry an explicit +0000 offset.

Here is what I expect (this is what is returned by CQLSH):

cqlsh> select
   ...     count(row_key) as num_samples, sum(count) as total, max(count) as max
   ... from
   ...     notification.json_timeseries
   ... where
   ...     row_key in ('red', 'blue')
   ...     and created > '2016-03-12 00:30:00+0000'
   ...     and created <= '2016-03-12 04:30:00+0000'
   ... allow filtering;

 num_samples | total | max
-------------+-------+-----
        3242 | 11277 |  17

Here is my pyspark select statement. Notice the created bound strings encode the time zone. I am running this on my local Mac (in the PST time zone) and connecting to my data center (which runs on UTC) over a VPN.

rawDF = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="json_timeseries", keyspace="notification")\
    .load()
rawDF.registerTempTable("rawTable")

stmnt = "select \
            row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
            unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
            to_utc_timestamp(created, 'gmt') as gmt \
        from \
            rawTable \
        where \
            (created > '{0}') and (created <= '{1}') \
            and \
            (row_key = 'red' or row_key = 'blue')"\
        .format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')

rawDF = sqlContext.sql(stmnt).cache()

I get different values for the row count and max. If I convert the UTC timestamp strings to my local time zone, the row count matches the count returned by cqlsh (a sketch of that conversion is in the P.S. below):

# pst works, matches cassandra cqlsh
# .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')

Am I doing something wrong in my pyspark code?

Kind regards

Andy
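P.S. For the archive, here is a minimal sketch of the conversion workaround mentioned above. It assumes the root cause is that Spark 1.6 parses timestamp strings using the JVM's default (local) time zone; utc_to_local is a hypothetical helper of my own (not a Spark or connector API), it uses only the Python standard library, and it ignores DST transitions.

from datetime import datetime, timedelta
import time

def utc_to_local(utc_str):
    # drop the '+0000' suffix and parse the wall-clock part
    utc = datetime.strptime(utc_str[:19], '%Y-%m-%d %H:%M:%S')
    # time.timezone is the local offset in seconds west of UTC (28800 for PST)
    local = utc - timedelta(seconds=time.timezone)
    return local.strftime('%Y-%m-%d %H:%M:%S')

start = utc_to_local('2016-03-12 00:30:00+0000')  # -> '2016-03-11 16:30:00'
end = utc_to_local('2016-03-12 04:30:00+0000')    # -> '2016-03-11 20:30:00'

If the JVM time zone really is the culprit, forcing it to UTC on both the driver and the executors (for example -Duser.timezone=UTC via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions) might be a cleaner fix than rewriting the query bounds, but I have not verified that.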