On Thu, Mar 17, 2016 at 3:02 PM, Andy Davidson <a...@santacruzintegration.com> wrote:

> I am using pyspark 1.6.0 and
> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
> data.
>
> The data is originally captured by a Spark Streaming app and written to
> Cassandra. The value of the timestamp comes from
>
>     rdd.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() … });
>
> I am confident the timestamp is stored correctly in Cassandra and that
> the clocks on the machines in my cluster are set correctly.
>
> I noticed that if I used Cassandra cqlsh to select a data set between two
> points in time, the row count did not match the row count I got when I did
> the same select in Spark using SQL. It appears that Spark SQL assumes all
> timestamp strings are in the local time zone.
>
> Here is what I expect (this is what is returned by cqlsh):
>
> cqlsh> select
>    ...   count(row_key) as num_samples, sum(count) as total, max(count) as max
>    ... from
>    ...   notification.json_timeseries
>    ... where
>    ...   row_key in ('red', 'blue')
>    ...   and created > '2016-03-12 00:30:00+0000'
>    ...   and created <= '2016-03-12 04:30:00+0000'
>    ... allow filtering;
>
>  num_samples | total | max
> -------------+-------+-----
>         3242 | 11277 |  17
>
> Here is my pyspark select statement. Notice the 'created' column encodes
> the timezone. I am running this on my local Mac (in the PST time zone) and
> connecting to my data center (which runs on UTC) over a VPN.
>
> rawDF = sqlContext.read\
>     .format("org.apache.spark.sql.cassandra")\
>     .options(table="json_timeseries", keyspace="notification")\
>     .load()
>
> rawDF.registerTempTable(tmpTableName)
>
> stmnt = "select \
>     row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
>     unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
>     to_utc_timestamp(created, 'gmt') as gmt \
> from \
>     rawTable \
> where \
>     (created > '{0}') and (created <= '{1}') \
>     and \
>     (row_key = 'red' or row_key = 'blue')".format(
>         '2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')
>
> rawDF = sqlCtx.sql(stmnt).cache()
What's the type of `created`? TimestampType? If yes, then when `created` is
compared to a string literal, the column is cast to string and the comparison
is done as a string comparison. The predicate effectively becomes

    cast(created as string) > '2016-03-12 00:30:00+0000'

and the string rendering of the timestamp uses the JVM's local time zone,
which would explain why your counts shift with the time zone of the machine
running the driver. Could you try this:

    sqlCtx.sql("select created, cast(created as string) from rawTable").show()

> I get different values for row count, max, …
>
> If I convert the UTC timestamp string to my local time zone, the row count
> matches the count returned by cqlsh:
>
> # pst works, matches cassandra cqlsh
> # .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')
>
> Am I doing something wrong in my pyspark code?
>
> Kind regards
>
> Andy
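
As a workaround, here is a minimal, untested sketch that avoids string
comparison altogether by filtering on epoch seconds. It assumes Spark >= 1.5
(for unix_timestamp()), that `created` is a TimestampType column, and that the
temp table is registered as `rawTable`; the helper name `utc_epoch` is just
for illustration:

    # Sketch only: compare on epoch seconds instead of strings, so the
    # predicate does not depend on the driver's local time zone.
    import calendar
    from datetime import datetime

    def utc_epoch(s):
        # Interpret the naive timestamp string as UTC, regardless of
        # the local time zone of the machine running this code.
        return calendar.timegm(
            datetime.strptime(s, '%Y-%m-%d %H:%M:%S').timetuple())

    start = utc_epoch('2016-03-12 00:30:00')  # 00:30 UTC
    end = utc_epoch('2016-03-12 04:30:00')    # 04:30 UTC

    # unix_timestamp(created) yields epoch seconds for a timestamp
    # column, so both sides of the comparison are plain integers.
    stmnt = ("select row_key, created, count "
             "from rawTable "
             "where unix_timestamp(created) > {0} "
             "and unix_timestamp(created) <= {1} "
             "and row_key in ('red', 'blue')").format(start, end)

    df = sqlCtx.sql(stmnt).cache()

If the counts from this query match cqlsh, that confirms the mismatch comes
from the string cast rather than from the data stored in Cassandra.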