I am using pyspark 1.6.0 and datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series data.

The data is originally captured by a Spark Streaming app and written to
Cassandra. The value of the timestamp comes from the Time argument of foreachRDD():

rdd.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
    // ... the Time argument supplies the timestamp written to Cassandra ...
});

I am confident that the timestamp is stored correctly in Cassandra and that
the clocks on the machines in my cluster are set correctly.

I noticed that if I used Cassandra CQLSH to select a data set between two
points in time, the row count did not match the row count I got when I ran
the same select in Spark SQL. It appears that Spark SQL assumes all
timestamp strings are in the local time zone.
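
If that is right, the same literal should give different epochs depending on
where the driver runs, e.g. (back-of-the-envelope calculation, not verified):

sqlContext.sql("select unix_timestamp('2016-03-12 00:30:00') as epoch").show()
# driver JVM in UTC : 1457742600
# driver JVM in PST : 1457771400  (8 hours later)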


Here is what I expect (this is what CQLSH returns):
cqlsh> select
   ...     count(row_key) as num_samples, sum(count) as total, max(count) as max
   ... from
   ...     notification.json_timeseries
   ... where
   ...     row_key in ('red', 'blue')
   ...     and created > '2016-03-12 00:30:00+0000'
   ...     and created <= '2016-03-12 04:30:00+0000'
   ... allow filtering;

 num_samples | total | max
-------------+-------+-----
        3242 | 11277 |  17


Here is my pyspark select statement. Notice that the 'created' timestamp
strings encode the timezone. I am running this on my local Mac (in the PST
timezone) and connecting to my data center (which runs on UTC) over a VPN.

rawDF = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table="json_timeseries", keyspace="notification")\
        .load() 
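
As a quick sanity check that 'created' really comes through from the
connector as a timestamp column and not a string:

# print the schema and a few rows to confirm 'created' is a timestamp column
rawDF.printSchema()
rawDF.select("row_key", "created", "count").show(5)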


rawDF.registerTempTable("rawTable")



stmnt = "select \
        row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
        unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
        to_utc_timestamp(created, 'gmt') as gmt \
        from \
        rawTable \
        where \
                (created > '{0}') and (created <= '{1}') \
                and \
                        (row_key = Œred' or row_key = Œblue¹) \
        )".format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')

rawDF = sqlContext.sql(stmnt).cache()




I get different values for row count, max, etc.

If I convert the UTC timestamp strings to my local timezone, the row count
matches the count returned by cqlsh:

# pst works, matches cassandra cqlsh
# .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')
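
i.e. roughly this kind of manual shift (PST is UTC-8 on 2016-03-12, DST had
not started yet):

from datetime import datetime, timedelta

pst_offset = timedelta(hours=8)                           # PST = UTC-8
local_start = datetime(2016, 3, 12, 0, 30) - pst_offset   # 2016-03-11 16:30:00
local_end   = datetime(2016, 3, 12, 4, 30) - pst_offset   # 2016-03-11 20:30:00
# then format local_start/local_end as strings and substitute them into the
# select statement above in place of the UTC bounds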

Am I doing something wrong in my pyspark code?
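
One thing I am considering trying (untested, so this may be off base) is
forcing the JVM default timezone to UTC on both the driver and the executors
so that timestamp strings are parsed in UTC:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
        .set("spark.driver.extraJavaOptions", "-Duser.timezone=UTC"))
# note: the driver option probably has to go in spark-defaults.conf or on the
# spark-submit command line instead, since the driver JVM is already running
# by the time this SparkConf is applied
sc = SparkContext(conf=conf)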


Kind regards

Andy


