Did the OP propose that?
On Mon, Sep 1, 2014 at 10:53 AM, Jack Krupansky <j...@basetechnology.com> wrote: > One comment on deletions – aren’t deletions kind of an anti-pattern for > modern data processing, such as sensor data, time series data, and social > media? I mean, isn’t it usually better to return a full history of the > data, with some aging scheme, and manage the tracking of which values are > “current” (or “recent”)? Shouldn’t we be looking for and promoting “write > once” approaches as a much stronger preference/pattern? Or maybe I should > say “write once and bulk delete on aging” rather than the exercise in > futility of doing a massive number of deletes and updates in place? > > -- Jack Krupansky > > *From:* Laing, Michael <michael.la...@nytimes.com> > *Sent:* Monday, September 1, 2014 9:33 AM > *To:* user@cassandra.apache.org > *Subject:* Re: Help with select IN query in cassandra > > This should work for your query requirements - 2 tables w same info > because disk is cheap and writes are fast so optimize for reads: > > CREATE TABLE sensor_asset ( > asset_id text, > event_time timestamp, > tuuid timeuuid, > sensor_reading map<text, text>, > sensor_serial_number text, > sensor_type int, > PRIMARY KEY ((asset_id), event_time) > ); > > CREATE TABLE sensor_tuuid ( > asset_id text, > event_time timestamp, > tuuid timeuuid, > sensor_reading map<text, text>, > sensor_serial_number text, > sensor_type int, > PRIMARY KEY (tuuid) > ); > > 1. Give me all sensor data for an asset: > > select * from sensor_asset where asset_id = <asset>; > > 2. Give me sensor data that matches a set of timeuuids: > > select * from sensor_tuuid where tuuid in (<tuuid1>, <tuuid2>, ...); > > 3. Give me all sensor data for an asset collected after | before | between > event_time(s): > > select * from sensor_asset where asset_id = <asset> and event_time > <ts1>; > select * from sensor_asset where asset_id = <asset> and event_time < > <ts1>; > select * from sensor_asset where asset_id = <asset> and event_time < > <ts1> and event_time > <ts2>; > > *** > > Many people (not me) handle sensor data, so there may be better overall > approaches considering volumes, deletion, compaction etc. > > But the above is simple and should make your current approach workable as > you iterate toward a complete solution. > > Cheers, > ml > > > > On Sun, Aug 31, 2014 at 11:08 PM, Subodh Nijsure <subodh.nijs...@gmail.com > > wrote: > >> Thanks for your help Michael. >> >> If specifying asset_id would help I can construct queries that can >> include asset_id >> >> So I have been "playing" around with PRIMARY KEY definition and >> following table definition >> >> CREATE TABLE sensor_info_table ( >> asset_id text, >> event_time timestamp, >> "timestamp" timeuuid, >> sensor_reading map<text, text>, >> sensor_serial_number text, >> sensor_type int, >> PRIMARY KEY ((asset_id, "timestamp"), event_time) >> ); >> >> It does what I want to do, and I removed the index for timestamp item >> since now it is part of primary key and thus my query like this works. >> >> SELECT * from sigsense.sensor_info_table where asset_id='3' AND >> timestamp IN ( >> 17830bb0-316a-11e4-800f-b888e30f5d17,16ddbdfe-316a-11e4-9f50-b888e30f5d17 >> ); >> >> But now this doesn't work it give >> >> SELECT * from sensor_info_table where asset_id='3' ; >> >> Bad Request: Partition key part timestamp must be restricted since >> preceding part is >> >> I am keeping index on event_time as I sometime need to query something >> "give me all data since time x" i.e. something like this works. >> >> SELECT * from sensor_info_table where event_time > '2014-08-31 >> 16:54:02-0700' ALLOW FILTERING; >> >> However if I do this things then this don't work: >> >> SELECT * from sensor_info_table where asset_id='3' AND event_time > >> '2014-08-31 16:54:02-0700'; >> >> Bad Request: Partition key part timestamp must be restricted since >> preceding part is >> >> Also I am not conformable with fact that I need to specify ALLOW >> FILTERING. >> >> I guess cassandra schema design task asks designer to write down >> queries before designing schema. >> >> For the above table definition I want to do following queries: >> >> - Give me all sensor data for given asset. >> - Give me sensor data that matches given set of timeuuids >> - Give me all sendor data for a given asset, that were collected after >> | before | between certain event_time. >> >> Given these query criteria how should I construct my schema? One >> thought has occurred to me is make three tables with each item >> asset_id , event_time, timeuuid as primary keys and depending on type >> of query choose the table to do query upon. That seems like a waste of >> resources (disk, cpu ), also increasing insert times(!) but thats the >> way things need to happen in cassandra world its okay. ( I am >> two-three weeks into learning about cassandra). >> >> -Subodh >> >> On Sun, Aug 31, 2014 at 6:44 PM, Laing, Michael >> <michael.la...@nytimes.com> wrote: >> > Oh it must be late - I missed the fact that you didn't want to specify >> > asset_id. The above queries will still work but you have to use 'allow >> > filtering' - generally not a good idea. I'll look again in the morning. >> > >> > >> > On Sun, Aug 31, 2014 at 9:41 PM, Laing, Michael < >> michael.la...@nytimes.com> >> > wrote: >> >> >> >> Hmm. Because the clustering key is (event_time, "timestamp"), >> event_time >> >> must be specified as well - hopefully that info is available to the ux. >> >> >> >> Unfortunately you will then hit another problem with your query: you >> are >> >> selecting a collection field... this will not work with IN on >> "timestamp". >> >> >> >> So you could select all the "timestamp"s for an asset_id/event_time: >> >>> >> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time = >> >>> 1231234; >> >> >> >> >> >> Or you could apply a range of "timestamp"s: >> >>> >> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time = >> >>> 1231234 and "timestamp" > 1d934af3-3178-11e4-ba8d-406c8f1838fa and >> >>> "timestamp" < 20b82021-3178-11e4-abc2-406c8f1838fa; >> >> >> >> >> >> BTW the secondary indices are not a good idea: high cardinality and of >> no >> >> use in this query that I can see. >> >> >> >> ml >> >> >> >> >> >> On Sun, Aug 31, 2014 at 8:40 PM, Subodh Nijsure < >> subodh.nijs...@gmail.com> >> >> wrote: >> >>> >> >>> Not really event time stamp is created by the sensor when it reads >> data >> >>> and timestamp is something server creates when inserting data into >> >>> cassandra db. At later point in time my django ux allows users to >> browse >> >>> this data and reference interesting data points via the timestamp >> field. The >> >>> timestamp field is my bridge between Sal and nosql world. >> >>> >> >>> Subodh >> >>> >> >>> On Aug 31, 2014 5:33 PM, "Laing, Michael" <michael.la...@nytimes.com> >> >>> wrote: >> >>>> >> >>>> Are event_time and timestamp essentially representing the same >> datetime? >> >>>> >> >>>> On Sunday, August 31, 2014, Subodh Nijsure <subodh.nijs...@gmail.com >> > >> >>>> wrote: >> >>>>> >> >>>>> I have following database schema >> >>>>> >> >>>>> CREATE TABLE sensor_info_table ( >> >>>>> asset_id text, >> >>>>> event_time timestamp, >> >>>>> "timestamp" timeuuid, >> >>>>> sensor_reading map<text, text>, >> >>>>> sensor_serial_number text, >> >>>>> sensor_type int, >> >>>>> PRIMARY KEY ((asset_id), event_time, "timestamp") >> >>>>> ); >> >>>>> >> >>>>> CREATE INDEX event_time_index ON sensor_info_table (event_time); >> >>>>> >> >>>>> CREATE INDEX timestamp_index ON sensor_info_table ("timestamp"); >> >>>>> >> >>>>> Now I am able to insert the data into this table, however I am >> unable >> >>>>> to do following query where I want to select items with specific >> >>>>> timeuuid values. >> >>>>> >> >>>>> It gives me following error. >> >>>>> >> >>>>> SELECT * from mydb.sensor_info_table where timestamp IN ( >> >>>>> bfdfa614-3166-11e4-a61d-b888e30f5d17 , >> >>>>> bf4521ac-3166-11e4-87a3-b888e30f5d17) ; >> >>>>> >> >>>>> Bad Request: PRIMARY KEY column "timestamp" cannot be restricted >> >>>>> (preceding column "event_time" is either not restricted or by a >> non-EQ >> >>>>> relation) >> >>>>> >> >>>>> What do I have to do to make this work? >> >>>>> >> >>>>> For what its worth I am using django for my front end development >> and >> >>>>> I am using "timestamp timeuuid" field as unique indentifier to >> >>>>> reference specific sensor reading from django framework -- since >> >>>>> cassandra doesn't have way to generate unique id upon insert (like >> >>>>> old-style rdms's auto-fields). >> >>>>> >> >>>>> >> >>>>> Below is software version info. >> >>>>> >> >>>>> show VERSION ; [cqlsh 4.1.1 | Cassandra 2.0.9 | CQL spec 3.1.1 | >> >>>>> Thrift protocol 19.39.0] >> >>>>> >> >>>>> I really don't understand what the error message preceeding column >> >>>>> "event_time" is either not restricted or by no-EQ relation? >> >>>>> >> >>>>> -Subodh Nijsure >> >> >> >> >> > >> > >