Re: DataModelling to query date range

2016-03-24 Thread Vidur Malik
Hi Chris,

I had something slightly different in mind. You would treat it as time
series data, and have one record for each of the days the route was valid.
In your case:
start   | end| valid
New York   Washington 2016-01-01
New York   Washington 2016-01-02
New York   Washington ...
New York   Washington ...
New York   Washington 2016-01-31

Now, your queries will work, I imagine. Again, this may look wasteful, but
the whole philosophy behind Cassandra is that data duplication is all good.

On Thursday, 24 March 2016, Chris Martin  wrote:

> Hi Vidur,
>
> I had a go at your solution but the problem is that it doesn't match
> routes which are valid all throughtout the range queried.  For example if I
> have  route that is valid for all of Jan 2016. I will have a table that
> looks something like this:
>
> start   | end| valid
> New York   Washington 2016-01-01
> New York   Washington 2016-01-31
>
> So if I query for ranges that have at least one bound outside Jan (e.g Jan
> 15 - Feb 15) then the query you gave will work fine.  If, however, I query
> for a range that is completely inside Jan e.g all routes valid on Jan 15th,
>  The I think I'll end up with a query like:
>
> SELECT * from routes where start = 'New York' and end = 'Washington' and 
> valid <= 2016-01-15 and valid >= 2016-01-15.
>
> which will return 0 results as it would only match routes that have a
> valid of 2016-01-15 exactly.
>
>  thanks,
>
> Chris
>
>
> On Wed, Mar 23, 2016 at 11:19 PM, Vidur Malik  > wrote:
>
>> Flip the problem over. Instead of storing validTo and validFrom, simply
>> store a valid field and partition by (start, end). This may sound wasteful,
>> but disk is cheap:
>>
>> CREATE TABLE routes (
>> start text,
>> end text,
>> valid timestamp,
>> PRIMARY KEY ((start, end), valid)
>> );
>>
>> Now, you can execute something like:
>>
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> valid <= 2016-01-31 and valid >= 2016-01-01.
>>
>>
>> On Wed, Mar 23, 2016 at 5:08 PM, Chris Martin > > wrote:
>>
>>> Hi all,
>>>
>>> I have a table that represents a train timetable and looks a bit like
>>> this:
>>>
>>> CREATE TABLE routes (
>>> start text,
>>> end text,
>>> validFrom timestamp,
>>> validTo timestamp,
>>> PRIMARY KEY (start, end, validFrom, validTo)
>>> );
>>>
>>> In this case validFrom is the date that the route becomes valid and
>>> validTo is the date that the route that stops becoming valid.
>>>
>>> If this was SQL I could write a query to find all valid routes between
>>> New York and Washington from Jan 1st 2016 to Jan 31st 2016 using something
>>> like:
>>>
>>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>>> validFrom <= 2016-01-31 and validTo >= 2016-01-01.
>>>
>>> As far as I can tell such a query is impossible with CQL and my current
>>> table structure.  I'm considering running a query like:
>>>
>>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>>> validFrom <= 2016-01-31
>>>
>>> And then filtering the rest of the data app side.  This doesn't seem
>>> ideal though as I'm going to end up fetching much more data (probably
>>> around an order of magnitude more) from Cassandra than I really want.
>>>
>>> Is there a better way to model the data?
>>>
>>> thanks,
>>>
>>> Chris
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> Vidur Malik
>>
>> [image: ShopKeep] 
>>
>> 800.820.9814
>> <8008209814>
>> [image: ShopKeep]  [image:
>> ShopKeep]  [image: ShopKeep]
>> 
>>
>
>

-- 

Vidur Malik

[image: ShopKeep] 

800.820.9814
<8008209814>
[image: ShopKeep]  [image: ShopKeep]
 [image: ShopKeep]



RE: DataModelling to query date range

2016-03-24 Thread Peer, Oded
You can change the table to support Multi-column slice restrictions

CREATE TABLE routes (
start text,
end text,
year int,
month int,
day int,
PRIMARY KEY (start, end, year, month, day)
);

Then using Multi-column slice restrictions you can query:

SELECT * from routes where start = 'New York' and end = 'Washington' and 
(year,month,day) >= (2016,1,1) and (year,month,day) <= (2016,1,31);

For more details about Multi-column slice restrictions read 
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause

Oded

From: Chris Martin [mailto:ch...@cmartinit.co.uk]
Sent: Thursday, March 24, 2016 9:40 AM
To: user@cassandra.apache.org
Subject: Re: DataModelling to query date range

Ah- that looks interesting!  I'm actaully still on cassandra 2.x but I was 
planning on updgrading anyway.  Once I do so I'll check this one out.


Chris


On Thu, Mar 24, 2016 at 2:57 AM, Henry M 
<henrymanm...@gmail.com<mailto:henrymanm...@gmail.com>> wrote:
I haven't tried the new SASI indexer but it may help: 
https://github.com/apache/cassandra/blob/trunk/doc/SASI.md


On Wed, Mar 23, 2016 at 2:08 PM, Chris Martin 
<ch...@cmartinit.co.uk<mailto:ch...@cmartinit.co.uk>> wrote:
Hi all,

I have a table that represents a train timetable and looks a bit like this:


CREATE TABLE routes (

start text,

end text,

validFrom timestamp,

validTo timestamp,

PRIMARY KEY (start, end, validFrom, validTo)

);

In this case validFrom is the date that the route becomes valid and validTo is 
the date that the route that stops becoming valid.

If this was SQL I could write a query to find all valid routes between New York 
and Washington from Jan 1st 2016 to Jan 31st 2016 using something like:

SELECT * from routes where start = 'New York' and end = 'Washington' and 
validFrom <= 2016-01-31 and validTo >= 2016-01-01.

As far as I can tell such a query is impossible with CQL and my current table 
structure.  I'm considering running a query like:

SELECT * from routes where start = 'New York' and end = 'Washington' and 
validFrom <= 2016-01-31
And then filtering the rest of the data app side.  This doesn't seem ideal 
though as I'm going to end up fetching much more data (probably around an order 
of magnitude more) from Cassandra than I really want.

Is there a better way to model the data?

thanks,

Chris









Re: DataModelling to query date range

2016-03-24 Thread Chris Martin
Ah- that looks interesting!  I'm actaully still on cassandra 2.x but I was
planning on updgrading anyway.  Once I do so I'll check this one out.


Chris


On Thu, Mar 24, 2016 at 2:57 AM, Henry M  wrote:

> I haven't tried the new SASI indexer but it may help:
> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
>
>
> On Wed, Mar 23, 2016 at 2:08 PM, Chris Martin 
> wrote:
>
>> Hi all,
>>
>> I have a table that represents a train timetable and looks a bit like
>> this:
>>
>> CREATE TABLE routes (
>> start text,
>> end text,
>> validFrom timestamp,
>> validTo timestamp,
>> PRIMARY KEY (start, end, validFrom, validTo)
>> );
>>
>> In this case validFrom is the date that the route becomes valid and
>> validTo is the date that the route that stops becoming valid.
>>
>> If this was SQL I could write a query to find all valid routes between
>> New York and Washington from Jan 1st 2016 to Jan 31st 2016 using something
>> like:
>>
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31 and validTo >= 2016-01-01.
>>
>> As far as I can tell such a query is impossible with CQL and my current
>> table structure.  I'm considering running a query like:
>>
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31
>>
>> And then filtering the rest of the data app side.  This doesn't seem
>> ideal though as I'm going to end up fetching much more data (probably
>> around an order of magnitude more) from Cassandra than I really want.
>>
>> Is there a better way to model the data?
>>
>> thanks,
>>
>> Chris
>>
>>
>>
>>
>


Re: DataModelling to query date range

2016-03-24 Thread Chris Martin
Hi Vidur,

I had a go at your solution but the problem is that it doesn't match routes
which are valid all throughtout the range queried.  For example if I have
 route that is valid for all of Jan 2016. I will have a table that looks
something like this:

start   | end| valid
New York   Washington 2016-01-01
New York   Washington 2016-01-31

So if I query for ranges that have at least one bound outside Jan (e.g Jan
15 - Feb 15) then the query you gave will work fine.  If, however, I query
for a range that is completely inside Jan e.g all routes valid on Jan 15th,
 The I think I'll end up with a query like:

SELECT * from routes where start = 'New York' and end = 'Washington'
and valid <= 2016-01-15 and valid >= 2016-01-15.

which will return 0 results as it would only match routes that have a valid
of 2016-01-15 exactly.

 thanks,

Chris


On Wed, Mar 23, 2016 at 11:19 PM, Vidur Malik  wrote:

> Flip the problem over. Instead of storing validTo and validFrom, simply
> store a valid field and partition by (start, end). This may sound wasteful,
> but disk is cheap:
>
> CREATE TABLE routes (
> start text,
> end text,
> valid timestamp,
> PRIMARY KEY ((start, end), valid)
> );
>
> Now, you can execute something like:
>
> SELECT * from routes where start = 'New York' and end = 'Washington' and 
> valid <= 2016-01-31 and valid >= 2016-01-01.
>
>
> On Wed, Mar 23, 2016 at 5:08 PM, Chris Martin 
> wrote:
>
>> Hi all,
>>
>> I have a table that represents a train timetable and looks a bit like
>> this:
>>
>> CREATE TABLE routes (
>> start text,
>> end text,
>> validFrom timestamp,
>> validTo timestamp,
>> PRIMARY KEY (start, end, validFrom, validTo)
>> );
>>
>> In this case validFrom is the date that the route becomes valid and
>> validTo is the date that the route that stops becoming valid.
>>
>> If this was SQL I could write a query to find all valid routes between
>> New York and Washington from Jan 1st 2016 to Jan 31st 2016 using something
>> like:
>>
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31 and validTo >= 2016-01-01.
>>
>> As far as I can tell such a query is impossible with CQL and my current
>> table structure.  I'm considering running a query like:
>>
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31
>>
>> And then filtering the rest of the data app side.  This doesn't seem
>> ideal though as I'm going to end up fetching much more data (probably
>> around an order of magnitude more) from Cassandra than I really want.
>>
>> Is there a better way to model the data?
>>
>> thanks,
>>
>> Chris
>>
>>
>>
>>
>
>
> --
>
> Vidur Malik
>
> [image: ShopKeep] 
>
> 800.820.9814
> <8008209814>
> [image: ShopKeep]  [image: ShopKeep]
>  [image: ShopKeep]
> 
>


Re: DataModelling to query date range

2016-03-23 Thread Vidur Malik
Flip the problem over. Instead of storing validTo and validFrom, simply
store a valid field and partition by (start, end). This may sound wasteful,
but disk is cheap:

CREATE TABLE routes (
start text,
end text,
valid timestamp,
PRIMARY KEY ((start, end), valid)
);

Now, you can execute something like:

SELECT * from routes where start = 'New York' and end = 'Washington'
and valid <= 2016-01-31 and valid >= 2016-01-01.


On Wed, Mar 23, 2016 at 5:08 PM, Chris Martin  wrote:

> Hi all,
>
> I have a table that represents a train timetable and looks a bit like this:
>
> CREATE TABLE routes (
> start text,
> end text,
> validFrom timestamp,
> validTo timestamp,
> PRIMARY KEY (start, end, validFrom, validTo)
> );
>
> In this case validFrom is the date that the route becomes valid and
> validTo is the date that the route that stops becoming valid.
>
> If this was SQL I could write a query to find all valid routes between New
> York and Washington from Jan 1st 2016 to Jan 31st 2016 using something like:
>
> SELECT * from routes where start = 'New York' and end = 'Washington' and 
> validFrom <= 2016-01-31 and validTo >= 2016-01-01.
>
> As far as I can tell such a query is impossible with CQL and my current
> table structure.  I'm considering running a query like:
>
> SELECT * from routes where start = 'New York' and end = 'Washington' and 
> validFrom <= 2016-01-31
>
> And then filtering the rest of the data app side.  This doesn't seem ideal
> though as I'm going to end up fetching much more data (probably around an
> order of magnitude more) from Cassandra than I really want.
>
> Is there a better way to model the data?
>
> thanks,
>
> Chris
>
>
>
>


-- 

Vidur Malik

[image: ShopKeep] 

800.820.9814
<8008209814>
[image: ShopKeep]  [image: ShopKeep]
 [image: ShopKeep]