Re: Versioning in cassandra

dawood abdullah Wed, 04 Sep 2013 06:13:36 -0700

Michael,

Your approach solves the problem, thanks for the solution. I was thinking
of another approach as well where in I would create another column family
say file_archive, so whenever an update is made to the File table, I will
create a new version in the File and move the old version to the new
file_archive table. Please let me know if the second approach is fine.


Regards,
Dawood


On Wed, Sep 4, 2013 at 2:47 AM, Laing, Michael <michael.la...@nytimes.com>wrote:

> I use the technique described in my previous message to handle millions of
> messages and their versions.
>
> Actually, I use timeuuid's instead of timestamps, as they have more
> 'uniqueness'. Also I index my maps by a timeuuid that is the complement
> (based on a future date) of a current timeuuid. Since maps are kept sorted
> by key, this means I can just pop off the first one to get the most recent.
>
> The downside of this approach is that you get more stuff returned to you
> from Cassandra than you need. To mitigate that I queue a job to examine and
> correct the situation if, upon doing a read, the number of versions for a
> particular key is higher than some threshold, e.g. 50. There are many ways
> to approach this problem.
>
> Our actual implementation proceeds to another level, as we also have
> replicas of versions. This happens because we process important
> transactions in parallel and can expect up to 9 replicas of each version.
> We journal them all and use them for reporting latencies in our processing
> pipelines as well as for replay when we need to recover application state.
>
> Regards,
>
> Michael
>
>
> On Tue, Sep 3, 2013 at 3:15 PM, Laing, Michael 
> <michael.la...@nytimes.com>wrote:
>
>> try the following. -ml
>>
>> -- put this in <file> and run using 'cqlsh -f <file>
>>
>> DROP KEYSPACE latest;
>>
>> CREATE KEYSPACE latest WITH replication = {
>>     'class': 'SimpleStrategy',
>>     'replication_factor' : 1
>> };
>>
>> USE latest;
>>
>> CREATE TABLE file (
>>     parentid text, -- row_key, same for each version
>>     id text, -- column_key, same for each version
>>     contenttype map<timestamp, text>, -- differs by version, version is
>> the key to the map
>>     PRIMARY KEY (parentid, id)
>> );
>>
>> update file set contenttype = contenttype + {'2011-03-04':'pdf1'} where
>> parentid = 'd1' and id = 'f1';
>> update file set contenttype = contenttype + {'2011-03-05':'pdf2'} where
>> parentid = 'd1' and id = 'f1';
>> update file set contenttype = contenttype + {'2011-03-04':'pdf3'} where
>> parentid = 'd1' and id = 'f2';
>> update file set contenttype = contenttype + {'2011-03-05':'pdf4'} where
>> parentid = 'd1' and id = 'f2';
>>
>> select * from file where parentid = 'd1';
>>
>> -- returns:
>>
>> -- parentid | id | contenttype
>>
>> ------------+----+--------------------------------------------------------------------------
>> --       d1 | f1 | {'2011-03-04 00:00:00-0500': 'pdf1', '2011-03-05
>> 00:00:00-0500': 'pdf2'}
>> --       d1 | f2 | {'2011-03-04 00:00:00-0500': 'pdf3', '2011-03-05
>> 00:00:00-0500': 'pdf4'}
>>
>> -- use an app to pop off the latest version from the map
>>
>> -- map other varying fields using the same technique as used for
>> contenttype
>>
>>
>>
>> On Tue, Sep 3, 2013 at 2:31 PM, Vivek Mishra <mishra.v...@gmail.com>wrote:
>>
>>> create table file(id text , parentid text,contenttype text,version
>>> timestamp, descr text, name text, PRIMARY KEY(id,version) ) WITH CLUSTERING
>>> ORDER BY (version DESC);
>>>
>>> insert into file (id, parentid, version, contenttype, descr, name)
>>> values ('f2', 'd1', '2011-03-06', 'pdf', 'f2 file', 'file1');
>>> insert into file (id, parentid, version, contenttype, descr, name)
>>> values ('f2', 'd1', '2011-03-05', 'pdf', 'f2 file', 'file1');
>>> insert into file (id, parentid, version, contenttype, descr, name)
>>> values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>>> insert into file (id, parentid, version, contenttype, descr, name)
>>> values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1');
>>> create index on file(parentid);
>>>
>>>
>>> select * from file where id='f1' and parentid='d1' limit 1;
>>>
>>> select * from file where parentid='d1' limit 1;
>>>
>>>
>>> Will it work for you?
>>>
>>> -Vivek
>>>
>>>
>>>
>>>
>>> On Tue, Sep 3, 2013 at 11:29 PM, Vivek Mishra <mishra.v...@gmail.com>wrote:
>>>
>>>> My bad. I did miss out to read "latest version" part.
>>>>
>>>> -Vivek
>>>>
>>>>
>>>> On Tue, Sep 3, 2013 at 11:20 PM, dawood abdullah <
>>>> muhammed.daw...@gmail.com> wrote:
>>>>
>>>>> I have tried with both the options creating secondary index and also
>>>>> tried adding parentid to primary key, but I am getting all the files with
>>>>> parentid 'yyy', what I want is the latest version of file with the
>>>>> combination of parentid, fileid. Say below are the records inserted in the
>>>>> file table:
>>>>>
>>>>> insert into file (id, parentid, version, contenttype, description,
>>>>> name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1');
>>>>> insert into file (id, parentid, version, contenttype, description,
>>>>> name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>>>>> insert into file (id, parentid, version, contenttype, description,
>>>>> name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>>>>> insert into file (id, parentid, version, contenttype, description,
>>>>> name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1');
>>>>>
>>>>> I want to write a query which returns me second and last record and
>>>>> not the first and third record, because for the first and third record
>>>>> there exists a latest version, for the combination of id and parentid.
>>>>>
>>>>> I am confused If at all this is achievable, please suggest.
>>>>>
>>>>> Dawood
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 3, 2013 at 10:58 PM, Vivek Mishra 
>>>>> <mishra.v...@gmail.com>wrote:
>>>>>
>>>>>> create secondary index over parentid.
>>>>>> OR
>>>>>> make it part of clustering key
>>>>>>
>>>>>> -Vivek
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 3, 2013 at 10:42 PM, dawood abdullah <
>>>>>> muhammed.daw...@gmail.com> wrote:
>>>>>>
>>>>>>> Jan,
>>>>>>>
>>>>>>> The solution you gave works spot on, but there is one more
>>>>>>> requirement I forgot to mention. Following is my table structure
>>>>>>>
>>>>>>> CREATE TABLE file (
>>>>>>>   id text,
>>>>>>>   contenttype text,
>>>>>>>   createdby text,
>>>>>>>   createdtime timestamp,
>>>>>>>   description text,
>>>>>>>   name text,
>>>>>>>   parentid text,
>>>>>>>   version timestamp,
>>>>>>>   PRIMARY KEY (id, version)
>>>>>>>
>>>>>>> ) WITH CLUSTERING ORDER BY (version DESC);
>>>>>>>
>>>>>>>
>>>>>>> The query (select * from file where id = 'xxx' limit 1;) provided
>>>>>>> solves the problem of finding the latest version file. But I have one 
>>>>>>> more
>>>>>>> requirement of finding all the latest version files having parentid say
>>>>>>> 'yyy'.
>>>>>>>
>>>>>>> Please suggest how can this query be achieved.
>>>>>>>
>>>>>>> Dawood
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 3, 2013 at 12:43 AM, dawood abdullah <
>>>>>>> muhammed.daw...@gmail.com> wrote:
>>>>>>>
>>>>>>>> In my case version can be timestamp as well. What do you suggest
>>>>>>>> version number to be, do you see any problems if I keep version as 
>>>>>>>> counter
>>>>>>>> / timestamp ?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen <
>>>>>>>> jan.algermis...@nordsc.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 02.09.2013, at 20:44, dawood abdullah <
>>>>>>>>> muhammed.daw...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> > Requirement is like I have a column family say File
>>>>>>>>> >
>>>>>>>>> > create table file(id text primary key, fname text, version int,
>>>>>>>>> mimetype text, content text);
>>>>>>>>> >
>>>>>>>>> > Say, I have few records inserted, when I modify an existing
>>>>>>>>> record (content is modified) a new version needs to be created. As I 
>>>>>>>>> need
>>>>>>>>> to have provision to revert to back any old version whenever required.
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>> So, can version be a timestamp? Or does it need to be an integer?
>>>>>>>>>
>>>>>>>>> In the former case, make use of C*'s ordering like so:
>>>>>>>>>
>>>>>>>>> CREATE TABLE file (
>>>>>>>>>    file_id text,
>>>>>>>>>    version timestamp,
>>>>>>>>>    fname text,
>>>>>>>>>    ....
>>>>>>>>>    PRIMARY KEY (file_id,version)
>>>>>>>>> ) WITH CLUSTERING ORDER BY (version DESC);
>>>>>>>>>
>>>>>>>>> Get the latest file version with
>>>>>>>>>
>>>>>>>>> select * from file where file_id = 'xxx' limit 1;
>>>>>>>>>
>>>>>>>>> If it has to be an integer, use counter columns.
>>>>>>>>>
>>>>>>>>> Jan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > Regards,
>>>>>>>>> > Dawood
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen <
>>>>>>>>> jan.algermis...@nordsc.com> wrote:
>>>>>>>>> > Hi Dawood,
>>>>>>>>> >
>>>>>>>>> > On 02.09.2013, at 16:36, dawood abdullah <
>>>>>>>>> muhammed.daw...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > > Hi
>>>>>>>>> > > I have a requirement of versioning to be done in Cassandra.
>>>>>>>>> > >
>>>>>>>>> > > Following is my column family definition
>>>>>>>>> > >
>>>>>>>>> > > create table file_details(id text primary key, fname text,
>>>>>>>>> version int, mimetype text);
>>>>>>>>> > >
>>>>>>>>> > > I have a secondary index created on fname column.
>>>>>>>>> > >
>>>>>>>>> > > Whenever I do an insert for the same 'fname', the version
>>>>>>>>> should be incremented. And when I retrieve a row with fname it should
>>>>>>>>> return me the latest version row.
>>>>>>>>> > >
>>>>>>>>> > > Is there a better way to do in Cassandra? Please suggest what
>>>>>>>>> approach needs to be taken.
>>>>>>>>> >
>>>>>>>>> > Can you explain more about your use case?
>>>>>>>>> >
>>>>>>>>> > If the version need not be a small number, but could be a
>>>>>>>>> timestamp, you could make use of C*'s ordering feature , have the 
>>>>>>>>> database
>>>>>>>>> set the new version as a timestamp and retrieve the latest one with a
>>>>>>>>> simple LIMIT 1 query. (I'll explain more when this is an option for 
>>>>>>>>> you).
>>>>>>>>> >
>>>>>>>>> > Jan
>>>>>>>>> >
>>>>>>>>> > P.S. Me being a REST/HTTP head, an alarm rings when I see
>>>>>>>>> 'version' next to 'mimetype' :-) What exactly are you versioning here?
>>>>>>>>> Maybe we can even change the situation from a functional POV?
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > >
>>>>>>>>> > > Regards,
>>>>>>>>> > >
>>>>>>>>> > > Dawood
>>>>>>>>> > >
>>>>>>>>> > >
>>>>>>>>> > >
>>>>>>>>> > >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Versioning in cassandra

Reply via email to