text partition key Bloom filters fp is 1 always, why?

2015-05-13 Thread Anishek Agarwal
Hello,

I have a text partition key for one of our CFs. The cfstats output for that
table seems to show that the Bloom filter false positive ratio is always 1.
Also, the Bloom filter is using very little space.

Do Bloom filters not work well with text partition keys? I assume this is
because the filter has no way to detect the length of the text and hence would
have a very high false positive ratio.

The text partition key is built by concatenating a long + "_" +
epoch_time_in_hours. Would it be better to have a composite partition
key of (long, epoch_time_in_hours) rather than combining them into a text
key?


Thanks
anishek
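
For what it's worth, a Bloom filter does not inspect the key's type or length: it hashes the serialized key bytes into a fixed number of bit positions. The following is a minimal Python sketch of that idea, not Cassandra's implementation; the `TinyBloom` class, its sizes, the SHA-256 based hashing, and the sample keys are all assumptions made for illustration.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter; illustrative only, not Cassandra's implementation."""

    def __init__(self, num_bits=1 << 16, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive two 64-bit hashes from one digest and combine them
        # (double hashing); the key's length plays no role afterwards.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def false_positive_ratio(bloom, absent_keys):
    """Fraction of never-added keys that the filter still reports as present."""
    hits = sum(1 for k in absent_keys if bloom.might_contain(k))
    return hits / len(absent_keys)


# Keys shaped like the ones in the question: "<long>_<epoch_time_in_hours>".
bloom = TinyBloom()
for user in range(1000):
    bloom.add(f"{user}_397000".encode())

absent = [f"{user}_397001".encode() for user in range(1000)]
ratio = false_positive_ratio(bloom, absent)
print(ratio)
```

In this toy setup the ratio for keys that were never added should come out close to zero, which suggests that the text encoding by itself is unlikely to be what drives a false positive ratio of 1.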


Count Number of Users in Cassandra column family?

2015-05-13 Thread Check Peck
I have a table like this in Cassandra:

CREATE TABLE DATA_HOLDER (USER_ID TEXT, RECORD_NAME TEXT, RECORD_VALUE
BLOB, PRIMARY KEY (USER_ID, RECORD_NAME));

I want to count the distinct USER_ID values in the table above. Is there any
way I can do that?

My Cassandra version is:

[cqlsh 4.1.1 | Cassandra 2.0.10.71 | DSE 4.5.2 | CQL spec 3.1.1 |
Thrift protocol 19.39.0]
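
One possible approach, assuming the CQL version in use supports it: `SELECT DISTINCT USER_ID FROM DATA_HOLDER` should return each partition key once, and the client can then count the returned rows. If that is not available, the client can scan (USER_ID, RECORD_NAME) pairs and deduplicate them itself. The Python sketch below shows only that client-side counting step; `count_distinct_users` is a hypothetical helper and `rows` is made-up data standing in for a paged result set.

```python
from typing import Iterable, Tuple


def count_distinct_users(rows: Iterable[Tuple[str, str]]) -> int:
    """Count distinct USER_ID values in a stream of (USER_ID, RECORD_NAME) rows."""
    seen = set()
    for user_id, _record_name in rows:
        seen.add(user_id)
    return len(seen)


# Made-up rows standing in for the result of a table scan.
rows = [
    ("alice", "email"),
    ("alice", "phone"),
    ("bob", "email"),
    ("carol", "email"),
    ("bob", "phone"),
]
distinct_users = count_distinct_users(rows)
print(distinct_users)
```

On a large table a full scan like this can be slow, so it is probably better run as an occasional batch job than as an interactive query.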


Re: Viewing Cassandra's Internal table Structure in a CQL world

2015-05-13 Thread Jonathan Haddad
In Cassandra 3.0 there will be a massive rewrite of what an sstable
even is, and the CLI will be totally useless for inspecting it. There
won't be "column names" anymore, timestamps will be stored once per
row (assuming they're the same), and there will be a whole slew of other
optimizations. If you want to look at a table, a community project to
inspect a table byte by byte will be necessary, since everything legacy
that made CQL inefficient on disk will be dropped.

https://issues.apache.org/jira/browse/CASSANDRA-8099

On Wed, May 13, 2015 at 1:45 PM, Moshe Kranc  wrote:
> Yes, cassandra-cli still works. But it also tells me that I should switch to
> CQL, and it doesn't want to display CQL3 tables. My question isn't how to
> get the info today – it's whether that info will still be available in the
> future.
>
>
>
> From: DuyHai Doan [mailto:doanduy...@gmail.com]
> Sent: Wednesday, May 13, 2015 10:40 PM
> To: user@cassandra.apache.org
> Subject: Re: Viewing Cassandra's Internal table Structure in a CQL world
>
>
>
> I think that you can still use cassandra-cli from 2.0.x to look into
> internal table structure. Of course you will see bytes instead of "readable"
> values but it's better than nothing. It's already the case for CQL
> collections when you're trying to decode them using cassandra-cli
>
>
>
> On Wed, May 13, 2015 at 9:27 PM, Moshe Kranc  wrote:
>
> CQL is the future, and it provides a great high-level view of keyspaces. (I
> am drinking the Kool-Aid.) But, I believe every C* developer needs to also
> look at the table's internal structure, e.g., what do the column names
> actually look  like. Only by keeping an eye on the physical structure can
> you tune your queries for best performance.
>
>
>
> To date, I have been using cassandra-cli to view the table's internal
> structure. But, I get bombarded with all kinds of warnings about how I
> should switch to CQL and stop using a deprecated product.
>
>
>
> My question: After the revolution (once Cassandra-cli has been retired), how
> am I supposed to look at the table's internal structure? Or, do you believe
> that ultimately there will be no need or value in  looking at the internal
> structure?  (I would disagree.)
>
>
>
>



-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade


RE: Viewing Cassandra's Internal table Structure in a CQL world

2015-05-13 Thread Moshe Kranc
Yes, cassandra-cli still works. But it also tells me that I should switch to 
CQL, and it doesn't want to display CQL3 tables. My question isn't how to get 
the info today – it's whether that info will still be available in the future.

 

From: DuyHai Doan [mailto:doanduy...@gmail.com] 
Sent: Wednesday, May 13, 2015 10:40 PM
To: user@cassandra.apache.org
Subject: Re: Viewing Cassandra's Internal table Structure in a CQL world

 

I think that you can still use cassandra-cli from 2.0.x to look into internal 
table structure. Of course you will see bytes instead of "readable" values but 
it's better than nothing. It's already the case for CQL collections when you're 
trying to decode them using cassandra-cli

 

On Wed, May 13, 2015 at 9:27 PM, Moshe Kranc wrote:

CQL is the future, and it provides a great high-level view of keyspaces. (I am 
drinking the Kool-Aid.) But, I believe every C* developer needs to also look at 
the table's internal structure, e.g., what do the column names actually look  
like. Only by keeping an eye on the physical structure can you tune your 
queries for best performance.

 

To date, I have been using cassandra-cli to view the table's internal 
structure. But, I get bombarded with all kinds of warnings about how I should 
switch to CQL and stop using a deprecated product.

 

My question: After the revolution (once Cassandra-cli has been retired), how am 
I supposed to look at the table's internal structure? Or, do you believe that 
ultimately there will be no need or value in  looking at the internal 
structure?  (I would disagree.)

 

 



Re: Viewing Cassandra's Internal table Structure in a CQL world

2015-05-13 Thread DuyHai Doan
I think that you can still use cassandra-cli from 2.0.x to look into
internal table structure. Of course you will see bytes instead of
"readable" values but it's better than nothing. It's already the case for
CQL collections when you're trying to decode them using cassandra-cli

On Wed, May 13, 2015 at 9:27 PM, Moshe Kranc  wrote:

> CQL is the future, and it provides a great high-level view of keyspaces.
> (I am drinking the Kool-Aid.) But, I believe every C* developer needs to
> also look at the table's internal structure, e.g., what do the column names
> actually look  like. Only by keeping an eye on the physical structure can
> you tune your queries for best performance.
>
>
>
> To date, I have been using cassandra-cli to view the table's internal
> structure. But, I get bombarded with all kinds of warnings about how I
> should switch to CQL and stop using a deprecated product.
>
>
>
> My question: After the revolution (once Cassandra-cli has been retired),
> how am I supposed to look at the table's internal structure? Or, do you
> believe that ultimately there will be no need or value in  looking at the
> internal structure?  (I would disagree.)
>
>
>


Viewing Cassandra's Internal table Structure in a CQL world

2015-05-13 Thread Moshe Kranc
CQL is the future, and it provides a great high-level view of keyspaces. (I
am drinking the Kool-Aid.) But, I believe every C* developer needs to also
look at the table's internal structure, e.g., what do the column names
actually look  like. Only by keeping an eye on the physical structure can
you tune your queries for best performance.

 

To date, I have been using cassandra-cli to view the table's internal
structure. But, I get bombarded with all kinds of warnings about how I
should switch to CQL and stop using a deprecated product.

 

My question: After the revolution (once Cassandra-cli has been retired), how
am I supposed to look at the table's internal structure? Or, do you believe
that ultimately there will be no need or value in  looking at the internal
structure?  (I would disagree.)

 



Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Robert Coli
On Wed, May 13, 2015 at 4:37 AM, Peer, Oded  wrote:

>  The cost of issuing an UPDATE that won’t update anything is compaction
> overhead. Since you stated it’s rare for rows to be updated then the
> overhead should be negligible.
>

It's also the cost of seeking into tables which contain the row fragment.

UPDATE with identical timestamp is generally speaking an anti-pattern in
log structured immutable storage; only acceptable if done very rarely...
probably not if done every hour.

=Rob


Re: Consistency Issues

2015-05-13 Thread Robert Wille
Timestamps have millisecond granularity. If you make multiple writes within the 
same millisecond, then the outcome is not deterministic.

Also, make sure you are running NTP. Clock skew will manifest itself similarly.
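
One way to sidestep same-millisecond ties, sketched under assumptions: have the client hand out strictly increasing microsecond timestamps and pass them to the write with `USING TIMESTAMP`. `MonotonicMicros` below is a hypothetical helper, not part of any driver, and it only orders writes issued from a single client process; skew between machines still needs NTP.

```python
import threading
import time


class MonotonicMicros:
    """Hypothetical helper: strictly increasing microsecond timestamps."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._last = 0
        self._lock = threading.Lock()

    def next(self) -> int:
        now = int(self._clock() * 1_000_000)
        with self._lock:
            # If the clock has not advanced (or went backwards), bump by 1 us.
            self._last = max(now, self._last + 1)
            return self._last


# Two writes issued at the same instant still get distinct, ordered timestamps.
frozen = MonotonicMicros(clock=lambda: 1431500000.123)
first = frozen.next()
second = frozen.next()
print(first, second)
```

With distinct timestamps the later write always carries the higher value, so the outcome no longer depends on how ties are broken.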

On May 13, 2015, at 3:47 AM, Jared Rodriguez wrote:

Thanks for the feedback.  We have dug in deeper and upgraded to Cassandra
2.0.14 and are seeing the same issue.  What appears to be happening is that if
a record is initially written, then the first read is fine.  But if we
immediately update that record with a second write, then the second read
is problematic.

We have a 4 node cluster and a replication factor of 2.  What seems to be
happening is that on the initial write the record is sent to nodes A and B.
If a secondary write (update) of the record occurs while the record is in the
memtable and not yet written to the sstable of A or B, then the next read
returns nothing.

We are continuing to dig in and get as much detail as possible before opening 
this as a JIRA.

On Tue, May 12, 2015 at 6:51 PM, Robert Coli wrote:
On Tue, May 12, 2015 at 12:35 PM, Michael Shuler wrote:
This is a 4 node cluster running Cassandra 2.0.6

Can you reproduce the same issue on 2.0.14? (or better yet, the cassandra-2.0 
branch HEAD, which will soon ship 2.0.15) If you get the same results, please, 
open a JIRA with the reproduction steps.

And if you do file such a JIRA, please let the list know the JIRA URL, to close 
the loop!

=Rob




--
Jared Rodriguez




Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Robert Wille
You probably shouldn’t use batch updates. Your records are probably unrelated 
to each other, and therefore there really is no reason to use batches. Use 
asynchronous queries to improve performance. executeAsync() is your friend.

A common misconception is that batches will improve performance. They don’t. 
Mostly they just increase the load on your cluster.

In my project, I have written a collection of classes that help me manage 
asynchronous queries. They aren’t complicated and didn’t take very long to 
write, but they take away most of the pain that occurs when you need to execute 
a whole bunch of asynchronous queries, and want to meter them out, wait for 
them to complete, etc. I probably execute 75% of my queries asynchronously.
It's relatively painless.
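
A minimal sketch of that kind of helper, written in Python rather than against the Java driver: a semaphore caps how many asynchronous queries are in flight, and the caller waits for all of them at the end. `run_throttled` and `fake_query` are hypothetical names; the thread pool only stands in for a driver call that returns a future, such as executeAsync().

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_throttled(execute_async, statements, max_in_flight=32):
    """Issue every statement asynchronously, never more than max_in_flight at once.

    execute_async(statement) must return a future offering add_done_callback()
    and result(), as concurrent.futures.Future does.
    """
    permits = threading.Semaphore(max_in_flight)
    futures = []
    for statement in statements:
        permits.acquire()  # blocks while the limit is reached
        future = execute_async(statement)
        future.add_done_callback(lambda _f: permits.release())
        futures.append(future)
    # Wait for everything; result() re-raises any failure.
    return [f.result() for f in futures]


# Stand-in for a driver: a thread pool that records peak concurrency.
pool = ThreadPoolExecutor(max_workers=16)
lock = threading.Lock()
stats = {"active": 0, "peak": 0}


def fake_query(statement):
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.001)  # pretend to wait on the cluster
    with lock:
        stats["active"] -= 1
    return statement * 2


results = run_throttled(
    lambda s: pool.submit(fake_query, s), range(200), max_in_flight=4
)
pool.shutdown()
print(len(results), stats["peak"])
```

The permit is released only when a query completes, so the number of outstanding queries stays at or below `max_in_flight` even though the pool itself could run more.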

On May 13, 2015, at 6:51 AM, Ali Akhtar wrote:

Can lightweight txns be used in a batch update?

On Wed, May 13, 2015 at 5:48 PM, Ali Akhtar wrote:
The 6k is only the starting value; it's expected to scale up to ~200 million
records.

On Wed, May 13, 2015 at 5:44 PM, Robert Wille wrote:
You could use lightweight transactions to update only if the record is newer. 
It doesn’t avoid the read, it just happens under the covers, so it’s not really 
going to be faster compared to a read-before-write pattern (which is an 
anti-pattern, BTW). It is probably the easiest way to avoid getting a whole 
bunch of copies of each record.

But even with a read-before-write pattern, I don’t understand why you are 
worried about 6K records per hour. That’s nothing. You’re probably looking at 
several milliseconds to do the read and write for each record (depending on 
your storage, RF and CL), so you’re probably looking at under a minute to do 6K 
records. If you do them in parallel, you’re probably looking at several 
seconds. I don’t get why something that probably takes less than a minute that 
is done once an hour is a problem.

BTW, I wouldn’t do all 6K in parallel. I’d use some kind of limiter (e.g. a 
semaphore) to ensure that you don’t execute more than X queries at a time.

Robert

On May 13, 2015, at 6:20 AM, Ali Akhtar wrote:

But your previous email talked about when T1 is different:

> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then you 
> store V’ with timestamp T1.

What if you issue an update twice, but with the same timestamp? E.g if you ran:

Update  where foo=bar USING TIMESTAMP = 1000

and 1 hour later, you ran exactly the same query again. In this case, the value 
of T is the same for both queries. Would that still cause multiple values to be 
stored?

On Wed, May 13, 2015 at 5:17 PM, Peer, Oded wrote:
It will cause an overhead (compaction and read) as I described in the previous 
email.

From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 3:13 PM

To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)


> I don’t understand the ETL use case and its relevance here. Can you provide 
> more details?

Basically, every 1 hour a job runs which queries an external API and gets some 
records. Then, I want to take only new or updated records, and insert / update 
them in cassandra. For records that are already in cassandra and aren't 
modified, I want to ignore them.

Each record returns a lastModified datetime, I want to use that to determine 
whether a record was changed or not (if it was, it'd be updated, if not, it'd 
be ignored).

The issue was, I'm having to do a 'select lastModified from table where id = ?' 
query for every record, in order to determine if db lastModified < api 
lastModified or not. I was wondering if there was a way to avoid that.

If I use 'USING TIMESTAMP', would subsequent updates where lastModified is a 
value that was previously used, still create that overhead, or will they be 
ignored?

E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued 
another update where TIMESTAMP is still X, will that 2nd update essentially get 
ignored, or will it cause any overhead?

On Wed, May 13, 2015 at 5:02 PM, Peer, Oded wrote:
USING TIMESTAMP doesn’t avoid compaction overhead.
When you modify data the value is stored along with a timestamp indicating the 
timestamp of the value.
Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then you 
store V’ with timestamp T1.
Now you have two values of V in the DB: <V,T2>, <V’,T1>
When you read the value of V from the DB you read both <V,T2> and <V’,T1>;
Cassandra resolves the conflict by comparing the timestamps and returns V.
Compaction will later take care of it and remove <V’,T1> from the DB.
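
The read and compaction steps described above can be sketched as a toy model in Python. This is not Cassandra code: each sstable is modelled as a plain dict from key to a (value, timestamp) pair, the names `resolve`, `read` and `compact` are made up, and ties between equal timestamps are deliberately not modelled.

```python
def resolve(versions):
    """Return the (value, timestamp) pair with the highest timestamp."""
    return max(versions, key=lambda v: v[1])


def read(sstables, key):
    """Collect every stored version of key and reconcile by timestamp."""
    versions = [table[key] for table in sstables if key in table]
    return resolve(versions)[0] if versions else None


def compact(sstables):
    """Merge sstables into one, keeping only the newest version of each key."""
    merged = {}
    for table in sstables:
        for key, version in table.items():
            if key not in merged or version[1] > merged[key][1]:
                merged[key] = version
    return [merged]


T1, T2 = 1000, 2000
sstables = [
    {"k": ("V", T2)},   # first write: value V at timestamp T2
    {"k": ("V'", T1)},  # later write carrying the older timestamp T1
]
before = read(sstables, "k")    # both versions are read, V wins
compacted = compact(sstables)   # the shadowed version is dropped
after = read(compacted, "k")
print(before, after, compacted)
```

Until compaction runs, both versions sit on disk and every read has to look at both, which is the read and compaction overhead mentioned in this thread.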

I don’t understand the ETL use case and its relevance here. Can you provide 
more details?

UPDATE in Cassandra updates specific rows. All of them are updated, not

Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
Can lightweight txns be used in a batch update?

On Wed, May 13, 2015 at 5:48 PM, Ali Akhtar  wrote:

> The 6k is only the starting value; it's expected to scale up to ~200
> million records.
>
> On Wed, May 13, 2015 at 5:44 PM, Robert Wille  wrote:
>
>>  You could use lightweight transactions to update only if the record is
>> newer. It doesn’t avoid the read, it just happens under the covers, so it’s
>> not really going to be faster compared to a read-before-write pattern
>> (which is an anti-pattern, BTW). It is probably the easiest way to avoid
>> getting a whole bunch of copies of each record.
>>
>>  But even with a read-before-write pattern, I don’t understand why you
>> are worried about 6K records per hour. That’s nothing. You’re probably
>> looking at several milliseconds to do the read and write for each record
>> (depending on your storage, RF and CL), so you’re probably looking at under
>> a minute to do 6K records. If you do them in parallel, you’re probably
>> looking at several seconds. I don’t get why something that probably takes
>> less than a minute that is done once an hour is a problem.
>>
>>  BTW, I wouldn’t do all 6K in parallel. I’d use some kind of limiter
>> (e.g. a semaphore) to ensure that you don’t execute more than X queries at
>> a time.
>>
>>  Robert
>>
>>  On May 13, 2015, at 6:20 AM, Ali Akhtar  wrote:
>>
>>  But your previous email talked about when T1 is different:
>>
>>  > Assume timestamp T1 < T2 and you stored value V with timestamp T2.
>> Then you store V’ with timestamp T1.
>>
>>  What if you issue an update twice, but with the same timestamp? E.g if
>> you ran:
>>
>>  Update  where foo=bar USING TIMESTAMP = 1000
>>
>>  and 1 hour later, you ran exactly the same query again. In this case,
>> the value of T is the same for both queries. Would that still cause
>> multiple values to be stored?
>>
>> On Wed, May 13, 2015 at 5:17 PM, Peer, Oded  wrote:
>>
>>>  It will cause an overhead (compaction and read) as I described in the
>>> previous email.
>>>
>>>
>>>
>>> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> *Sent:* Wednesday, May 13, 2015 3:13 PM
>>>
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Updating only modified records (where lastModified <
>>> current date)
>>>
>>>
>>>
>>> > I don’t understand the ETL use case and its relevance here. Can you
>>> provide more details?
>>>
>>>
>>>
>>> Basically, every 1 hour a job runs which queries an external API and
>>> gets some records. Then, I want to take only new or updated records, and
>>> insert / update them in cassandra. For records that are already in
>>> cassandra and aren't modified, I want to ignore them.
>>>
>>>
>>>
>>> Each record returns a lastModified datetime, I want to use that to
>>> determine whether a record was changed or not (if it was, it'd be updated,
>>> if not, it'd be ignored).
>>>
>>>
>>>
>>> The issue was, I'm having to do a 'select lastModified from table where
>>> id = ?' query for every record, in order to determine if db lastModified <
>>> api lastModified or not. I was wondering if there was a way to avoid that.
>>>
>>>
>>>
>>> If I use 'USING TIMESTAMP', would subsequent updates where lastModified
>>> is a value that was previously used, still create that overhead, or will
>>> they be ignored?
>>>
>>>
>>>
>>> E.g if I issued an update where TIMESTAMP is X, then 1 hour later I
>>> issued another update where TIMESTAMP is still X, will that 2nd update
>>> essentially get ignored, or will it cause any overhead?
>>>
>>>
>>>
>>> On Wed, May 13, 2015 at 5:02 PM, Peer, Oded  wrote:
>>>
>>> USING TIMESTAMP doesn’t avoid compaction overhead.
>>>
>>> When you modify data the value is stored along with a timestamp
>>> indicating the timestamp of the value.
>>>
>>> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then
>>> you store V’ with timestamp T1.
>>>
>>> Now you have two values of V in the DB: <V,T2>, <V’,T1>
>>>
>>> When you read the value of V from the DB you read both <V,T2> and <V’,T1>;
>>> Cassandra resolves the conflict by comparing the timestamps and returns V.
>>>
>>> Compaction will later take care of it and remove <V’,T1> from the DB.
>>>
>>>
>>>
>>> I don’t understand the ETL use case and its relevance here. Can you
>>> provide more details?
>>>
>>>
>>>
>>> UPDATE in Cassandra updates specific rows. All of them are updated,
>>> nothing is ignored.
>>>
>>>
>>>
>>>
>>>
>>> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> *Sent:* Wednesday, May 13, 2015 2:43 PM
>>>
>>>
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Updating only modified records (where lastModified <
>>> current date)
>>>
>>>
>>>
>>> It's rare for an existing record to have changes, but the ETL job runs
>>> every hour, therefore it will send updates each time, regardless of whether
>>> there were changes or not.
>>>
>>>
>>>
>>> (I'm assuming that USING TIMESTAMP here will avoid the compaction
>>> overhead, since that will cause it to not run any updates unless the
>>> timestamp is actually > last update timestam

Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
The 6k is only the starting value; it's expected to scale up to ~200 million
records.

On Wed, May 13, 2015 at 5:44 PM, Robert Wille  wrote:

>  You could use lightweight transactions to update only if the record is
> newer. It doesn’t avoid the read, it just happens under the covers, so it’s
> not really going to be faster compared to a read-before-write pattern
> (which is an anti-pattern, BTW). It is probably the easiest way to avoid
> getting a whole bunch of copies of each record.
>
>  But even with a read-before-write pattern, I don’t understand why you are
> worried about 6K records per hour. That’s nothing. You’re probably looking
> at several milliseconds to do the read and write for each record (depending
> on your storage, RF and CL), so you’re probably looking at under a minute
> to do 6K records. If you do them in parallel, you’re probably looking at
> several seconds. I don’t get why something that probably takes less than a
> minute that is done once an hour is a problem.
>
>  BTW, I wouldn’t do all 6K in parallel. I’d use some kind of limiter
> (e.g. a semaphore) to ensure that you don’t execute more than X queries at
> a time.
>
>  Robert
>
>  On May 13, 2015, at 6:20 AM, Ali Akhtar  wrote:
>
>  But your previous email talked about when T1 is different:
>
>  > Assume timestamp T1 < T2 and you stored value V with timestamp T2.
> Then you store V’ with timestamp T1.
>
>  What if you issue an update twice, but with the same timestamp? E.g if
> you ran:
>
>  Update  where foo=bar USING TIMESTAMP = 1000
>
>  and 1 hour later, you ran exactly the same query again. In this case,
> the value of T is the same for both queries. Would that still cause
> multiple values to be stored?
>
> On Wed, May 13, 2015 at 5:17 PM, Peer, Oded  wrote:
>
>>  It will cause an overhead (compaction and read) as I described in the
>> previous email.
>>
>>
>>
>> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
>> *Sent:* Wednesday, May 13, 2015 3:13 PM
>>
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Updating only modified records (where lastModified <
>> current date)
>>
>>
>>
>> > I don’t understand the ETL use case and its relevance here. Can you
>> provide more details?
>>
>>
>>
>> Basically, every 1 hour a job runs which queries an external API and gets
>> some records. Then, I want to take only new or updated records, and insert
>> / update them in cassandra. For records that are already in cassandra and
>> aren't modified, I want to ignore them.
>>
>>
>>
>> Each record returns a lastModified datetime, I want to use that to
>> determine whether a record was changed or not (if it was, it'd be updated,
>> if not, it'd be ignored).
>>
>>
>>
>> The issue was, I'm having to do a 'select lastModified from table where
>> id = ?' query for every record, in order to determine if db lastModified <
>> api lastModified or not. I was wondering if there was a way to avoid that.
>>
>>
>>
>> If I use 'USING TIMESTAMP', would subsequent updates where lastModified
>> is a value that was previously used, still create that overhead, or will
>> they be ignored?
>>
>>
>>
>> E.g if I issued an update where TIMESTAMP is X, then 1 hour later I
>> issued another update where TIMESTAMP is still X, will that 2nd update
>> essentially get ignored, or will it cause any overhead?
>>
>>
>>
>> On Wed, May 13, 2015 at 5:02 PM, Peer, Oded  wrote:
>>
>> USING TIMESTAMP doesn’t avoid compaction overhead.
>>
>> When you modify data the value is stored along with a timestamp
>> indicating the timestamp of the value.
>>
>> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then
>> you store V’ with timestamp T1.
>>
>> Now you have two values of V in the DB: <V,T2>, <V’,T1>
>>
>> When you read the value of V from the DB you read both <V,T2> and <V’,T1>;
>> Cassandra resolves the conflict by comparing the timestamps and returns V.
>>
>> Compaction will later take care of it and remove <V’,T1> from the DB.
>>
>>
>>
>> I don’t understand the ETL use case and its relevance here. Can you
>> provide more details?
>>
>>
>>
>> UPDATE in Cassandra updates specific rows. All of them are updated,
>> nothing is ignored.
>>
>>
>>
>>
>>
>> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
>> *Sent:* Wednesday, May 13, 2015 2:43 PM
>>
>>
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Updating only modified records (where lastModified <
>> current date)
>>
>>
>>
>> It's rare for an existing record to have changes, but the ETL job runs
>> every hour, therefore it will send updates each time, regardless of whether
>> there were changes or not.
>>
>>
>>
>> (I'm assuming that USING TIMESTAMP here will avoid the compaction
>> overhead, since that will cause it to not run any updates unless the
>> timestamp is actually > last update timestamp?)
>>
>>
>>
>> Also, is there a way to get the number of rows which were updated /
>> ignored?
>>
>>
>>
>> On Wed, May 13, 2015 at 4:37 PM, Peer, Oded  wrote:
>>
>> The cost of issuing an UPDATE that won’t update anything is compaction
>> overhead

Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Robert Wille
You could use lightweight transactions to update only if the record is newer. 
It doesn’t avoid the read, it just happens under the covers, so it’s not really 
going to be faster compared to a read-before-write pattern (which is an 
anti-pattern, BTW). It is probably the easiest way to avoid getting a whole 
bunch of copies of each record.

But even with a read-before-write pattern, I don’t understand why you are 
worried about 6K records per hour. That’s nothing. You’re probably looking at 
several milliseconds to do the read and write for each record (depending on 
your storage, RF and CL), so you’re probably looking at under a minute to do 6K 
records. If you do them in parallel, you’re probably looking at several 
seconds. I don’t get why something that probably takes less than a minute that 
is done once an hour is a problem.

BTW, I wouldn’t do all 6K in parallel. I’d use some kind of limiter (e.g. a 
semaphore) to ensure that you don’t execute more than X queries at a time.

Robert
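
For reference, the comparison step of the read-before-write pattern discussed above could look roughly like the following Python sketch. It assumes the stored lastModified values have already been fetched into a mapping; `records_to_write` and the sample data are hypothetical.

```python
def records_to_write(api_records, stored_last_modified):
    """Pick the API records that are new or newer than what is stored.

    api_records: iterable of dicts with "id" and "last_modified" (epoch millis).
    stored_last_modified: mapping of id -> last_modified already in the database.
    """
    changed = []
    for record in api_records:
        stored = stored_last_modified.get(record["id"])
        if stored is None or stored < record["last_modified"]:
            changed.append(record)
    return changed


# Made-up data: record 1 is unchanged, record 2 was modified, record 3 is new.
stored = {1: 1000, 2: 1000}
fetched = [
    {"id": 1, "last_modified": 1000},
    {"id": 2, "last_modified": 2000},
    {"id": 3, "last_modified": 1500},
]
to_write = records_to_write(fetched, stored)
print([r["id"] for r in to_write])
```

Only the records returned here would then be written, so unchanged records produce no extra copies for compaction to clean up.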

On May 13, 2015, at 6:20 AM, Ali Akhtar wrote:

But your previous email talked about when T1 is different:

> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then you 
> store V’ with timestamp T1.

What if you issue an update twice, but with the same timestamp? E.g if you ran:

Update  where foo=bar USING TIMESTAMP = 1000

and 1 hour later, you ran exactly the same query again. In this case, the value 
of T is the same for both queries. Would that still cause multiple values to be 
stored?

On Wed, May 13, 2015 at 5:17 PM, Peer, Oded wrote:
It will cause an overhead (compaction and read) as I described in the previous 
email.

From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 3:13 PM

To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)


> I don’t understand the ETL use case and its relevance here. Can you provide 
> more details?

Basically, every 1 hour a job runs which queries an external API and gets some 
records. Then, I want to take only new or updated records, and insert / update 
them in cassandra. For records that are already in cassandra and aren't 
modified, I want to ignore them.

Each record returns a lastModified datetime, I want to use that to determine 
whether a record was changed or not (if it was, it'd be updated, if not, it'd 
be ignored).

The issue was, I'm having to do a 'select lastModified from table where id = ?' 
query for every record, in order to determine if db lastModified < api 
lastModified or not. I was wondering if there was a way to avoid that.

If I use 'USING TIMESTAMP', would subsequent updates where lastModified is a 
value that was previously used, still create that overhead, or will they be 
ignored?

E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued 
another update where TIMESTAMP is still X, will that 2nd update essentially get 
ignored, or will it cause any overhead?

On Wed, May 13, 2015 at 5:02 PM, Peer, Oded wrote:
USING TIMESTAMP doesn’t avoid compaction overhead.
When you modify data the value is stored along with a timestamp indicating the 
timestamp of the value.
Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then you 
store V’ with timestamp T1.
Now you have two values of V in the DB: <V,T2>, <V’,T1>
When you read the value of V from the DB you read both <V,T2> and <V’,T1>;
Cassandra resolves the conflict by comparing the timestamps and returns V.
Compaction will later take care of it and remove <V’,T1> from the DB.

I don’t understand the ETL use case and its relevance here. Can you provide 
more details?

UPDATE in Cassandra updates specific rows. All of them are updated, nothing is 
ignored.


From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 2:43 PM

To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)

It's rare for an existing record to have changes, but the ETL job runs every
hour, therefore it will send updates each time, regardless of whether there 
were changes or not.

(I'm assuming that USING TIMESTAMP here will avoid the compaction overhead, 
since that will cause it to not run any updates unless the timestamp is 
actually > last update timestamp?)

Also, is there a way to get the number of rows which were updated / ignored?

On Wed, May 13, 2015 at 4:37 PM, Peer, Oded wrote:
The cost of issuing an UPDATE that won’t update anything is compaction 
overhead. Since you stated it’s rare for rows to be updated then the overhead 
should be negligible.

The easiest way to convert a milliseconds timestamp long value to microseconds 
is to multiply by 1000.
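
As a small illustration of that conversion, here is a Python sketch; the helper names are made up, and the second function assumes a timezone-aware datetime.

```python
from datetime import datetime, timezone


def millis_to_micros(millis: int) -> int:
    """Convert epoch milliseconds to epoch microseconds."""
    return millis * 1000


def datetime_to_micros(moment: datetime) -> int:
    """Epoch microseconds for a timezone-aware datetime, using integer math."""
    delta = moment - datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


last_modified = datetime(2015, 5, 13, 12, 0, 0, tzinfo=timezone.utc)
micros = datetime_to_micros(last_modified)
print(micros)
```

Either result can then be used as the value for `USING TIMESTAMP`, which expects microseconds.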

From: Ali Akhtar [mailto:ali.rac...@gmail.com

Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
But your previous email talked about when T1 is different:

> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then
> you store V’ with timestamp T1.

What if you issue an update twice, but with the same timestamp? E.g if you
ran:

Update  where foo=bar USING TIMESTAMP = 1000

and 1 hour later, you ran exactly the same query again. In this case, the
value of T is the same for both queries. Would that still cause multiple
values to be stored?

On Wed, May 13, 2015 at 5:17 PM, Peer, Oded  wrote:

>  It will cause an overhead (compaction and read) as I described in the
> previous email.
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 3:13 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Updating only modified records (where lastModified <
> current date)
>
>
>
> > I don’t understand the ETL use case and its relevance here. Can you
> provide more details?
>
>
>
> Basically, every 1 hour a job runs which queries an external API and gets
> some records. Then, I want to take only new or updated records, and insert
> / update them in cassandra. For records that are already in cassandra and
> aren't modified, I want to ignore them.
>
>
>
> Each record returns a lastModified datetime, I want to use that to
> determine whether a record was changed or not (if it was, it'd be updated,
> if not, it'd be ignored).
>
>
>
> The issue was, I'm having to do a 'select lastModified from table where id
> = ?' query for every record, in order to determine if db lastModified < api
> lastModified or not. I was wondering if there was a way to avoid that.
>
>
>
> If I use 'USING TIMESTAMP', would subsequent updates where lastModified is
> a value that was previously used, still create that overhead, or will they
> be ignored?
>
>
>
> E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued
> another update where TIMESTAMP is still X, will that 2nd update essentially
> get ignored, or will it cause any overhead?
>
>
>
> On Wed, May 13, 2015 at 5:02 PM, Peer, Oded  wrote:
>
> USING TIMESTAMP doesn’t avoid compaction overhead.
>
> When you modify data the value is stored along with a timestamp indicating
> the timestamp of the value.
>
> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then
> you store V’ with timestamp T1.
>
> Now you have two values of V in the DB: (V, T2), (V’, T1)
>
> When you read the value of V from the DB you read both (V, T2), (V’, T1),
> Cassandra resolves the conflict by comparing the timestamp and returns V.
>
> Compaction will later take care and remove (V’, T1) from the DB.
>
>
>
> I don’t understand the ETL use case and its relevance here. Can you
> provide more details?
>
>
>
> UPDATE in Cassandra updates specific rows. All of them are updated,
> nothing is ignored.
>
>
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 2:43 PM
>
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Updating only modified records (where lastModified <
> current date)
>
>
>
> Its rare for an existing record to have changes, but the etl job runs
> every hour, therefore it will send updates each time, regardless of whether
> there were changes or not.
>
>
>
> (I'm assuming that USING TIMESTAMP here will avoid the compaction
> overhead, since that will cause it to not run any updates unless the
> timestamp is actually > last update timestamp?)
>
>
>
> Also, is there a way to get the number of rows which were updated /
> ignored?
>
>
>
> On Wed, May 13, 2015 at 4:37 PM, Peer, Oded  wrote:
>
> The cost of issuing an UPDATE that won’t update anything is compaction
> overhead. Since you stated it’s rare for rows to be updated then the
> overhead should be negligible.
>
>
>
> The easiest way to convert a milliseconds timestamp long value to
> microseconds is to multiply by 1000.
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 2:15 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Updating only modified records (where lastModified <
> current date)
>
>
>
> Would TimeUnit.MILLISECONDS.toMicros(  myDate.getTime() ) work for
> producing the microsecond timestamp ?
>
>
>
> On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar  wrote:
>
> If specifying 'using' timestamp, the docs say to provide microseconds, but
> where are these microseconds obtained from? I have regular java.util.Date
> objects, I can get the time in milliseconds (i.e the unix timestamp), how
> would I convert that to microseconds?
>
>
>
> On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar  wrote:
>
> Thanks Peter, that's interesting. I didn't know of that option.
>
>
>
> If updates don't create tombstones (and i'm already taking pains to ensure
> no nulls are present in queries), then is there no cost to just submitting
> an update for everything regardless of whether lastModified has changed?
>
>
>
> Thanks.
>
>
>
> On Wed, May 13, 2015 at 3:38 PM, Peer, Oded  wrote:
>
> You can use the “last modified” value as the TIMESTAMP f

RE: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Peer, Oded
It will cause an overhead (compaction and read) as I described in the previous 
email.

From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 3:13 PM
To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)

> I don’t understand the ETL use case and its relevance here. Can you provide 
> more details?

Basically, every 1 hour a job runs which queries an external API and gets some 
records. Then, I want to take only new or updated records, and insert / update 
them in cassandra. For records that are already in cassandra and aren't 
modified, I want to ignore them.

Each record returns a lastModified datetime, I want to use that to determine 
whether a record was changed or not (if it was, it'd be updated, if not, it'd 
be ignored).

The issue was, I'm having to do a 'select lastModified from table where id = ?' 
query for every record, in order to determine if db lastModified < api 
lastModified or not. I was wondering if there was a way to avoid that.

If I use 'USING TIMESTAMP', would subsequent updates where lastModified is a 
value that was previously used, still create that overhead, or will they be 
ignored?

E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued 
another update where TIMESTAMP is still X, will that 2nd update essentially get 
ignored, or will it cause any overhead?

On Wed, May 13, 2015 at 5:02 PM, Peer, Oded wrote:
USING TIMESTAMP doesn’t avoid compaction overhead.
When you modify data the value is stored along with a timestamp indicating the 
timestamp of the value.
Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then you 
store V’ with timestamp T1.
Now you have two values of V in the DB: (V, T2), (V’, T1)
When you read the value of V from the DB you read both (V, T2), (V’, T1),
Cassandra resolves the conflict by comparing the timestamp and returns V.
Compaction will later take care and remove (V’, T1) from the DB.

I don’t understand the ETL use case and its relevance here. Can you provide 
more details?

UPDATE in Cassandra updates specific rows. All of them are updated, nothing is 
ignored.


From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 2:43 PM

To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)

Its rare for an existing record to have changes, but the etl job runs every 
hour, therefore it will send updates each time, regardless of whether there 
were changes or not.

(I'm assuming that USING TIMESTAMP here will avoid the compaction overhead, 
since that will cause it to not run any updates unless the timestamp is 
actually > last update timestamp?)

Also, is there a way to get the number of rows which were updated / ignored?

On Wed, May 13, 2015 at 4:37 PM, Peer, Oded wrote:
The cost of issuing an UPDATE that won’t update anything is compaction 
overhead. Since you stated it’s rare for rows to be updated then the overhead 
should be negligible.

The easiest way to convert a milliseconds timestamp long value to microseconds 
is to multiply by 1000.

From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 2:15 PM
To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)

Would TimeUnit.MILLISECONDS.toMicros(  myDate.getTime() ) work for producing 
the microsecond timestamp ?

On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar wrote:
If specifying 'using' timestamp, the docs say to provide microseconds, but 
where are these microseconds obtained from? I have regular java.util.Date 
objects, I can get the time in milliseconds (i.e the unix timestamp), how would 
I convert that to microseconds?

On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar wrote:
Thanks Peter, that's interesting. I didn't know of that option.

If updates don't create tombstones (and i'm already taking pains to ensure no 
nulls are present in queries), then is there no cost to just submitting an 
update for everything regardless of whether lastModified has changed?

Thanks.

On Wed, May 13, 2015 at 3:38 PM, Peer, Oded wrote:
You can use the “last modified” value as the TIMESTAMP for your UPDATE 
operation.
This way the values will only be updated if lastModified date > the 
lastModified you have in the DB.

Updates to values don’t create tombstones. Only deletes (either by executing 
delete, inserting a null value or by setting a TTL) create tombstones.


From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 1:27 PM
To: user@cassandra.apache.org
Subject: Updating only modified records (where lastModified < current date)

I'm running 

Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
> I don’t understand the ETL use case and its relevance here. Can you
> provide more details?

Basically, a job runs every hour, queries an external API, and gets some
records. I then want to take only the new or updated records and insert /
update them in Cassandra. Records that are already in Cassandra and have
not been modified should be ignored.

Each record comes with a lastModified datetime. I want to use that to
determine whether a record was changed or not (if it was, it'd be updated;
if not, it'd be ignored).

The issue was, I'm having to do a 'select lastModified from table where id
= ?' query for every record, in order to determine if db lastModified < api
lastModified or not. I was wondering if there was a way to avoid that.

If I use 'USING TIMESTAMP', would subsequent updates where lastModified is
a value that was previously used, still create that overhead, or will they
be ignored?

E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued
another update where TIMESTAMP is still X, will that 2nd update essentially
get ignored, or will it cause any overhead?

On Wed, May 13, 2015 at 5:02 PM, Peer, Oded  wrote:

>  USING TIMESTAMP doesn’t avoid compaction overhead.
>
> When you modify data the value is stored along with a timestamp indicating
> the timestamp of the value.
>
> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then
> you store V’ with timestamp T1.
>
> Now you have two values of V in the DB: (V, T2), (V’, T1)
>
> When you read the value of V from the DB you read both (V, T2), (V’, T1),
> Cassandra resolves the conflict by comparing the timestamp and returns V.
>
> Compaction will later take care and remove (V’, T1) from the DB.
>
>
>
> I don’t understand the ETL use case and its relevance here. Can you
> provide more details?
>
>
>
> UPDATE in Cassandra updates specific rows. All of them are updated,
> nothing is ignored.
>
>
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 2:43 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Updating only modified records (where lastModified <
> current date)
>
>
>
> Its rare for an existing record to have changes, but the etl job runs
> every hour, therefore it will send updates each time, regardless of whether
> there were changes or not.
>
>
>
> (I'm assuming that USING TIMESTAMP here will avoid the compaction
> overhead, since that will cause it to not run any updates unless the
> timestamp is actually > last update timestamp?)
>
>
>
> Also, is there a way to get the number of rows which were updated /
> ignored?
>
>
>
> On Wed, May 13, 2015 at 4:37 PM, Peer, Oded  wrote:
>
> The cost of issuing an UPDATE that won’t update anything is compaction
> overhead. Since you stated it’s rare for rows to be updated then the
> overhead should be negligible.
>
>
>
> The easiest way to convert a milliseconds timestamp long value to
> microseconds is to multiply by 1000.
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 2:15 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Updating only modified records (where lastModified <
> current date)
>
>
>
> Would TimeUnit.MILLISECONDS.toMicros(  myDate.getTime() ) work for
> producing the microsecond timestamp ?
>
>
>
> On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar  wrote:
>
> If specifying 'using' timestamp, the docs say to provide microseconds, but
> where are these microseconds obtained from? I have regular java.util.Date
> objects, I can get the time in milliseconds (i.e the unix timestamp), how
> would I convert that to microseconds?
>
>
>
> On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar  wrote:
>
> Thanks Peter, that's interesting. I didn't know of that option.
>
>
>
> If updates don't create tombstones (and i'm already taking pains to ensure
> no nulls are present in queries), then is there no cost to just submitting
> an update for everything regardless of whether lastModified has changed?
>
>
>
> Thanks.
>
>
>
> On Wed, May 13, 2015 at 3:38 PM, Peer, Oded  wrote:
>
> You can use the “last modified” value as the TIMESTAMP for your UPDATE
> operation.
>
> This way the values will only be updated if lastModified date > the
> lastModified you have in the DB.
>
>
>
> Updates to values don’t create tombstones. Only deletes (either by
> executing delete, inserting a null value or by setting a TTL) create
> tombstones.
>
>
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 1:27 PM
> *To:* user@cassandra.apache.org
> *Subject:* Updating only modified records (where lastModified < current
> date)
>
>
>
> I'm running some ETL jobs, where the pattern is the following:
>
>
>
> 1- Get some records from an external API,
>
>
>
> 2- For each record, see if its lastModified date > the lastModified i have
> in db (or if I don't have that record in db)
>
>
>
> 3- If lastModified < dbLastModified, the item wasn't changed, ignore it.
> Otherwise, run an update query and update that record.
>
>
>
> (

RE: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Peer, Oded
USING TIMESTAMP doesn’t avoid compaction overhead.
When you modify data the value is stored along with a timestamp indicating the 
timestamp of the value.
Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then you 
store V’ with timestamp T1.
Now you have two values of V in the DB: (V, T2), (V’, T1)
When you read the value of V from the DB you read both (V, T2), (V’, T1),
Cassandra resolves the conflict by comparing the timestamp and returns V.
Compaction will later take care and remove (V’, T1) from the DB.
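
The behaviour described above can be pictured with a toy model. This is only a sketch under simplifying assumptions, not Cassandra code: the class LastWriteWins and its methods are invented for illustration, one cell is modelled as an in-memory list of versions, and ties between equal timestamps are not modelled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class LastWriteWins {
    // One stored version of a cell: a value plus its write timestamp (microseconds).
    static final class Version {
        final String value;
        final long timestampMicros;

        Version(String value, long timestampMicros) {
            this.value = value;
            this.timestampMicros = timestampMicros;
        }
    }

    private static final Comparator<Version> BY_TIMESTAMP =
            Comparator.comparingLong((Version v) -> v.timestampMicros);

    private final List<Version> stored = new ArrayList<>();

    // Every write is appended; nothing is compared or rejected at write time.
    void write(String value, long timestampMicros) {
        stored.add(new Version(value, timestampMicros));
    }

    // A read sees all stored versions and returns the one with the highest timestamp.
    String read() {
        return stored.isEmpty() ? null : Collections.max(stored, BY_TIMESTAMP).value;
    }

    // Compaction discards the superseded versions and keeps only the winner.
    void compact() {
        if (stored.isEmpty()) {
            return;
        }
        Version winner = Collections.max(stored, BY_TIMESTAMP);
        stored.clear();
        stored.add(winner);
    }

    int storedVersions() {
        return stored.size();
    }
}
```

Writing V at T2 and then V’ at T1 leaves two stored versions, a read still returns V, and only compact() brings the count back to one. That is the overhead in question: the older write is not rejected, it is stored and later cleaned up.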

I don’t understand the ETL use case and its relevance here. Can you provide 
more details?

UPDATE in Cassandra updates specific rows. All of them are updated, nothing is 
ignored.


From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 2:43 PM
To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)

Its rare for an existing record to have changes, but the etl job runs every 
hour, therefore it will send updates each time, regardless of whether there 
were changes or not.

(I'm assuming that USING TIMESTAMP here will avoid the compaction overhead, 
since that will cause it to not run any updates unless the timestamp is 
actually > last update timestamp?)

Also, is there a way to get the number of rows which were updated / ignored?

On Wed, May 13, 2015 at 4:37 PM, Peer, Oded wrote:
The cost of issuing an UPDATE that won’t update anything is compaction 
overhead. Since you stated it’s rare for rows to be updated then the overhead 
should be negligible.

The easiest way to convert a milliseconds timestamp long value to microseconds 
is to multiply by 1000.

From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 2:15 PM
To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)

Would TimeUnit.MILLISECONDS.toMicros(  myDate.getTime() ) work for producing 
the microsecond timestamp ?

On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar wrote:
If specifying 'using' timestamp, the docs say to provide microseconds, but 
where are these microseconds obtained from? I have regular java.util.Date 
objects, I can get the time in milliseconds (i.e the unix timestamp), how would 
I convert that to microseconds?

On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar wrote:
Thanks Peter, that's interesting. I didn't know of that option.

If updates don't create tombstones (and i'm already taking pains to ensure no 
nulls are present in queries), then is there no cost to just submitting an 
update for everything regardless of whether lastModified has changed?

Thanks.

On Wed, May 13, 2015 at 3:38 PM, Peer, Oded wrote:
You can use the “last modified” value as the TIMESTAMP for your UPDATE 
operation.
This way the values will only be updated if lastModified date > the 
lastModified you have in the DB.

Updates to values don’t create tombstones. Only deletes (either by executing 
delete, inserting a null value or by setting a TTL) create tombstones.


From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 1:27 PM
To: user@cassandra.apache.org
Subject: Updating only modified records (where lastModified < current date)

I'm running some ETL jobs, where the pattern is the following:

1- Get some records from an external API,

2- For each record, see if its lastModified date > the lastModified i have in 
db (or if I don't have that record in db)

3- If lastModified < dbLastModified, the item wasn't changed, ignore it. 
Otherwise, run an update query and update that record.

(It is rare for existing records to get updated, so I'm not that concerned 
about tombstones).

The problem however is, since I have to query each record's lastModified, one 
at a time, that's adding a major bottleneck to my job.

E.g if I have 6k records, I have to run a total of 6k 'select lastModified from 
myTable where id = ?' queries.

Is there a better way, am I doing anything wrong, etc? Any suggestions would be 
appreciated.

Thanks.






Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
It's rare for an existing record to have changes, but the ETL job runs every
hour, so it will send updates each time, regardless of whether there
were changes or not.

(I'm assuming that USING TIMESTAMP here will avoid the compaction overhead,
since that will cause it to not run any updates unless the timestamp is
actually > last update timestamp?)

Also, is there a way to get the number of rows which were updated / ignored?

On Wed, May 13, 2015 at 4:37 PM, Peer, Oded  wrote:

>  The cost of issuing an UPDATE that won’t update anything is compaction
> overhead. Since you stated it’s rare for rows to be updated then the
> overhead should be negligible.
>
>
>
> The easiest way to convert a milliseconds timestamp long value to
> microseconds is to multiply by 1000.
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 2:15 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Updating only modified records (where lastModified <
> current date)
>
>
>
> Would TimeUnit.MILLISECONDS.toMicros(  myDate.getTime() ) work for
> producing the microsecond timestamp ?
>
>
>
> On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar  wrote:
>
> If specifying 'using' timestamp, the docs say to provide microseconds, but
> where are these microseconds obtained from? I have regular java.util.Date
> objects, I can get the time in milliseconds (i.e the unix timestamp), how
> would I convert that to microseconds?
>
>
>
> On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar  wrote:
>
> Thanks Peter, that's interesting. I didn't know of that option.
>
>
>
> If updates don't create tombstones (and i'm already taking pains to ensure
> no nulls are present in queries), then is there no cost to just submitting
> an update for everything regardless of whether lastModified has changed?
>
>
>
> Thanks.
>
>
>
> On Wed, May 13, 2015 at 3:38 PM, Peer, Oded  wrote:
>
> You can use the “last modified” value as the TIMESTAMP for your UPDATE
> operation.
>
> This way the values will only be updated if lastModified date > the
> lastModified you have in the DB.
>
>
>
> Updates to values don’t create tombstones. Only deletes (either by
> executing delete, inserting a null value or by setting a TTL) create
> tombstones.
>
>
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 1:27 PM
> *To:* user@cassandra.apache.org
> *Subject:* Updating only modified records (where lastModified < current
> date)
>
>
>
> I'm running some ETL jobs, where the pattern is the following:
>
>
>
> 1- Get some records from an external API,
>
>
>
> 2- For each record, see if its lastModified date > the lastModified i have
> in db (or if I don't have that record in db)
>
>
>
> 3- If lastModified < dbLastModified, the item wasn't changed, ignore it.
> Otherwise, run an update query and update that record.
>
>
>
> (It is rare for existing records to get updated, so I'm not that concerned
> about tombstones).
>
>
>
> The problem however is, since I have to query each record's lastModified,
> one at a time, that's adding a major bottleneck to my job.
>
>
>
> E.g if I have 6k records, I have to run a total of 6k 'select lastModified
> from myTable where id = ?' queries.
>
>
>
> Is there a better way, am I doing anything wrong, etc? Any suggestions
> would be appreciated.
>
>
>
> Thanks.
>
>
>
>
>
>
>


RE: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Peer, Oded
The cost of issuing an UPDATE that won’t update anything is compaction 
overhead. Since you stated it’s rare for rows to be updated then the overhead 
should be negligible.

The easiest way to convert a milliseconds timestamp long value to microseconds 
is to multiply by 1000.
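
A minimal sketch of that conversion for a java.util.Date, using the standard TimeUnit helper mentioned in this thread (it performs the same multiplication by 1000). The class and method names are made up for illustration, and the example instant is arbitrary.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

final class WriteTimestamps {
    private WriteTimestamps() {}

    // Converts a Date (millisecond precision) to microseconds since the Unix epoch,
    // the unit expected by USING TIMESTAMP.
    static long toMicros(Date lastModified) {
        return TimeUnit.MILLISECONDS.toMicros(lastModified.getTime());
    }

    public static void main(String[] args) {
        Date lastModified = new Date(1431512100000L); // arbitrary example instant
        System.out.println(toMicros(lastModified));
    }
}
```

The result carries no extra precision: the last three digits are always zero, because a Date only resolves to milliseconds.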

From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 2:15 PM
To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)

Would TimeUnit.MILLISECONDS.toMicros(  myDate.getTime() ) work for producing 
the microsecond timestamp ?

On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar wrote:
If specifying 'using' timestamp, the docs say to provide microseconds, but 
where are these microseconds obtained from? I have regular java.util.Date 
objects, I can get the time in milliseconds (i.e the unix timestamp), how would 
I convert that to microseconds?

On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar wrote:
Thanks Peter, that's interesting. I didn't know of that option.

If updates don't create tombstones (and i'm already taking pains to ensure no 
nulls are present in queries), then is there no cost to just submitting an 
update for everything regardless of whether lastModified has changed?

Thanks.

On Wed, May 13, 2015 at 3:38 PM, Peer, Oded wrote:
You can use the “last modified” value as the TIMESTAMP for your UPDATE 
operation.
This way the values will only be updated if lastModified date > the 
lastModified you have in the DB.

Updates to values don’t create tombstones. Only deletes (either by executing 
delete, inserting a null value or by setting a TTL) create tombstones.


From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 1:27 PM
To: user@cassandra.apache.org
Subject: Updating only modified records (where lastModified < current date)

I'm running some ETL jobs, where the pattern is the following:

1- Get some records from an external API,

2- For each record, see if its lastModified date > the lastModified i have in 
db (or if I don't have that record in db)

3- If lastModified < dbLastModified, the item wasn't changed, ignore it. 
Otherwise, run an update query and update that record.

(It is rare for existing records to get updated, so I'm not that concerned 
about tombstones).

The problem however is, since I have to query each record's lastModified, one 
at a time, that's adding a major bottleneck to my job.

E.g if I have 6k records, I have to run a total of 6k 'select lastModified from 
myTable where id = ?' queries.

Is there a better way, am I doing anything wrong, etc? Any suggestions would be 
appreciated.

Thanks.





Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
Is there a way in the Java driver to get the number of rows that an update
was applied to?

On Wed, May 13, 2015 at 4:33 PM, Ali Akhtar  wrote:

> Thanks. So supplying the timestamp with the update (via using) should fix
> that, right? (By skipping updates where lastModified < dbLastModified).
>
> I'm currently doing TimeUnit.MILLISECONDS.toMicros(  myDate.getTime() )
> and that has worked for inserts, however how do I verify that future
> updates are ignored and aren't run again?
>
> On Wed, May 13, 2015 at 4:29 PM, Ken Hancock 
> wrote:
>
>> While updates don't create tombstones, overwrites create a similar
>> performance penalty at the read phase.  That key will need to be fetched
>> from every SSTable where it resides so the "most recent" column can be
>> returned.
>>
>>
>>
>>
>> On Wed, May 13, 2015 at 6:38 AM, Peer, Oded  wrote:
>>
>>>  You can use the “last modified” value as the TIMESTAMP for your UPDATE
>>> operation.
>>>
>>> This way the values will only be updated if lastModified date > the
>>> lastModified you have in the DB.
>>>
>>>
>>>
>>> Updates to values don’t create tombstones. Only deletes (either by
>>> executing delete, inserting a null value or by setting a TTL) create
>>> tombstones.
>>>
>>>
>>>
>>>
>>>
>>> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> *Sent:* Wednesday, May 13, 2015 1:27 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Updating only modified records (where lastModified < current
>>> date)
>>>
>>>
>>>
>>> I'm running some ETL jobs, where the pattern is the following:
>>>
>>>
>>>
>>> 1- Get some records from an external API,
>>>
>>>
>>>
>>> 2- For each record, see if its lastModified date > the lastModified i
>>> have in db (or if I don't have that record in db)
>>>
>>>
>>>
>>> 3- If lastModified < dbLastModified, the item wasn't changed, ignore it.
>>> Otherwise, run an update query and update that record.
>>>
>>>
>>>
>>> (It is rare for existing records to get updated, so I'm not that
>>> concerned about tombstones).
>>>
>>>
>>>
>>> The problem however is, since I have to query each record's
>>> lastModified, one at a time, that's adding a major bottleneck to my job.
>>>
>>>
>>>
>>> E.g if I have 6k records, I have to run a total of 6k 'select
>>> lastModified from myTable where id = ?' queries.
>>>
>>>
>>>
>>> Is there a better way, am I doing anything wrong, etc? Any suggestions
>>> would be appreciated.
>>>
>>>
>>>
>>> Thanks.
>>>
>>
>>
>>
>>
>>
>>
>


Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
Thanks. So supplying the timestamp with the update (via USING TIMESTAMP)
should fix that, right? (By skipping updates where lastModified <
dbLastModified).

I'm currently doing TimeUnit.MILLISECONDS.toMicros(myDate.getTime()) and
that has worked for inserts. However, how do I verify that future updates
are ignored and aren't run again?

On Wed, May 13, 2015 at 4:29 PM, Ken Hancock wrote:

> While updates don't create tombstones, overwrites create a similar
> performance penalty at the read phase.  That key will need to be fetched
> from every SSTable where it resides so the "most recent" column can be
> returned.
>
>
>
>
> On Wed, May 13, 2015 at 6:38 AM, Peer, Oded  wrote:
>
>>  You can use the “last modified” value as the TIMESTAMP for your UPDATE
>> operation.
>>
>> This way the values will only be updated if lastModified date > the
>> lastModified you have in the DB.
>>
>>
>>
>> Updates to values don’t create tombstones. Only deletes (either by
>> executing delete, inserting a null value or by setting a TTL) create
>> tombstones.
>>
>>
>>
>>
>>
>> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
>> *Sent:* Wednesday, May 13, 2015 1:27 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Updating only modified records (where lastModified < current
>> date)
>>
>>
>>
>> I'm running some ETL jobs, where the pattern is the following:
>>
>>
>>
>> 1- Get some records from an external API,
>>
>>
>>
>> 2- For each record, see if its lastModified date > the lastModified i
>> have in db (or if I don't have that record in db)
>>
>>
>>
>> 3- If lastModified < dbLastModified, the item wasn't changed, ignore it.
>> Otherwise, run an update query and update that record.
>>
>>
>>
>> (It is rare for existing records to get updated, so I'm not that
>> concerned about tombstones).
>>
>>
>>
>> The problem however is, since I have to query each record's lastModified,
>> one at a time, that's adding a major bottleneck to my job.
>>
>>
>>
>> E.g if I have 6k records, I have to run a total of 6k 'select
>> lastModified from myTable where id = ?' queries.
>>
>>
>>
>> Is there a better way, am I doing anything wrong, etc? Any suggestions
>> would be appreciated.
>>
>>
>>
>> Thanks.
>>
>
>
>
>
>
>


Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ken Hancock
While updates don't create tombstones, overwrites create a similar
performance penalty at the read phase.  That key will need to be fetched
from every SSTable where it resides so the "most recent" column can be
returned.
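
To make the read-side cost concrete, here is a rough sketch, not Cassandra internals: each in-memory map stands in for one SSTable, every write is flushed to its own map, and a read has to look at every map that holds the key before it can pick the newest value. All names are invented for illustration, and real optimisations for skipping SSTables are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SSTableReadSketch {
    // A stored column value with its write timestamp in microseconds.
    static final class Cell {
        final String value;
        final long timestampMicros;

        Cell(String value, long timestampMicros) {
            this.value = value;
            this.timestampMicros = timestampMicros;
        }
    }

    // Each map stands in for one immutable SSTable (key -> cell).
    private final List<Map<String, Cell>> sstables = new ArrayList<>();

    // Simulates flushing a single write into its own SSTable.
    void flush(String key, String value, long timestampMicros) {
        Map<String, Cell> sstable = new HashMap<>();
        sstable.put(key, new Cell(value, timestampMicros));
        sstables.add(sstable);
    }

    // Number of SSTables a read of this key has to consult.
    int sstablesHolding(String key) {
        int count = 0;
        for (Map<String, Cell> sstable : sstables) {
            if (sstable.containsKey(key)) {
                count++;
            }
        }
        return count;
    }

    // Reads the key from every SSTable that holds it and keeps the newest cell.
    String read(String key) {
        Cell newest = null;
        for (Map<String, Cell> sstable : sstables) {
            Cell candidate = sstable.get(key);
            if (candidate != null
                    && (newest == null || candidate.timestampMicros > newest.timestampMicros)) {
                newest = candidate;
            }
        }
        return newest == null ? null : newest.value;
    }
}
```

Overwriting the same key every hour keeps growing sstablesHolding(key) in this model until something merges the maps, which is the role compaction plays.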



On Wed, May 13, 2015 at 6:38 AM, Peer, Oded  wrote:

>  You can use the “last modified” value as the TIMESTAMP for your UPDATE
> operation.
>
> This way the values will only be updated if lastModified date > the
> lastModified you have in the DB.
>
>
>
> Updates to values don’t create tombstones. Only deletes (either by
> executing delete, inserting a null value or by setting a TTL) create
> tombstones.
>
>
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 1:27 PM
> *To:* user@cassandra.apache.org
> *Subject:* Updating only modified records (where lastModified < current
> date)
>
>
>
> I'm running some ETL jobs, where the pattern is the following:
>
>
>
> 1- Get some records from an external API,
>
>
>
> 2- For each record, see if its lastModified date > the lastModified i have
> in db (or if I don't have that record in db)
>
>
>
> 3- If lastModified < dbLastModified, the item wasn't changed, ignore it.
> Otherwise, run an update query and update that record.
>
>
>
> (It is rare for existing records to get updated, so I'm not that concerned
> about tombstones).
>
>
>
> The problem however is, since I have to query each record's lastModified,
> one at a time, that's adding a major bottleneck to my job.
>
>
>
> E.g if I have 6k records, I have to run a total of 6k 'select lastModified
> from myTable where id = ?' queries.
>
>
>
> Is there a better way, am I doing anything wrong, etc? Any suggestions
> would be appreciated.
>
>
>
> Thanks.
>


Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
Would TimeUnit.MILLISECONDS.toMicros(myDate.getTime()) work for
producing the microsecond timestamp?

On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar  wrote:

> If specifying 'using' timestamp, the docs say to provide microseconds, but
> where are these microseconds obtained from? I have regular java.util.Date
> objects, I can get the time in milliseconds (i.e the unix timestamp), how
> would I convert that to microseconds?
>
> On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar  wrote:
>
>> Thanks Peter, that's interesting. I didn't know of that option.
>>
>> If updates don't create tombstones (and i'm already taking pains to
>> ensure no nulls are present in queries), then is there no cost to just
>> submitting an update for everything regardless of whether lastModified has
>> changed?
>>
>> Thanks.
>>
>> On Wed, May 13, 2015 at 3:38 PM, Peer, Oded  wrote:
>>
>>>  You can use the “last modified” value as the TIMESTAMP for your UPDATE
>>> operation.
>>>
>>> This way the values will only be updated if lastModified date > the
>>> lastModified you have in the DB.
>>>
>>>
>>>
>>> Updates to values don’t create tombstones. Only deletes (either by
>>> executing delete, inserting a null value or by setting a TTL) create
>>> tombstones.
>>>
>>>
>>>
>>>
>>>
>>> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> *Sent:* Wednesday, May 13, 2015 1:27 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Updating only modified records (where lastModified < current
>>> date)
>>>
>>>
>>>
>>> I'm running some ETL jobs, where the pattern is the following:
>>>
>>>
>>>
>>> 1- Get some records from an external API,
>>>
>>>
>>>
>>> 2- For each record, see if its lastModified date > the lastModified i
>>> have in db (or if I don't have that record in db)
>>>
>>>
>>>
>>> 3- If lastModified < dbLastModified, the item wasn't changed, ignore it.
>>> Otherwise, run an update query and update that record.
>>>
>>>
>>>
>>> (It is rare for existing records to get updated, so I'm not that
>>> concerned about tombstones).
>>>
>>>
>>>
>>> The problem however is, since I have to query each record's
>>> lastModified, one at a time, that's adding a major bottleneck to my job.
>>>
>>>
>>>
>>> E.g if I have 6k records, I have to run a total of 6k 'select
>>> lastModified from myTable where id = ?' queries.
>>>
>>>
>>>
>>> Is there a better way, am I doing anything wrong, etc? Any suggestions
>>> would be appreciated.
>>>
>>>
>>>
>>> Thanks.
>>>
>>
>>
>


Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
When specifying USING TIMESTAMP, the docs say to provide microseconds, but
where do these microseconds come from? I have regular java.util.Date
objects and can get the time in milliseconds (i.e., the Unix timestamp in
milliseconds). How would I convert that to microseconds?

On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar  wrote:

> Thanks Peter, that's interesting. I didn't know of that option.
>
> If updates don't create tombstones (and i'm already taking pains to ensure
> no nulls are present in queries), then is there no cost to just submitting
> an update for everything regardless of whether lastModified has changed?
>
> Thanks.
>
> On Wed, May 13, 2015 at 3:38 PM, Peer, Oded  wrote:
>
>>  You can use the “last modified” value as the TIMESTAMP for your UPDATE
>> operation.
>>
>> This way the values will only be updated if lastModified date > the
>> lastModified you have in the DB.
>>
>>
>>
>> Updates to values don’t create tombstones. Only deletes (either by
>> executing delete, inserting a null value or by setting a TTL) create
>> tombstones.
>>
>>
>>
>>
>>
>> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
>> *Sent:* Wednesday, May 13, 2015 1:27 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Updating only modified records (where lastModified < current
>> date)
>>
>>
>>
>> I'm running some ETL jobs, where the pattern is the following:
>>
>>
>>
>> 1- Get some records from an external API,
>>
>>
>>
>> 2- For each record, see if its lastModified date > the lastModified i
>> have in db (or if I don't have that record in db)
>>
>>
>>
>> 3- If lastModified < dbLastModified, the item wasn't changed, ignore it.
>> Otherwise, run an update query and update that record.
>>
>>
>>
>> (It is rare for existing records to get updated, so I'm not that
>> concerned about tombstones).
>>
>>
>>
>> The problem however is, since I have to query each record's lastModified,
>> one at a time, that's adding a major bottleneck to my job.
>>
>>
>>
>> E.g if I have 6k records, I have to run a total of 6k 'select
>> lastModified from myTable where id = ?' queries.
>>
>>
>>
>> Is there a better way, am I doing anything wrong, etc? Any suggestions
>> would be appreciated.
>>
>>
>>
>> Thanks.
>>
>
>


Re: Insert Vs Updates - Both create tombstones

2015-05-13 Thread Ali Akhtar
Sorry, wrong thread. Please disregard my previous message.

On Wed, May 13, 2015 at 4:08 PM, Ali Akhtar  wrote:

> If specifying 'using' timestamp, the docs say to provide microseconds, but
> where are these microseconds obtained from? I have regular java.util.Date
> objects, I can get the time in milliseconds (i.e the unix timestamp), how
> would I convert that to microseconds?
>
> On Wed, May 13, 2015 at 3:45 PM, Peer, Oded  wrote:
>
>>  Under the assumption that when you update the columns you also update
>> the TTL for the columns then a tombstone won’t be created for those columns.
>>
>> Remember that TTL is set on columns (or “cells”), not on rows, so your
>> description of updating a row is slightly misleading. If every query
>> updates different columns then different columns might expire at different
>> times.
>>
>>
>>
>> *From:* Walsh, Stephen [mailto:stephen.wa...@aspect.com]
>> *Sent:* Wednesday, May 13, 2015 1:35 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Insert Vs Updates - Both create tombstones
>>
>>
>>
>> Quick Question,
>>
>>
>>
>> Our team is under much debate, we are trying to find out if an Update on
>> a row with a TTL will create a tombstone.
>>
>>
>>
>> E.G
>>
>>
>>
>> We have one row with a TTL, if we keep “updating” that row before the TTL
>> is hit, will a tombstone be created.
>>
>> I believe it will, but want to confirm.
>>
>>
>>
>> So if that’s is  true,
>>
>> And if our TTL is 10 seconds and we “update” the row every second, will
>> 10 tombstones be created after 10 seconds? Or just 1?
>>
>> (and does the same apply for “insert”)
>>
>>
>>
>> Regards
>>
>> Stephen Walsh
>>
>>
>>
>>
>>
>> This email (including any attachments) is proprietary to Aspect Software,
>> Inc. and may contain information that is confidential. If you have received
>> this message in error, please do not read, copy or forward this message.
>> Please notify the sender immediately, delete it from your system and
>> destroy any copies. You may not further disclose or distribute this email
>> or its attachments.
>>
>
>


Re: Insert Vs Updates - Both create tombstones

2015-05-13 Thread Ali Akhtar
If specifying 'using' timestamp, the docs say to provide microseconds, but
where are these microseconds obtained from? I have regular java.util.Date
objects, I can get the time in milliseconds (i.e the unix timestamp), how
would I convert that to microseconds?

On Wed, May 13, 2015 at 3:45 PM, Peer, Oded  wrote:

>  Under the assumption that when you update the columns you also update
> the TTL for the columns then a tombstone won’t be created for those columns.
>
> Remember that TTL is set on columns (or “cells”), not on rows, so your
> description of updating a row is slightly misleading. If every query
> updates different columns then different columns might expire at different
> times.
>
>
>
> *From:* Walsh, Stephen [mailto:stephen.wa...@aspect.com]
> *Sent:* Wednesday, May 13, 2015 1:35 PM
> *To:* user@cassandra.apache.org
> *Subject:* Insert Vs Updates - Both create tombstones
>
>
>
> Quick Question,
>
>
>
> Our team is under much debate, we are trying to find out if an Update on a
> row with a TTL will create a tombstone.
>
>
>
> E.G
>
>
>
> We have one row with a TTL, if we keep “updating” that row before the TTL
> is hit, will a tombstone be created.
>
> I believe it will, but want to confirm.
>
>
>
> So if that’s is  true,
>
> And if our TTL is 10 seconds and we “update” the row every second, will 10
> tombstones be created after 10 seconds? Or just 1?
>
> (and does the same apply for “insert”)
>
>
>
> Regards
>
> Stephen Walsh
>
>
>
>
>
> This email (including any attachments) is proprietary to Aspect Software,
> Inc. and may contain information that is confidential. If you have received
> this message in error, please do not read, copy or forward this message.
> Please notify the sender immediately, delete it from your system and
> destroy any copies. You may not further disclose or distribute this email
> or its attachments.
>


Re: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
Thanks Peter, that's interesting. I didn't know of that option.

If updates don't create tombstones (and I'm already taking pains to ensure
no nulls are present in queries), then is there no cost to just submitting
an update for everything, regardless of whether lastModified has changed?

Thanks.

On Wed, May 13, 2015 at 3:38 PM, Peer, Oded  wrote:

>  You can use the “last modified” value as the TIMESTAMP for your UPDATE
> operation.
>
> This way the values will only be updated if lastModified date > the
> lastModified you have in the DB.
>
>
>
> Updates to values don’t create tombstones. Only deletes (either by
> executing delete, inserting a null value or by setting a TTL) create
> tombstones.
>
>
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 1:27 PM
> *To:* user@cassandra.apache.org
> *Subject:* Updating only modified records (where lastModified < current
> date)
>
>
>
> I'm running some ETL jobs, where the pattern is the following:
>
>
>
> 1- Get some records from an external API,
>
>
>
> 2- For each record, see if its lastModified date > the lastModified i have
> in db (or if I don't have that record in db)
>
>
>
> 3- If lastModified < dbLastModified, the item wasn't changed, ignore it.
> Otherwise, run an update query and update that record.
>
>
>
> (It is rare for existing records to get updated, so I'm not that concerned
> about tombstones).
>
>
>
> The problem however is, since I have to query each record's lastModified,
> one at a time, that's adding a major bottleneck to my job.
>
>
>
> E.g if I have 6k records, I have to run a total of 6k 'select lastModified
> from myTable where id = ?' queries.
>
>
>
> Is there a better way, am I doing anything wrong, etc? Any suggestions
> would be appreciated.
>
>
>
> Thanks.
>


RE: Insert Vs Updates - Both create tombstones

2015-05-13 Thread Peer, Oded
Assuming that each update also resets the TTL on the columns it writes, a 
tombstone won't be created for those columns.
Remember that TTL is set on columns (or "cells"), not on rows, so your 
description of updating a row is slightly misleading. If every query updates 
different columns, then different columns might expire at different times.
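
The per-cell behaviour described above can be sketched with a toy model. This
is not Cassandra code: the class CellTtlSketch and its methods are
hypothetical, time is passed in explicitly as seconds, and the model only
tracks when each column would expire.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of per-column TTLs; NOT Cassandra code. All names are hypothetical.
class CellTtlSketch {
    // Column name -> expiry time in seconds (write time + TTL).
    private final Map<String, Long> expiresAt = new HashMap<>();

    // Writing a column with a TTL (re)sets the expiry of that column only.
    void write(String column, long nowSeconds, long ttlSeconds) {
        expiresAt.put(column, nowSeconds + ttlSeconds);
    }

    // A column is live until its own expiry time is reached.
    boolean isLive(String column, long nowSeconds) {
        Long expiry = expiresAt.get(column);
        return expiry != null && nowSeconds < expiry;
    }

    public static void main(String[] args) {
        CellTtlSketch row = new CellTtlSketch();
        row.write("a", 0, 10);  // both columns written at t=0 with TTL 10
        row.write("b", 0, 10);
        row.write("a", 5, 10);  // only column "a" is updated at t=5
        System.out.println("a live at t=12: " + row.isLive("a", 12));
        System.out.println("b live at t=12: " + row.isLive("b", 12));
    }
}
```

In this sketch, column "a" stays live past t=10 because its TTL was reset at
t=5, while column "b" expires on its original schedule.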

From: Walsh, Stephen [mailto:stephen.wa...@aspect.com]
Sent: Wednesday, May 13, 2015 1:35 PM
To: user@cassandra.apache.org
Subject: Insert Vs Updates - Both create tombstones

Quick Question,

Our team is debating whether an update on a row 
with a TTL will create a tombstone.

E.g.:

We have one row with a TTL. If we keep "updating" that row before the TTL is 
hit, will a tombstone be created?
I believe it will, but want to confirm.

So if that is true,
and if our TTL is 10 seconds and we "update" the row every second, will 10 
tombstones be created after 10 seconds? Or just 1?
(And does the same apply for "insert"?)

Regards
Stephen Walsh


RE: Updating only modified records (where lastModified < current date)

2015-05-13 Thread Peer, Oded
You can use the “last modified” value as the TIMESTAMP for your UPDATE 
operation.
This way the values will only take effect if the new lastModified date is 
later than the lastModified you have in the DB.

Updates to values don’t create tombstones. Only deletes (either by executing 
a delete, inserting a null value or setting a TTL) create tombstones.
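
The effect of writing with lastModified as the timestamp can be sketched with
a toy model of last-write-wins. This is not Cassandra code: the class
LastWriteWins and its methods are hypothetical, the model only shows which
value a reader would see, and it simplifies the case of equal timestamps by
keeping the stored value.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of last-write-wins; NOT Cassandra code. All names are hypothetical.
class LastWriteWins {
    // A stored value together with the write timestamp (microseconds) it carries.
    static final class Cell {
        final String value;
        final long timestampMicros;

        Cell(String value, long timestampMicros) {
            this.value = value;
            this.timestampMicros = timestampMicros;
        }
    }

    private final Map<String, Cell> cells = new HashMap<>();

    // Applies a write. The incoming value becomes visible only if its
    // timestamp is strictly newer than the stored one (simplified tie rule).
    // Returns true when the write became the visible value.
    boolean write(String id, String value, long timestampMicros) {
        Cell current = cells.get(id);
        if (current != null && current.timestampMicros >= timestampMicros) {
            return false; // older or same-age write loses
        }
        cells.put(id, new Cell(value, timestampMicros));
        return true;
    }

    // Returns the visible value for the id, or null if nothing was written.
    String read(String id) {
        Cell cell = cells.get(id);
        return cell == null ? null : cell.value;
    }

    public static void main(String[] args) {
        LastWriteWins table = new LastWriteWins();
        table.write("a", "v2", 2_000L); // record with the newer lastModified
        table.write("a", "v1", 1_000L); // stale record arrives afterwards
        System.out.println(table.read("a"));
    }
}
```

In this model the stale write is simply ignored, so the job would not need to
read the stored lastModified before deciding whether to write.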


From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 1:27 PM
To: user@cassandra.apache.org
Subject: Updating only modified records (where lastModified < current date)

I'm running some ETL jobs, where the pattern is the following:

1- Get some records from an external API.

2- For each record, see if its lastModified date > the lastModified I have in 
the db (or if I don't have that record in the db).

3- If lastModified <= dbLastModified, the item wasn't changed, so ignore it. 
Otherwise, run an update query and update that record.

(It is rare for existing records to get updated, so I'm not that concerned 
about tombstones.)

The problem, however, is that I have to query each record's lastModified one 
at a time, which adds a major bottleneck to my job.

E.g., if I have 6k records, I have to run a total of 6k 'select lastModified from 
myTable where id = ?' queries.

Is there a better way? Am I doing anything wrong? Any suggestions would be 
appreciated.

Thanks.


Insert Vs Updates - Both create tombstones

2015-05-13 Thread Walsh, Stephen
Quick Question,

Our team is debating whether an update on a row 
with a TTL will create a tombstone.

E.g.:

We have one row with a TTL. If we keep "updating" that row before the TTL is 
hit, will a tombstone be created?
I believe it will, but want to confirm.

So if that is true,
and if our TTL is 10 seconds and we "update" the row every second, will 10 
tombstones be created after 10 seconds? Or just 1?
(And does the same apply for "insert"?)

Regards
Stephen Walsh


Updating only modified records (where lastModified < current date)

2015-05-13 Thread Ali Akhtar
I'm running some ETL jobs, where the pattern is the following:

1- Get some records from an external API.

2- For each record, see if its lastModified date > the lastModified I have
in the db (or if I don't have that record in the db).

3- If lastModified <= dbLastModified, the item wasn't changed, so ignore it.
Otherwise, run an update query and update that record.

(It is rare for existing records to get updated, so I'm not that concerned
about tombstones.)

The problem, however, is that I have to query each record's lastModified
one at a time, which adds a major bottleneck to my job.

E.g., if I have 6k records, I have to run a total of 6k 'select lastModified
from myTable where id = ?' queries.

Is there a better way? Am I doing anything wrong? Any suggestions
would be appreciated.

Thanks.


Re: Consistency Issues

2015-05-13 Thread Jared Rodriguez
Thanks for the feedback.  We have dug in deeper and upgraded to Cassandra
2.0.14, and we are seeing the same issue.  What appears to be happening is
that when a record is first written, the first read is fine.  But if we
immediately update that record with a second write, the second read is
problematic.

We have a 4-node cluster and a replication factor of 2.  What seems to be
happening is that on the initial write the record is sent to nodes A and B.
If a second write (an update) of the record occurs while the record is still
in the memtable and not yet written to the sstable of A or B, the next read
returns nothing.

We are continuing to dig in and get as much detail as possible before
opening this as a JIRA.

On Tue, May 12, 2015 at 6:51 PM, Robert Coli  wrote:

> On Tue, May 12, 2015 at 12:35 PM, Michael Shuler 
> wrote:
>
>> This is a 4 node cluster running Cassandra 2.0.6
>>>
>>
>> Can you reproduce the same issue on 2.0.14? (or better yet, the
>> cassandra-2.0 branch HEAD, which will soon ship 2.0.15) If you get the same
>> results, please, open a JIRA with the reproduction steps.
>
>
> And if you do file such a JIRA, please let the list know the JIRA URL, to
> close the loop!
>
> =Rob
>
>



-- 
Jared Rodriguez