Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ali Akhtar
When specifying USING TIMESTAMP, the docs say to provide microseconds, but
where do these microseconds come from? I have regular java.util.Date
objects and can get the time in milliseconds (i.e. the Unix timestamp). How
would I convert that to microseconds?

On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar ali.rac...@gmail.com wrote:

 Thanks Peter, that's interesting. I didn't know of that option.

 If updates don't create tombstones (and I'm already taking pains to ensure
 no nulls are present in queries), then is there no cost to just submitting
 an update for everything regardless of whether lastModified has changed?

 Thanks.

 On Wed, May 13, 2015 at 3:38 PM, Peer, Oded oded.p...@rsa.com wrote:

 You can use the “last modified” value as the TIMESTAMP for your UPDATE
 operation.

 This way the values will only be updated if lastModified date > the
 lastModified you have in the DB.

 Updates to values don’t create tombstones. Only deletes (either by
 executing delete, inserting a null value or by setting a TTL) create
 tombstones.

 *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
 *Sent:* Wednesday, May 13, 2015 1:27 PM
 *To:* user@cassandra.apache.org
 *Subject:* Updating only modified records (where lastModified  current
 date)

 I'm running some ETL jobs, where the pattern is the following:

 1- Get some records from an external API,

 2- For each record, see if its lastModified date > the lastModified I
 have in db (or if I don't have that record in db)

 3- If lastModified <= dbLastModified, the item wasn't changed, ignore it.
 Otherwise, run an update query and update that record.

 (It is rare for existing records to get updated, so I'm not that
 concerned about tombstones).

 The problem however is, since I have to query each record's lastModified,
 one at a time, that's adding a major bottleneck to my job.

 E.g. if I have 6k records, I have to run a total of 6k 'select
 lastModified from myTable where id = ?' queries.

 Is there a better way, am I doing anything wrong, etc? Any suggestions
 would be appreciated.

 Thanks.

Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ali Akhtar
Would TimeUnit.MILLISECONDS.toMicros(myDate.getTime()) work for
producing the microsecond timestamp?
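
For reference, a minimal sketch of that conversion using only the JDK. The class name WriteTimestamps and the sample instant are made up for illustration. TimeUnit.MILLISECONDS.toMicros gives the same result as multiplying the millisecond value by 1000, except that it saturates at Long.MAX_VALUE instead of overflowing.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

final class WriteTimestamps {
    // Converts a java.util.Date (millisecond precision) to microseconds since the Unix epoch.
    static long toMicros(Date date) {
        return TimeUnit.MILLISECONDS.toMicros(date.getTime());
    }

    public static void main(String[] args) {
        // Hypothetical lastModified value: an arbitrary instant on 2015-05-13 (UTC).
        Date lastModified = new Date(1431518820000L);
        long micros = toMicros(lastModified);

        System.out.println(micros);
        // Same value as the plain "multiply by 1000" approach.
        System.out.println(micros == lastModified.getTime() * 1000L);
    }
}
```

Since java.util.Date only carries millisecond precision, the last three digits of the microsecond value are always zero.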


Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ali Akhtar
Is there a way in the Java driver to get the number of rows that an update
was applied to?


Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ken Hancock
While updates don't create tombstones, overwrites create a similar
performance penalty at the read phase.  That key will need to be fetched
from every SSTable where it resides so the most recent column can be
returned.
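
A toy model of that read path, as a sketch only: each SSTable is represented as a plain map, and a read checks every one that holds the key before returning the newest cell. The names (SSTableReadModel, Cell, ReadResult) are hypothetical, and real Cassandra has optimizations such as bloom filters that this model ignores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SSTableReadModel {
    // One stored version of a column: its value plus the write timestamp in microseconds.
    static final class Cell {
        final String value;
        final long timestampMicros;

        Cell(String value, long timestampMicros) {
            this.value = value;
            this.timestampMicros = timestampMicros;
        }
    }

    // What a read returns in this model, plus how many SSTables held a copy of the key.
    static final class ReadResult {
        final Cell newest;
        final int sstablesTouched;

        ReadResult(Cell newest, int sstablesTouched) {
            this.newest = newest;
            this.sstablesTouched = sstablesTouched;
        }
    }

    // Checks every SSTable for the key and keeps the cell with the highest timestamp.
    static ReadResult read(List<Map<String, Cell>> sstables, String key) {
        Cell newest = null;
        int touched = 0;
        for (Map<String, Cell> sstable : sstables) {
            Cell cell = sstable.get(key);
            if (cell == null) {
                continue;
            }
            touched++;
            if (newest == null || cell.timestampMicros > newest.timestampMicros) {
                newest = cell;
            }
        }
        return new ReadResult(newest, touched);
    }

    // Builds one SSTable per write, as if each overwrite had been flushed separately.
    static List<Map<String, Cell>> oneSSTablePerWrite(String key, String... values) {
        List<Map<String, Cell>> sstables = new ArrayList<>();
        long timestamp = 1_000L;
        for (String value : values) {
            Map<String, Cell> sstable = new HashMap<>();
            sstable.put(key, new Cell(value, timestamp));
            sstables.add(sstable);
            timestamp += 1_000L;
        }
        return sstables;
    }

    public static void main(String[] args) {
        List<Map<String, Cell>> sstables = oneSSTablePerWrite("id-1", "v1", "v2", "v3");
        ReadResult result = read(sstables, "id-1");
        System.out.println(result.newest.value + " found after checking "
                + result.sstablesTouched + " SSTables");
    }
}
```

The point of the model is the counter: the more separately flushed overwrites a key has, the more places a read has to look until compaction merges them.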




RE: Updating only modified records (where lastModified current date)

2015-05-13 Thread Peer, Oded
You can use the “last modified” value as the TIMESTAMP for your UPDATE 
operation.
This way the values will only be updated if lastModified date > the
lastModified you have in the DB.

Updates to values don’t create tombstones. Only deletes (either by executing 
delete, inserting a null value or by setting a TTL) create tombstones.
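
As a rough sketch of what such a statement could look like from Java: the table name my_table and the columns payload, last_modified and id are assumptions for illustration, not taken from this thread. In CQL the USING TIMESTAMP clause of an UPDATE goes before SET, and the value is in microseconds. The code below only builds the statement text and the bind values; how they are passed to a driver is left out.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

final class TimestampedUpdate {
    // Hypothetical table and column names, for illustration only.
    static final String UPDATE_CQL =
            "UPDATE my_table USING TIMESTAMP ? SET payload = ?, last_modified = ? WHERE id = ?";

    // Returns the values in the same order as the bind markers in UPDATE_CQL.
    static Object[] bindValues(String id, String payload, Date lastModified) {
        // The record's own lastModified becomes the write timestamp, in microseconds.
        long writeTimeMicros = TimeUnit.MILLISECONDS.toMicros(lastModified.getTime());
        return new Object[] { writeTimeMicros, payload, lastModified, id };
    }

    public static void main(String[] args) {
        Date lastModified = new Date(1431518820000L); // sample value
        Object[] values = bindValues("id-1", "some payload", lastModified);

        System.out.println(UPDATE_CQL);
        System.out.println("write timestamp (micros): " + values[0]);
    }
}
```

This only behaves as described if every write to the row uses lastModified as its timestamp; a row first written with the default (current time) timestamp would shadow later writes that carry an older lastModified.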



Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ali Akhtar
Thanks Peter, that's interesting. I didn't know of that option.

If updates don't create tombstones (and I'm already taking pains to ensure
no nulls are present in queries), then is there no cost to just submitting
an update for everything regardless of whether lastModified has changed?

Thanks.


Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ali Akhtar
It's rare for an existing record to have changes, but the ETL job runs every
hour, so it will send updates each time, regardless of whether there
were changes or not.

(I'm assuming that USING TIMESTAMP here will avoid the compaction overhead,
since that will cause it to not run any updates unless the timestamp is
actually > the last update timestamp?)

Also, is there a way to get the number of rows which were updated / ignored?


Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ali Akhtar
Thanks. So supplying the timestamp with the update (via USING TIMESTAMP) should
fix that, right? (By skipping updates where lastModified <= dbLastModified.)

I'm currently doing TimeUnit.MILLISECONDS.toMicros(myDate.getTime()), and
that has worked for inserts. However, how do I verify that future updates
are ignored and aren't run again?


RE: Updating only modified records (where lastModified current date)

2015-05-13 Thread Peer, Oded
The cost of issuing an UPDATE that won’t update anything is compaction 
overhead. Since you stated it’s rare for rows to be updated then the overhead 
should be negligible.

The easiest way to convert a milliseconds timestamp long value to microseconds 
is to multiply by 1000.


RE: Updating only modified records (where lastModified current date)

2015-05-13 Thread Peer, Oded
USING TIMESTAMP doesn’t avoid compaction overhead.
When you modify data, the value is stored along with a timestamp indicating
when it was written.
Assume timestamps T1 < T2, and you stored value V with timestamp T2. Then you
store V’ with timestamp T1.
Now you have two versions of the value in the DB: (V, T2) and (V’, T1).
When you read the value from the DB you read both (V, T2) and (V’, T1);
Cassandra resolves the conflict by comparing the timestamps and returns V.
Compaction will later remove (V’, T1) from the DB.
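
The T1 < T2 example can be sketched as a small in-memory model. This is an illustration of the rule described above, not Cassandra's actual code; the names LastWriteWins, reconcile, read and compact are made up, the list of versions is assumed to be non-empty, and equal timestamps are not modeled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class LastWriteWins {
    // A stored value together with its write timestamp.
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Picks the cell with the higher timestamp. Ties keep the first argument;
    // real tie-breaking is outside the scope of this sketch.
    static Cell reconcile(Cell a, Cell b) {
        return b.timestamp > a.timestamp ? b : a;
    }

    // A read sees every stored version and returns the reconciled winner.
    static Cell read(List<Cell> versions) {
        Cell winner = versions.get(0);
        for (Cell candidate : versions) {
            winner = reconcile(winner, candidate);
        }
        return winner;
    }

    // Compaction keeps only the winning version and drops the shadowed ones.
    static List<Cell> compact(List<Cell> versions) {
        return new ArrayList<>(Collections.singletonList(read(versions)));
    }

    public static void main(String[] args) {
        long t1 = 1_000L;
        long t2 = 2_000L;

        List<Cell> versions = new ArrayList<>();
        versions.add(new Cell("V", t2));   // written first, with the newer timestamp
        versions.add(new Cell("V'", t1));  // written later, with the older timestamp

        System.out.println("read returns: " + read(versions).value);
        System.out.println("versions before compaction: " + versions.size());
        System.out.println("versions after compaction: " + compact(versions).size());
    }
}
```

Note that the older write is still accepted and stored; it is only shadowed at read time and cleaned up later.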

I don’t understand the ETL use case and its relevance here. Can you provide 
more details?

UPDATE in Cassandra updates specific rows. All of them are updated, nothing is 
ignored.



Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ali Akhtar
> I don’t understand the ETL use case and its relevance here. Can you
> provide more details?

Basically, every 1 hour a job runs which queries an external API and gets
some records. Then, I want to take only new or updated records, and insert
/ update them in cassandra. For records that are already in cassandra and
aren't modified, I want to ignore them.

Each record returns a lastModified datetime, I want to use that to
determine whether a record was changed or not (if it was, it'd be updated,
if not, it'd be ignored).

The issue was, I'm having to do a 'select lastModified from table where id
= ?' query for every record, in order to determine if db lastModified < api
lastModified or not. I was wondering if there was a way to avoid that.

If I use 'USING TIMESTAMP', would subsequent updates where lastModified is
a value that was previously used, still create that overhead, or will they
be ignored?

E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued
another update where TIMESTAMP is still X, will that 2nd update essentially
get ignored, or will it cause any overhead?


Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ali Akhtar
But your previous email talked about when T1 is different:

> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then
> you store V’ with timestamp T1.

What if you issue an update twice, but with the same timestamp? E.g. if you
ran:

UPDATE ... USING TIMESTAMP 1000 SET ... WHERE foo = bar

and 1 hour later, you ran exactly the same query again. In this case, the
value of T is the same for both queries. Would that still cause multiple
values to be stored?

On Wed, May 13, 2015 at 5:17 PM, Peer, Oded oded.p...@rsa.com wrote:

  It will cause an overhead (compaction and read) as I described in the
 previous email.
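
To make the same-timestamp case concrete, here is a toy model of the hourly job re-sending an unchanged record. It is a sketch under assumptions: HourlyEtlModel and its methods are hypothetical names, and every write is simply appended, which roughly matches writes that arrive an hour apart and likely end up in different SSTables. Because the value and timestamp are identical each time, tie-breaking does not matter here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HourlyEtlModel {
    // A stored value together with its write timestamp in microseconds.
    static final class Cell {
        final String value;
        final long timestampMicros;

        Cell(String value, long timestampMicros) {
            this.value = value;
            this.timestampMicros = timestampMicros;
        }
    }

    // Every write is appended; nothing is rejected or skipped at write time.
    private final Map<String, List<Cell>> storedVersions = new HashMap<>();

    void update(String id, String value, long timestampMicros) {
        storedVersions.computeIfAbsent(id, key -> new ArrayList<>())
                .add(new Cell(value, timestampMicros));
    }

    // Number of copies held for this id until something like compaction merges them.
    int versionCount(String id) {
        List<Cell> versions = storedVersions.get(id);
        return versions == null ? 0 : versions.size();
    }

    // The visible value is the one with the highest timestamp, or null if the id is unknown.
    String visibleValue(String id) {
        List<Cell> versions = storedVersions.get(id);
        if (versions == null) {
            return null;
        }
        Cell winner = versions.get(0);
        for (Cell candidate : versions) {
            if (candidate.timestampMicros > winner.timestampMicros) {
                winner = candidate;
            }
        }
        return winner.value;
    }

    public static void main(String[] args) {
        HourlyEtlModel table = new HourlyEtlModel();

        // Three hourly runs that re-send the same unchanged record with the same timestamp.
        for (int run = 0; run < 3; run++) {
            table.update("id-1", "payload", 1_000L);
        }

        System.out.println("stored versions: " + table.versionCount("id-1"));
        System.out.println("visible value: " + table.visibleValue("id-1"));
    }
}
```

In this model the visible data never changes after the first run, but each re-run still adds a copy that reads have to reconcile and compaction has to clean up.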



 *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
 *Sent:* Wednesday, May 13, 2015 3:13 PM

 *To:* user@cassandra.apache.org
 *Subject:* Re: Updating only modified records (where lastModified 
 current date)



  I don’t understand the ETL use case and its relevance here. Can you
 provide more details?



 Basically, every 1 hour a job runs which queries an external API and gets
 some records. Then, I want to take only new or updated records, and insert
 / update them in cassandra. For records that are already in cassandra and
 aren't modified, I want to ignore them.



 Each record returns a lastModified datetime, I want to use that to
 determine whether a record was changed or not (if it was, it'd be updated,
 if not, it'd be ignored).



 The issue was, I'm having to do a 'select lastModified from table where id
 = ?' query for every record, in order to determine if db lastModified  api
 lastModified or not. I was wondering if there was a way to avoid that.



 If I use 'USING TIMESTAMP', would subsequent updates where lastModified is
 a value that was previously used, still create that overhead, or will they
 be ignored?



 E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued
 another update where TIMESTAMP is still X, will that 2nd update essentially
 get ignored, or will it cause any overhead?



 On Wed, May 13, 2015 at 5:02 PM, Peer, Oded oded.p...@rsa.com wrote:

 USING TIMESTAMP doesn’t avoid compaction overhead.

 When you modify data, the value is stored along with the timestamp of the
 write.

 Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then
 you store V’ with timestamp T1.

 Now you have two values in the DB: (V, T2) and (V’, T1).

 When you read the value from the DB you read both (V, T2) and (V’, T1);
 Cassandra resolves the conflict by comparing the timestamps and returns V.

 Compaction will later remove (V’, T1) from the DB.
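
 The write, read and compaction steps described here can be sketched as a
 tiny in-memory model. This is only an illustration under stated
 assumptions, not Cassandra's storage engine: the class and method names are
 hypothetical, and tie-breaking between equal timestamps is deliberately not
 modelled (on a tie this sketch simply keeps the fragment written first).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of timestamp-based conflict resolution.
public class LastWriteWinsSketch {

    /** One stored fragment: a value plus the timestamp it was written with. */
    private static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final List<Cell> fragments = new ArrayList<>();

    /** Every write is appended; nothing is compared or rejected at write time. */
    public void write(String value, long timestamp) {
        fragments.add(new Cell(value, timestamp));
    }

    /** A read examines all fragments and returns the value with the highest timestamp. */
    public String read() {
        Cell winner = winner();
        return winner == null ? null : winner.value;
    }

    /** Compaction discards every fragment except the current winner. */
    public void compact() {
        Cell winner = winner();
        fragments.clear();
        if (winner != null) {
            fragments.add(winner);
        }
    }

    public int fragmentCount() {
        return fragments.size();
    }

    private Cell winner() {
        Cell best = null;
        for (Cell c : fragments) {
            if (best == null || c.timestamp > best.timestamp) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        LastWriteWinsSketch column = new LastWriteWinsSketch();
        column.write("V", 2000L);   // stored with T2
        column.write("V'", 1000L);  // stored later, but with T1 < T2
        System.out.println(column.read());          // the T2 value wins the read
        System.out.println(column.fragmentCount()); // both fragments are still stored
        column.compact();
        System.out.println(column.fragmentCount()); // only the winner remains
    }
}
```

 In this model the second write never changes what a read returns, yet it
 still occupies space and must be examined on reads until compaction runs,
 which is the overhead discussed in this thread.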



 I don’t understand the ETL use case and its relevance here. Can you
 provide more details?



 UPDATE in Cassandra updates specific rows. All of them are updated,
 nothing is ignored.





 *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
 *Sent:* Wednesday, May 13, 2015 2:43 PM


 *To:* user@cassandra.apache.org
 *Subject:* Re: Updating only modified records (where lastModified 
 current date)



 It's rare for an existing record to have changes, but the ETL job runs
 every hour, therefore it will send updates each time, regardless of whether
 there were changes or not.



 (I'm assuming that USING TIMESTAMP here will avoid the compaction
 overhead, since that will cause it to not run any updates unless the
 timestamp is actually > last update timestamp?)



 Also, is there a way to get the number of rows which were updated /
 ignored?



 On Wed, May 13, 2015 at 4:37 PM, Peer, Oded oded.p...@rsa.com wrote:

 The cost of issuing an UPDATE that won’t update anything is compaction
 overhead. Since you stated it’s rare for rows to be updated then the
 overhead should be negligible.



 The easiest way to convert a milliseconds timestamp long value to
 microseconds is to multiply by 1000.
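
 A minimal sketch of that conversion, assuming the lastModified value is a
 java.util.Date as mentioned in this thread. Multiplying getTime() by 1000
 and calling TimeUnit.MILLISECONDS.toMicros() give the same result for any
 realistic date. The table name (records) and column names (payload, id) in
 the CQL text are hypothetical placeholders.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

public class MicrosTimestamp {

    /** Converts a millisecond-precision Date to microseconds since the epoch. */
    public static long toMicros(Date lastModified) {
        // Same value as lastModified.getTime() * 1000L for realistic dates.
        return TimeUnit.MILLISECONDS.toMicros(lastModified.getTime());
    }

    /** Builds the CQL text; "records", "payload" and "id" are made-up names. */
    public static String updateCql(long timestampMicros) {
        return "UPDATE records USING TIMESTAMP " + timestampMicros
                + " SET payload = ? WHERE id = ?";
    }

    public static void main(String[] args) {
        Date lastModified = new Date(1431520000123L); // milliseconds since the epoch
        long micros = toMicros(lastModified);
        System.out.println(micros);
        System.out.println(updateCql(micros));
    }
}
```

 Note that in CQL the USING TIMESTAMP clause comes before SET, and the
 value is written without an equals sign.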



 *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
 *Sent:* Wednesday, May 13, 2015 2:15 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Updating only modified records (where lastModified 
 current date)



 Would TimeUnit.MILLISECONDS.toMicros(  myDate.getTime() ) work for
 producing the microsecond timestamp ?



 On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar ali.rac...@gmail.com wrote:

 If specifying 'using' timestamp, the docs say to provide microseconds, but
 where are these microseconds obtained from? I have regular java.util.Date
 objects, I can get the time in milliseconds (i.e the unix timestamp), how
 would I convert that to microseconds?



 On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar ali.rac...@gmail.com wrote:

 Thanks Peter, that's interesting. I didn't know of that option.



 If updates don't create tombstones (and i'm already taking pains to ensure
 no nulls are present in queries), then is there no cost to just submitting
 an update for everything regardless of whether lastModified has changed?



 Thanks.



 On Wed, May 13, 2015 at 3:38 PM, Peer, Oded oded.p...@rsa.com wrote:

 You can use the “last modified” value as the TIMESTAMP for your UPDATE
 operation.

 This way

RE: Updating only modified records (where lastModified current date)

2015-05-13 Thread Peer, Oded
It will cause an overhead (compaction and read) as I described in the previous 
email.

From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Wednesday, May 13, 2015 3:13 PM
To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified  current date)

 I don’t understand the ETL use case and its relevance here. Can you provide 
 more details?

Basically, every 1 hour a job runs which queries an external API and gets some 
records. Then, I want to take only new or updated records, and insert / update 
them in cassandra. For records that are already in cassandra and aren't 
modified, I want to ignore them.

Each record returns a lastModified datetime, I want to use that to determine 
whether a record was changed or not (if it was, it'd be updated, if not, it'd 
be ignored).

The issue was, I'm having to do a 'select lastModified from table where id = ?' 
query for every record, in order to determine if db lastModified < api 
lastModified or not. I was wondering if there was a way to avoid that.

If I use 'USING TIMESTAMP', would subsequent updates where lastModified is a 
value that was previously used, still create that overhead, or will they be 
ignored?

E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued 
another update where TIMESTAMP is still X, will that 2nd update essentially get 
ignored, or will it cause any overhead?

Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ali Akhtar
The 6k is only the starting value; it's expected to scale up to ~200 million
records.

On Wed, May 13, 2015 at 5:44 PM, Robert Wille rwi...@fold3.com wrote:

  You could use lightweight transactions to update only if the record is
 newer. It doesn’t avoid the read, it just happens under the covers, so it’s
 not really going to be faster compared to a read-before-write pattern
 (which is an anti-pattern, BTW). It is probably the easiest way to avoid
 getting a whole bunch of copies of each record.

  But even with a read-before-write pattern, I don’t understand why you are
 worried about 6K records per hour. That’s nothing. You’re probably looking
 at several milliseconds to do the read and write for each record (depending
 on your storage, RF and CL), so you’re probably looking at under a minute
 to do 6K records. If you do them in parallel, you’re probably looking at
 several seconds. I don’t get why something that probably takes less than a
 minute that is done once an hour is a problem.

  BTW, I wouldn’t do all 6K in parallel. I’d use some kind of limiter
 (e.g. a semaphore) to ensure that you don’t execute more than X queries at
 a time.

  Robert

Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Robert Wille
You probably shouldn’t use batch updates. Your records are probably unrelated 
to each other, and therefore there really is no reason to use batches. Use 
asynchronous queries to improve performance. executeAsync() is your friend.

A common misconception is that batches will improve performance. They don’t. 
Mostly they just increase the load on your cluster.

In my project, I have written a collection of classes that help me manage 
asynchronous queries. They aren’t complicated and didn’t take very long to 
write, but they take away most of the pain that occurs when you need to execute 
a whole bunch of asynchronous queries, and want to meter them out, wait for 
them to complete, etc. I probably execute 75% of my queries asynchronously. It's 
relatively painless.
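
A helper of the kind described above might look roughly like the following 
sketch. It is not the author's code: the class name MeteredAsync and its 
methods are hypothetical, and the Supplier stands in for whatever starts the 
real asynchronous query (for example a wrapper around the driver's 
executeAsync(), whose future type would need adapting to CompletableFuture). 
A Semaphore caps the number of queries in flight, and awaitAll() waits for 
everything submitted so far.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper: meters out async calls and waits for them to finish.
public class MeteredAsync {
    private final Semaphore permits;
    private final List<CompletableFuture<?>> inFlight = new ArrayList<>();

    public MeteredAsync(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Blocks until a permit is free, then starts the asynchronous call. */
    public <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> asyncCall) {
        permits.acquireUninterruptibly();
        CompletableFuture<T> started;
        try {
            started = asyncCall.get();
        } catch (RuntimeException e) {
            permits.release(); // the call never started, so give the permit back
            throw e;
        }
        // Release the permit when the call completes, successfully or not.
        CompletableFuture<T> tracked = started.whenComplete((result, error) -> permits.release());
        synchronized (inFlight) {
            inFlight.add(tracked);
        }
        return tracked;
    }

    /** Waits for everything submitted so far; rethrows if any call failed. */
    public void awaitAll() {
        CompletableFuture<?>[] all;
        synchronized (inFlight) {
            all = inFlight.toArray(new CompletableFuture<?>[0]);
            inFlight.clear();
        }
        CompletableFuture.allOf(all).join();
    }

    /** Demo with simulated queries; returns the highest concurrency observed. */
    public static int demoMaxConcurrency(int tasks, int limit) {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        MeteredAsync meter = new MeteredAsync(limit);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        try {
            for (int i = 0; i < tasks; i++) {
                meter.submit(() -> CompletableFuture.supplyAsync(() -> {
                    int now = running.incrementAndGet();
                    maxSeen.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(5); // stands in for query latency
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    running.decrementAndGet();
                    return now;
                }, pool));
            }
            meter.awaitAll();
        } finally {
            pool.shutdown();
        }
        return maxSeen.get();
    }

    public static void main(String[] args) {
        System.out.println("max concurrent: " + demoMaxConcurrency(50, 4));
    }
}
```

Blocking the submitting thread on the semaphore is the simplest form of 
back-pressure: the loop producing the queries slows down to the pace at 
which earlier queries complete.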

On May 13, 2015, at 6:51 AM, Ali Akhtar ali.rac...@gmail.com wrote:

Can lightweight txns be used in a batch update?

Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Robert Wille
You could use lightweight transactions to update only if the record is newer. 
It doesn’t avoid the read, it just happens under the covers, so it’s not really 
going to be faster compared to a read-before-write pattern (which is an 
anti-pattern, BTW). It is probably the easiest way to avoid getting a whole 
bunch of copies of each record.

But even with a read-before-write pattern, I don’t understand why you are 
worried about 6K records per hour. That’s nothing. You’re probably looking at 
several milliseconds to do the read and write for each record (depending on 
your storage, RF and CL), so you’re probably looking at under a minute to do 6K 
records. If you do them in parallel, you’re probably looking at several 
seconds. I don’t get why something that probably takes less than a minute that 
is done once an hour is a problem.
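
As a rough back-of-the-envelope version of this estimate: the figures below 
are assumptions picked for illustration (5 ms per read-plus-write, 10 
queries in flight), not measurements, and the class name is hypothetical. 
The last line applies the same arithmetic to the ~200 million records 
mentioned elsewhere in this thread.

```java
public class EtlTimeEstimate {

    /**
     * Estimated wall-clock seconds to process `records` operations that take
     * `millisPerRecord` each, with `parallelism` operations in flight at once.
     */
    public static double estimateSeconds(long records, double millisPerRecord, int parallelism) {
        return records * millisPerRecord / 1000.0 / parallelism;
    }

    public static void main(String[] args) {
        // 6K records, one at a time, at an assumed 5 ms each.
        System.out.println(estimateSeconds(6_000, 5.0, 1));
        // Same workload with an assumed 10 queries in flight.
        System.out.println(estimateSeconds(6_000, 5.0, 10));
        // 200 million records, one at a time: sequential processing stops being viable.
        System.out.println(estimateSeconds(200_000_000L, 5.0, 1));
    }
}
```

Under these assumptions 6K records take about 30 seconds sequentially and 
about 3 seconds with 10 in flight, while 200 million records would take 
about 1,000,000 seconds (over 11 days) sequentially.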

BTW, I wouldn’t do all 6K in parallel. I’d use some kind of limiter (e.g. a 
semaphore) to ensure that you don’t execute more than X queries at a time.

Robert

On May 13, 2015, at 6:20 AM, Ali Akhtar ali.rac...@gmail.com wrote:

But your previous email talked about when T1 is different:

 Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then you 
 store V’ with timestamp T1.

What if you issue an update twice, but with the same timestamp? E.g if you ran:

UPDATE ... USING TIMESTAMP 1000 SET ... WHERE foo = bar

and 1 hour later, you ran exactly the same query again. In this case, the value 
of T is the same for both queries. Would that still cause multiple values to be 
stored?

Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Ali Akhtar
Can lightweight txns be used in a batch update?

On Wed, May 13, 2015 at 5:48 PM, Ali Akhtar ali.rac...@gmail.com wrote:

 The 6k is only the starting value; it's expected to scale up to ~200
 million records.

Re: Updating only modified records (where lastModified current date)

2015-05-13 Thread Robert Coli
On Wed, May 13, 2015 at 4:37 AM, Peer, Oded oded.p...@rsa.com wrote:

  The cost of issuing an UPDATE that won’t update anything is compaction
 overhead. Since you stated it’s rare for rows to be updated then the
 overhead should be negligible.


It's also the cost of seeking into tables which contain the row fragment.

UPDATE with an identical timestamp is, generally speaking, an anti-pattern in
log-structured immutable storage; only acceptable if done very rarely...
probably not if done every hour.

=Rob