Store counter with non-counter column in the same column family?

2014-12-22 Thread ziju feng
I was wondering if there is a plan to allow creating counter columns and
standard columns in the same table.

Here is my use case:
I want to use a counter to count how many users like a given item in my
application. The like count needs to be returned along with the item's
details in queries. To support querying items in different ways, I use both
application-maintained denormalized index tables and DSE search for
indexing. (DSE search is also used for text search.)

Since the current counter implementation doesn't allow counter columns and
non-counter columns in the same table, I have to propagate the current
count from the counter table to the main item table and the index tables.
That way, the index tables can return like counts without extra requests to
the counter table, and DSE search can index the like count column in the
main item table to support like-count queries (such as sorting by like
count).

IMHO, the only way to sync data between the counter table and a normal
table within a reasonable time (sub-second) currently is to read the current
value from the counter table right after the update. However, this suffers
from several issues:
1. Read-after-write may not return the correct count when the replication
factor > 1, unless consistency level ALL/LOCAL_ALL is used.
2. There are two extra non-parallelizable round-trips between the
application server and Cassandra, which can have a great impact on
performance.
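
The first issue above can be illustrated with a toy model. This is only a sketch in plain Python, not Cassandra code: the `Replica` class, the `increment` and `read` helpers, and the rule that a read returns the highest value among the contacted replicas are all assumptions made for illustration.

```python
# Toy model (assumption): three replicas, a write that has reached only one
# of them so far, followed by reads at different consistency levels.
class Replica:
    def __init__(self):
        self.likes = 0


def increment(replicas, acked):
    """Apply +1 only on the replicas that received the write before the read."""
    for i in acked:
        replicas[i].likes += 1


def read(replicas, contacted):
    """Return the highest value among the contacted replicas (simplified)."""
    return max(replicas[i].likes for i in contacted)


replicas = [Replica(), Replica(), Replica()]
increment(replicas, acked=[0])                # write has reached replica 0 only
stale = read(replicas, contacted=[1])         # read served by a lagging replica
fresh = read(replicas, contacted=[0, 1, 2])   # read that contacts every replica
print(stale, fresh)
```

In this model the single-replica read misses the increment, while the read that contacts all replicas sees it, at the cost of needing every replica to be reachable.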

If it were possible to store a counter in a standard column family, only
one write would be needed to update the like count in the main table. The
counter value would also be eventually synced between replicas, so the
application would not need an extra mechanism, such as a scheduled task, to
get the correct counts.

A related issue is lifting the limitation that counter columns and normal
columns cannot be updated in one batch. It is quite common to keep not only
a counter for statistics but also the details, such as the relation of
which user likes which items in my use case.

Any ideas?


Re: Store counter with non-counter column in the same column family?

2014-12-22 Thread DuyHai Doan
It's not possible to mix counter and non-counter columns because currently
the semantics of a counter are increment/decrement only (thus NOT
idempotent), which requires special handling compared to other C* columns.
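
To make the idempotency point concrete, here is a minimal sketch in plain Python. The `row` dict and the `set_value` and `increment` helpers are hypothetical stand-ins, not Cassandra APIs; the loop models a client that retries because it cannot tell whether the first attempt was applied.

```python
def set_value(row, value):
    """Regular column write: overwrites, so replaying it changes nothing."""
    row["title"] = value


def increment(row, delta):
    """Counter write: adds, so replaying it changes the result."""
    row["likes"] += delta


row = {"title": "", "likes": 0}

# The first attempt times out after being applied, so the client retries.
for _ in range(2):
    set_value(row, "My item")
for _ in range(2):
    increment(row, 1)

print(row)
```

The replayed overwrite leaves the title as it was, but the replayed increment records two likes for a single user action.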




Re: Store counter with non-counter column in the same column family?

2014-12-22 Thread Ryan Svihla
You can cheat by making the non-counter column part of your primary key
(specifically, a clustering column), but the cases where this could work
are limited and the cases where it is a good idea are even rarer.
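
One reason the trick is limited can be sketched as follows. This is plain Python modelling the table as a dict keyed by the primary key; the table and column names in the comment are hypothetical, and the CQL shape shown there is an assumption about how such a table might be declared.

```python
from collections import defaultdict

# Assumed CQL shape (hypothetical names):
#   CREATE TABLE item_likes (item_id int, title text, likes counter,
#                            PRIMARY KEY (item_id, title));
# Modelled here as a mapping from primary key to counter value.
table = defaultdict(int)   # (item_id, title) -> likes


def like(item_id, title):
    """Increment the counter of the row identified by the full primary key."""
    table[(item_id, title)] += 1


like(1, "Old title")
like(1, "Old title")
like(1, "New title")   # a changed title addresses a different row, starting at 0

rows = sorted(table.items())
print(rows)
```

Because the non-counter value is part of the key, it cannot change without splitting the count across rows, so this only suits values that never change.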

As for batches, using counters in batches is already not a well-regarded
concept, and counter batches have a number of troubling behaviors: as
already stated, increments aren't idempotent, and a batch implies retry.

As for DSE search, it's doing something drastically different internally,
and the type of counting it's doing is many orders of magnitude faster
(think bitmask-style matching plus proper async 2i to minimize fanout cost).

Generally speaking, counting accurately while being highly available
creates an interesting set of logical tradeoffs. For example, what do you do
if you're not able to communicate between two data centers, but both are up
and serving likes quite happily? Is your counting down? Do you keep counting
but serve up different answers? More precisely, since problems are rarely
data center to data center but more frequently between replicas, how much
availability are you willing to give up in exchange for a globally accurate
count?
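
The "keep counting but serve up different answers" option can be sketched with a toy per-data-center shard model. This is plain Python for illustration, not how Cassandra implements counters: the shard names, the `total` and `merge` helpers, and the increment-only restriction are all assumptions.

```python
def total(shards):
    """The count a data center reports: the sum of all shards it knows about."""
    return sum(shards.values())


def merge(a, b):
    """Keep the larger value per shard; valid here because shards only grow."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


# State both sides agreed on before the partition.
dc1 = {"dc1": 5, "dc2": 3}
dc2 = {"dc1": 5, "dc2": 3}

# During the partition each data center increments only its own shard.
dc1["dc1"] += 2
dc2["dc2"] += 4

during = (total(dc1), total(dc2))   # the two sides report different counts
after = total(merge(dc1, dc2))      # once they can talk again, they converge
print(during, after)
```

Both sides stay available and eventually agree, but neither reported the globally accurate count while the partition lasted.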


Re: Store counter with non-counter column in the same column family?

2014-12-22 Thread ziju feng
I just skimmed through JIRA
https://issues.apache.org/jira/browse/CASSANDRA-4775 and it seems there
has been some effort to make updates idempotent. Perhaps the problem can be
fixed in the near future?

Anyway, what is the current best practice for such a use case (counting and
displaying counts in different queries)? I don't need a 100% accurate count
or strong consistency. Performance and application complexity are my main
concerns.

Thanks



Re: Store counter with non-counter column in the same column family?

2014-12-22 Thread Ryan Svihla
An increment wouldn't be idempotent from the client unless you knew the
count at the time of the update (which you could do with LWT, but that has
a pretty harsh performance cost). That particular JIRA is about how counters
are laid out and about avoiding race conditions between nodes, which was
resolved in 2.1 beta 1 (and is now officially out in the 2.1.x branch).
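
The "knew the count at the time of the update" idea can be sketched as a compare-and-set. This is a plain Python stand-in, not LWT itself: the `Row` class and its `set_if` method are hypothetical, and the sketch assumes the count is held in a regular integer column rather than a counter column.

```python
class Row:
    def __init__(self):
        self.likes = 0

    def set_if(self, expected, new):
        """Compare-and-set: apply the write only if the current value matches."""
        if self.likes != expected:
            return False
        self.likes = new
        return True


row = Row()
seen = row.likes                      # read the count first
first = row.set_if(seen, seen + 1)    # applied: the value was still `seen`
retry = row.set_if(seen, seen + 1)    # same request replayed: rejected
print(first, retry, row.likes)
```

The replayed request is harmless because its condition no longer holds, which is what makes the update safe to retry; the price is the extra read and the coordination behind the conditional write.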

General improvements to counters in 2.1 are laid out here:
http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters

As for best practice, the answer is multiple tables for multiple query
paths, or you can use something like Solr or Spark. Take a look at the
Spark Cassandra connector for a good way to count over lots of data from
lots of different query paths:
https://github.com/datastax/spark-cassandra-connector.




Re: Store counter with non-counter column in the same column family?

2014-12-22 Thread ziju feng
Thanks for the advice, I'll definitely take a look at how Spark works and
how it can help with counting.

One last question: my current implementation of counting is 1) increment
the counter, 2) read the counter immediately after the write, and 3) write
the count to multiple tables for different query paths and to Solr. If I
switch to Spark, do I still need to use a counter, or will the counting be
done by Spark on a regular table?


Re: Store counter with non-counter column in the same column family?

2014-12-22 Thread Ryan Svihla
Spark can count a regular table. Spark SQL would most likely be the easiest
thing to get started with.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md

Go down to the Spark SQL section to get some idea of the ease of use.
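
As a rough picture of what counting a regular table means, here is a plain Python stand-in for a grouped count. It does not use Spark; the `user_likes` rows and their column layout are hypothetical, and the aggregate is roughly what a `SELECT item_id, COUNT(*) ... GROUP BY item_id` query would compute over such a table.

```python
from collections import Counter

# Rows of a hypothetical regular table user_likes(user_id, item_id),
# where (user_id, item_id) is the primary key.
user_likes = [
    ("alice", "item1"),
    ("bob", "item1"),
    ("alice", "item2"),
    ("alice", "item1"),   # a retried write: same primary key, so the same row
]

# A write to a regular table overwrites by key, so duplicates collapse.
distinct = set(user_likes)

# Grouped count: how many distinct users like each item.
like_counts = Counter(item for _, item in distinct)
print(sorted(like_counts.items()))
```

With this layout the relation rows are the source of truth and a retried write cannot inflate the count, so no counter column is needed for the aggregate.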