Best practice for sorting on a frequently updated column?

2014-12-27 Thread ziju feng
I need to sort data on a frequently updated column, such as the like count of
an item. The common way of getting data sorted in Cassandra is to make the
column to be sorted on a clustering key. However, whenever such a column is
updated, we need to delete the row with the old value and insert one with the
new value, which not only can generate a lot of tombstones but also requires a
read-before-write if we don't know the original value (for example, when using
a counter table to maintain the count and propagating it to the table that
needs to sort on the count).
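
For concreteness, a minimal sketch of that pattern with hypothetical table and
column names; because the sort key is part of the primary key, every count
change is a delete plus an insert:

CREATE TABLE items_by_likes (
    category text,
    like_count int,
    item_id text,
    PRIMARY KEY (category, like_count, item_id)
) WITH CLUSTERING ORDER BY (like_count DESC, item_id ASC);

-- Moving an item from 41 to 42 likes requires knowing the old value:
BEGIN BATCH
    DELETE FROM items_by_likes
        WHERE category = 'food' AND like_count = 41 AND item_id = 'item-1';
    INSERT INTO items_by_likes (category, like_count, item_id)
        VALUES ('food', 42, 'item-1');
APPLY BATCH;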

I was wondering what the best practice is for such a use case. I'm currently
using DSE Search to handle it, but I would like to see a Cassandra-only
solution.

Thanks.


Is compound index a planned feature in 3.0?

2014-12-26 Thread ziju feng
Compound indexes in MongoDB are really useful for queries that involve
filtering/sorting on multiple columns. I was wondering if Cassandra 3.0 is
going to implement this feature.

When I read through JIRA, I only found features like CASSANDRA-6048
(https://issues.apache.org/jira/browse/CASSANDRA-6048), which allows using
multiple single-column indexes in a query by joining predicates. A compound
index is more query-driven and closer to the current application-maintained
index table: it may provide better performance than a single-column index and
can greatly simplify index maintenance during updates compared to an index
table.

Any idea?

Ziju


Re: Is compound index a planned feature in 3.0?

2014-12-26 Thread ziju feng
The global index JIRA actually mentions compound indexes, but it seems that
no JIRA has been created for this feature? Anyway, I think I should wait for
3.0 and see what it brings to indexing. Thanks.

On Fri, Dec 26, 2014 at 6:09 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Many index-related JIRAs are open for 3.x:

 Global indices: https://issues.apache.org/jira/browse/CASSANDRA-6477
 Functional index: https://issues.apache.org/jira/browse/CASSANDRA-7458
 Partial index: https://issues.apache.org/jira/browse/CASSANDRA-7391


Store counter with non-counter column in the same column family?

2014-12-22 Thread ziju feng
I was wondering if there is a plan to allow creating counter columns and
standard columns in the same table.

Here is my use case:
I want to use a counter to count how many users like a given item in my
application. The like count needs to be returned along with the details of an
item in queries. To support querying items in different ways, I use both
application-maintained denormalized index tables and DSE Search for indexing.
(DSE Search is also used for text search.)

Since the current counter implementation doesn't allow counter columns and
non-counter columns in the same table, I have to propagate the current count
from the counter table to the main item table and the index tables, so that
like counts can be returned by those index tables without sending extra
requests to the counter table, and so that DSE Search can build an index on
the like count column in the main item table to support like-count-related
queries (such as sorting by like count).

IMHO, the only way to sync data between the counter table and a normal table
within a reasonable time (sub-second) currently is to read the current value
from the counter table right after the update. However, this suffers from
several issues:
1. Read-after-write may not return the correct count when the replication
factor is > 1, unless consistency level ALL/LOCAL_ALL is used.
2. There are two extra non-parallelizable round trips between the application
server and Cassandra, which can have a great impact on performance.

If it were possible to store counters in a standard column family, only one
write would be needed to update the like count in the main table. The counter
value would also be eventually synced between replicas, so the application
would not need an extra mechanism, such as a scheduled task, to converge on
the correct counts.
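
For illustration, a sketch of the split this forces today, with hypothetical
names; the counter has to live in its own table, and the plain int copy is
what the application keeps in sync:

CREATE TABLE item_like_counter (
    item_id text PRIMARY KEY,
    likes counter
);

CREATE TABLE item (
    item_id text PRIMARY KEY,
    title text,
    like_count int  -- manually propagated from item_like_counter
);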

A related issue is lifting the limitation that counter columns and normal
columns cannot be updated in one batch, since it is quite common not only to
keep a counter for statistics but also to store the details, such as the
relation of which user likes which item in my use case.

Any idea?


Re: Store counter with non-counter column in the same column family?

2014-12-22 Thread ziju feng
I just skimmed through CASSANDRA-4775
(https://issues.apache.org/jira/browse/CASSANDRA-4775), and it seems there
has been some effort to make counter updates idempotent. Perhaps the problem
can be fixed in the near future?

Anyway, what is the current best practice for such a use case (counting and
displaying counts in different queries)? I don't need a 100% accurate count
or strong consistency; performance and application complexity are my main
concerns.

Thanks

On Mon, Dec 22, 2014 at 10:37 PM, Ryan Svihla rsvi...@datastax.com wrote:

 You can cheat by making the non-counter column part of your primary key (a
 clustering column, specifically), but the cases where this can work are
 limited, and the places where it is a good idea are even rarer.

 As for using counters in batches: batches are already not a well-regarded
 concept, and counter batches have a number of troubling behaviors; as already
 stated, increments aren't idempotent, and batch implies retry.

 As for DSE Search, it's doing something drastically different internally,
 and the type of counting it's doing is many orders of magnitude faster (think
 bitmask-style matching plus proper async 2i to minimize fan-out cost).

 Generally speaking, counting accurately while remaining highly available
 creates an interesting set of logical trade-offs. For example, what do you do
 if you're not able to communicate between two data centers, but both are up
 and serving likes quite happily? Is your counting down? Do you keep counting
 but serve up different answers? More to the point, since problems are rarely
 data center to data center but more frequently between replicas: how much
 availability are you willing to give up in exchange for a globally accurate
 count?
 On Dec 22, 2014 6:00 AM, DuyHai Doan doanduy...@gmail.com wrote:

 It's not possible to mix counter and non-counter columns because currently
 the semantics of counters are increment/decrement only (thus NOT idempotent),
 which requires some special handling compared to other C* columns.


Re: Store counter with non-counter column in the same column family?

2014-12-22 Thread ziju feng
Thanks for the advice. I'll definitely take a look at how Spark works and how
it can help with counting.

One last question: my current implementation of counting is to 1) increment
the counter, 2) read the counter immediately after the write, and 3) write the
count to multiple tables for different query paths and to Solr. If I switch to
Spark, do I still need to use counters, or will counting be done by Spark on
regular tables?

On Tue, Dec 23, 2014 at 11:31 AM, Ryan Svihla rsvi...@datastax.com wrote:

 Increments wouldn't be idempotent from the client unless you knew the count
 at the time of the update (which you could do with LWT, but that has pretty
 harsh performance). That particular JIRA is about how counters are laid out
 and about avoiding race conditions between nodes, which was resolved in 2.1
 beta 1 (now officially out in the 2.1.x branch).

 General improvements to counters in 2.1 are laid out here:
 http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters

 As for best practice, the answer is multiple tables for multiple query
 paths, or you can use something like Solr or Spark. Take a look at the Spark
 Cassandra connector for a good way to count lots of data across lots of
 different query paths:
 https://github.com/datastax/spark-cassandra-connector.



Plan to implement server-side synchronization of denormalized data?

2014-11-09 Thread ziju feng
Hi all,

I was wondering if there is any plan to support automatically syncing changes
between an entity table and the tables that contain its denormalized data, on
the server side?

I think many use cases in Cassandra require some level of denormalization.
However, there is currently little server-side support for it: denormalization
has to be done by the driver or even the application, which leads to two
issues:

1. Application complexity: As far as I know, no driver supports propagating
changes from the main entity to its denormalized copies, so users have to
handle data synchronization themselves. There can be a lot of code to write,
and it's quite hard to get right, considering things like what consistency
level to use, sync vs. async updates, reverse index tables, etc.

2. Data consistency: Suppose there is an entity table:
CREATE TABLE entity (
  id text PRIMARY KEY,
  name text,
  value text
);
and an index table on 'name', which also stores 'value' for denormalization:
CREATE TABLE name_idx (
  name text,
  id text,
  value text,
  PRIMARY KEY (name, id)
);
When a request to update 'value' is sent to the application, it needs to
update both the entity and name_idx tables. Suppose another request to update
'name' is sent at the same time: the application will need to delete the
original row from name_idx and create a new row based on the new name.
However, if the first request reads (it has to retrieve the value of 'name'
in order to update name_idx) before the second request finishes, its update
statement will generate a row in name_idx under the original name, which
leads to inconsistent data. CAS may help here, but when the number of
concurrent requests is large and there are more index tables, CAS could fail
frequently.
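
For illustration, a hedged sketch of that CAS approach using a lightweight
transaction, so the 'value' update is rejected if 'name' changed concurrently:

UPDATE entity SET value = 'new-value'
    WHERE id = 'item-1'
    IF name = 'old-name';
-- If this returns [applied] = false, the name changed underneath us:
-- re-read the row and retry before writing to name_idx.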

Since secondary indexes have limitations in both performance and query
flexibility (no ORDER BY, for example), it would help applications a lot if
Cassandra supported server-maintained index tables on a main table (just as
it maintains secondary indexes).

One possible syntax could be 'CREATE VIEW view_name ON table_name', with the
convention that each column in the view has the same name as in the main
table, so that users can create different views based on their query
requirements.
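
A sketch of how that proposal might look against the example schema; this is
hypothetical syntax for the feature request, not CQL that exists today:

-- Hypothetical: a server-maintained index table on 'entity'.
CREATE VIEW name_idx ON entity
    PRIMARY KEY (name, id);

-- The server would keep name_idx in sync on every write to entity,
-- so the application could query it directly:
SELECT id, value FROM name_idx WHERE name = 'some-name';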

Thanks,

Ziju


Documentation of the WRITETIME function needs an update

2014-09-16 Thread ziju feng
Hi,

I found that the WRITETIME function on a counter column returns the timestamp
in milliseconds instead of microseconds, which is not mentioned in the
documentation
(http://www.datastax.com/documentation/cql/3.1/cql/cql_using/use_writetime.html).
It would be helpful to clarify the difference there.
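
For concreteness, a sketch of the query in question, with hypothetical table
and column names:

SELECT likes, WRITETIME(likes) FROM item_like_counter WHERE item_id = 'item-1';
-- For a regular column, WRITETIME returns microseconds since the epoch
-- (e.g. 1410912000000000); per the observation above, for a counter column
-- the value comes back in milliseconds (e.g. 1410912000000).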

One side question: I denormalize the counter column's value to regular tables
using read-after-write at QUORUM consistency from the counter table, and I
update the regular tables using the counter column's write time to resolve
write conflicts. Is this a valid use case?

Thanks,

Ziju.


Re: Does the default LIMIT apply to automatic paging?

2014-06-25 Thread ziju feng
Thank you all for your answers and clarification.

The reason I mentioned the 10,000-row LIMIT is not only because it is the
default LIMIT in cqlsh, but also because I found it in the CQL documentation
(http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html),
specifically the 'Specifying rows returned using LIMIT' section. Perhaps the
documentation needs some updates to clarify what applies to the drivers and
what applies to cqlsh?


On Wed, Jun 25, 2014 at 12:21 AM, Sylvain Lebresne sylv...@datastax.com
wrote:

 On Tue, Jun 24, 2014 at 1:03 AM, ziju feng pkdog...@gmail.com wrote:


 I was wondering if the default 10,000-row LIMIT applies to automatic
 pagination in C* 2.0 (I'm using the DataStax driver).


 There is no 10,000-row LIMIT in CQL. cqlsh does apply a default LIMIT if
 you don't provide one, for convenience's sake, but it's a cqlsh thing.
 Therefore, there is no default limit with the Java driver (with or without
 automatic pagination).

 --
 Sylvain




Re: Does the default LIMIT apply to automatic paging?

2014-06-24 Thread ziju feng
Does that mean the iterator will give me all the data instead of 10,000 rows?


On Mon, Jun 23, 2014 at 10:20 PM, DuyHai Doan doanduy...@gmail.com wrote:

 With the Java Driver, set the fetchSize and use ResultSet.iterator.


Does the default LIMIT apply to automatic paging?

2014-06-23 Thread ziju feng
Hi All,

I have a wide-row table, and I want to iterate through all rows under a
specific partition key. The table may contain around one million rows per
partition.

I was wondering if the default 10,000-row LIMIT applies to automatic
pagination in C* 2.0 (I'm using the DataStax driver). If so, what is the best
way to retrieve all rows of a given partition? Should I use a super-large
LIMIT value, or should I manually page through the table?

Thanks,

Ziju


Retrieve counter value after update

2014-05-29 Thread ziju feng
Hi All,

I was wondering: is there a planned feature in Cassandra to return the
current counter value after an update statement?

Our project uses counter columns for counting, and since counter columns
cannot reside in the same table as regular columns, we have to denormalize
the counter value as an integer into the other tables that need to display
it.

Our current way of denormalizing is to read the current value and writetime
from the counter table after the update, and then batch-update the other
tables with that value and timestamp (to resolve write conflicts).
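
For illustration, a sketch of that flow with hypothetical names; the
counter's writetime is reused as the batch timestamp so that out-of-order
propagations still converge on the latest count:

UPDATE item_like_counter SET likes = likes + 1 WHERE item_id = 'item-1';

SELECT likes, WRITETIME(likes) FROM item_like_counter WHERE item_id = 'item-1';
-- suppose this returns likes = 42, writetime = 1401234567890123

BEGIN BATCH USING TIMESTAMP 1401234567890123
    UPDATE item SET like_count = 42 WHERE item_id = 'item-1';
    UPDATE item_search SET like_count = 42 WHERE item_id = 'item-1';
APPLY BATCH;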

I don't know if this is a common requirement, but if an update to a counter
table could return the current value and timestamp (or if counter columns
could reside in regular tables in the first place), we could save this extra
read, which would reduce cluster load and update latency.

Thanks,

Ziju


Re: Data modeling for Pinterest-like application

2014-05-17 Thread ziju feng
I was thinking of using the counter type in a separate pin-counter table and,
when I need to update the like count, using read-after-write to get the
current value and timestamp and then denormalizing into the pin's detail
table and the board tables.

Is this a viable solution in this case?

Thanks





Data modeling for Pinterest-like application

2014-05-16 Thread ziju feng
Hello,

I'm working on data modeling for a Pinterest-like project. There are
basically two main concepts, Pin and Board, just like Pinterest: a pin is an
item containing an image, a description, and some other information such as
a like count, and each board contains a sorted list of pins.

A board can be modeled with primary key (board_id, created_at, pin_id),
where created_at sorts the pins of a board by date. The question is whether I
should denormalize pin details into the board table, or just retrieve pins by
page (page size can be 10-20) and then multi-get by pin_id to obtain the
details.
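
For concreteness, a sketch of the two options with hypothetical column names
(created_at as a clustering key gives the date ordering described above):

-- Option 1: denormalized; one read serves a page of a board.
CREATE TABLE board_pins (
    board_id text,
    created_at timestamp,
    pin_id text,
    image_url text,     -- denormalized pin details
    description text,
    like_count int,
    PRIMARY KEY (board_id, created_at, pin_id)
) WITH CLUSTERING ORDER BY (created_at DESC, pin_id ASC);

-- Option 2: normalized; page pin ids from the board, then multi-get:
-- SELECT pin_id FROM board_pins WHERE board_id = 'home' LIMIT 20;
-- SELECT * FROM pin WHERE pin_id IN ('p1', 'p2', 'p3');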

Since some boards are accessed very often (like the home board),
denormalization seems to be a reasonable choice to enhance read performance.
However, we would then have to update not only the pin table but also each
row in the board table that contains the pin whenever a pin is updated, which
can be quite frequent (such as updating the like count). Since a pin may be
contained in many boards (potentially thousands), denormalization seems to
bring a lot of write-side load as well as application-code complexity.

Any suggestions on whether our data model should go the denormalized way, or
the normalized/multi-get way, which would then perhaps need a separate cache
layer for reads?

Thanks,

Ziju


Re: Data modeling for Pinterest-like application

2014-05-16 Thread ziju feng
Thanks for your answer; I really like the frequency-of-update vs.
frequency-of-read way of thinking.

A related question: is it a good idea to denormalize the read-heavy part of
the data while normalizing other, less frequently accessed data?

Our app will have a limited number of system-managed boards that are viewed
by every user, so it makes sense to denormalize and propagate pin updates to
these boards.

We will also have a 'like' board for each user, containing the pins they
like, which is somewhat private and viewed only by the owner.

Since a pin can potentially be liked by thousands of users, if we also
denormalize the like boards, then every time a pin is liked by another user
we would have to update the like count in thousands of like boards.

Does normalization work better in this case, or can Cassandra handle this
kind of write load?





How to guarantee consistency between counter and materialized view?

2014-03-11 Thread ziju feng
Hi all,

Is there any way to guarantee that a counter's value in materialized views
(other column families with different row keys into which the counter's value
is denormalized) stays in sync with the value in its counter column family?

Since a batch can be either a non-counter or a counter-only batch, I can't
update the counter table and the materialized-view tables in one batch. If
the client fails right after updating the counter table and before updating
the materialized views, the data can become inconsistent.
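
For illustration, a sketch of the constraint with hypothetical names; the
counter increment and the denormalized writes must go in separate batches, so
a client crash between them leaves the views stale:

BEGIN COUNTER BATCH
    UPDATE like_counter SET likes = likes + 1 WHERE item_id = 'item-1';
APPLY BATCH;

-- ...a client crash here leaves the views below out of date...

BEGIN BATCH
    UPDATE item_view_a SET like_count = 42 WHERE item_id = 'item-1';
    UPDATE item_view_b SET like_count = 42
        WHERE some_key = 'k' AND item_id = 'item-1';
APPLY BATCH;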

Thanks,

Ziju