[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956701#comment-13956701 ] Benedict commented on CASSANDRA-6477: - New suggestion: Since we're performing read-before-write anyway with this suggestion, why not simply perform a _local only_ read-before-write on each of the nodes that owns the main record whilst writing the update - instead of issuing a complex tombstone, we simply issue a delete for whichever value is older on reconcile. Since we always CAS local updates, we will never get missed deletes, however we will issue redundant/duplicate deletes (RF many) - but they should be coalesced in memtable almost always, so it's a network cost only. There are probably tricks we can do to mitigate this cost, though, e.g. having each node (deterministically) pick two of the possible owners of the 2i entry to send the deletes it encounters to, to minimise replication of effort but also ensure message delivery to all nodes. Result is we keep compaction logic exactly the same, and we retain approximately the same consistency guarantees we currently have. > Partitioned indexes > --- > > Key: CASSANDRA-6477 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6477 > Project: Cassandra > Issue Type: New Feature > Components: API, Core >Reporter: Jonathan Ellis > Fix For: 3.0 > > > Local indexes are suitable for low-cardinality data, where spreading the > index across the cluster is a Good Thing. However, for high-cardinality > data, local indexes require querying most nodes in the cluster even if only a > handful of rows is returned. -- This message was sent by Atlassian JIRA (v6.2#6252)
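The "each node deterministically picks two owners" idea in the comment above can be illustrated with a small sketch. This is only one possible selection rule, assumed here for illustration; the thread does not specify one, and {{pick_targets}} and the node names are hypothetical, not Cassandra APIs.

```python
def pick_targets(base_replicas, index_owners, me):
    """Pick up to two index owners for base replica `me` to send deletes to.

    Assumed rule: sort both replica sets, then take the owner at this
    replica's position and the next one, wrapping around.
    """
    position = sorted(base_replicas).index(me)
    owners = sorted(index_owners)
    first = owners[position % len(owners)]
    second = owners[(position + 1) % len(owners)]
    # dict.fromkeys drops the duplicate when there is only one owner.
    return list(dict.fromkeys([first, second]))


base_replicas = ["a", "b", "c"]      # nodes owning the main record
index_owners = ["x", "y", "z"]       # nodes owning the 2i entry

targets = {r: pick_targets(base_replicas, index_owners, r) for r in base_replicas}
for replica, chosen in targets.items():
    print(replica, "->", chosen)
```

With three replicas on each side, this rule sends six delete messages instead of nine, and every index owner is still reached if any single base replica fails to send.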
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955527#comment-13955527 ]
Jeremiah Jordan commented on CASSANDRA-6477:
[~benedict] Suppose two threads update age = null. Both generate the tombstone {{24, user1->null}}; since the two tombstones carry the same new value, they are OK and not a problem. We also need to generate {{null: user1}} as an append to the index. Then an update to age=25 generates the tombstone {{null, user1->25}} and an update to age=26 generates the tombstone {{null, user1->26}}. Those two tombstones will be resolved on compaction/memtable clash, or when someone queries for age=null. This will require keeping track of null columns in the index. Something similar would need to be done for a full delete of the row.
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955399#comment-13955399 ]
Jonathan Ellis commented on CASSANDRA-6477:
That's why Sylvain said, it's "eventually consistent, but with no good user control about how eventual."
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955412#comment-13955412 ]
Benedict commented on CASSANDRA-6477:
[~jjordan] is that in response to me? Because I don't see how this would work: if both deleted 24 and inserted 25 and 26, then we now have a record of both 25 and 26 mapping to user1, despite only one of them being true, and no means of tidying it up. So people can indefinitely look up on both values. This is only resolved if we look up the original record after every 2i result, which maybe was always the plan. I'm not sure.
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955400#comment-13955400 ]
Jeremiah Jordan commented on CASSANDRA-6477:
If you have the race, you may briefly see the other value, but it's a race, and it would be just as if you had read before update #2 happened. So as long as the period of time where you can get the "wrong" data is small, it is OK.
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955397#comment-13955397 ]
Benedict commented on CASSANDRA-6477:
bq. No, you resolve it in compaction or on lookup of "24".
That only resolves deletes. How do you resolve *seeing the wrong data*?
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955394#comment-13955394 ]
Jeremiah Jordan commented on CASSANDRA-6477:
bq. I may be being dim here, but it seems to me that with this scheme you would need to write a reverse record of 25, user1->replaced 24, so when you lookup on 25, you can then read 24 and check there were no competing updates? Either that or read the original record, which sort of defeats the point of denormalisation...
No, you resolve it in compaction or on lookup of "24". Compaction sees the two different tombstones for 24 and then resolves them to the correct new value, deleting the wrong new value. Or a lookup of "24" pulls in the two tombstones, resolves them to the correct one, deletes the wrong one, and returns none to the user.
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955384#comment-13955384 ]
Jeremiah Jordan commented on CASSANDRA-6477:
bq. I'll note that the idea above has the downside to be only eventually consistent, but with no good user control about how eventual (we're dependent on when read/compaction happen to "heal" the "denormalized index").
I think this might be OK, as this is really only an issue in the case of a race, so both tombstones will end up in memtables and be resolved immediately, or in sstables written near each other in time (which should hopefully compact together fairly quickly). In both cases resolving the conflict *should* happen fairly quickly, though there are probably edge cases.
The issue I see here is that compaction now has to issue queries, and we need to make sure those deletes issued by compaction MUST happen, or else the index will get out of whack, and we will have already thrown out the extra tombstone.
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955358#comment-13955358 ]
Benedict commented on CASSANDRA-6477:
I may be being dim here, but it seems to me that with this scheme you would need to write a reverse record of 25, user1->replaced 24, so when you lookup on 25, you can then read 24 and check there were no competing updates? Either that or read the original record, which sort of defeats the point of denormalisation...
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955333#comment-13955333 ]
Sylvain Lebresne commented on CASSANDRA-6477:
I'll note that the idea above has the downside to be only eventually consistent, but with no good user control about how eventual (we're dependent on when read/compaction happen to "heal" the "denormalized index").
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955262#comment-13955262 ]
Jonathan Ellis commented on CASSANDRA-6477:
This does mean that a tombstone is not "just a tombstone," i.e., we will have to keep all tombstones of this type for gcgs or a similar period, not just "the most recent post-merge tombstone" as currently. But it should be relatively rare to have racing tombstones, so the penalty vs the status quo is not actually large in practice.
/cc [~mstump]
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955260#comment-13955260 ]
Jonathan Ellis commented on CASSANDRA-6477:
Sylvain had a different idea: Instead of just writing a {{24, user1}} tombstone, write a tombstone that indicates what the value changed to: {{24, user1 -> 25}} for one thread, and {{24, user1 -> 26}} for the other. When the tombstones are merged for compaction or read, you can say "wait, 2 people tried to erase that, one with 25 the other with 26, let's check which one has a higher timestamp and delete any obsolete entries."
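A minimal sketch of that merge step, assuming each tombstone records the replaced value, the row key, the new value and a write timestamp. The {{ReplacingTombstone}} type and {{merge}} function are hypothetical names for illustration, not Cassandra classes, and ties between equal timestamps are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplacingTombstone:
    old_value: int   # index value being erased, e.g. 24
    key: str         # base-table row, e.g. "user1"
    new_value: int   # value the writer changed it to
    timestamp: int   # write timestamp of the update


def merge(tombstones):
    """Merge tombstones for the same (old_value, key) index entry.

    Returns the winning tombstone and the index entries made obsolete
    by the losing writers.
    """
    winner = max(tombstones, key=lambda t: t.timestamp)
    obsolete = {
        (t.new_value, t.key)
        for t in tombstones
        if t.new_value != winner.new_value
    }
    return winner, obsolete


# Index state after two racing updates that both read age 24.
index = {(25, "user1"), (26, "user1")}
racing = [
    ReplacingTombstone(24, "user1", 25, timestamp=100),
    ReplacingTombstone(24, "user1", 26, timestamp=101),
]

winner, obsolete = merge(racing)
index -= obsolete
print(winner.new_value, sorted(index))
```

Only the entry written by the higher-timestamp update survives the merge.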
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955258#comment-13955258 ]
Jonathan Ellis commented on CASSANDRA-6477:
bq. The problem is that this means we can't do lazy updates of the index; we need to keep the index perfectly (or, "eventually perfectly") in sync with the base table.
To clarify: Suppose you have an index on the age of users, and we have an entry for {{24: user1}} in the index table. Now two threads update user1's age; one to 25, and one to 26. Each thread will
# Read existing age
# Delete index entry for existing age
# Update user record and insert index entry for new age
The problem is if each thread reads the existing age of 24, then we'll end up with both {{25: user1}} and {{26: user1}} index entries. (Atomic batches do not help with this.)
With normal indexes, we clean up stale entries at compaction + read time; we could still do this here but the performance penalty is a lot higher. Sylvain had another idea.
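The race can be reproduced with a small in-memory sketch. The dictionaries and function names below are hypothetical stand-ins for the base table and the index table, not Cassandra APIs; the interleaving is forced by hand so that both writers finish step 1 before either starts step 2.

```python
# Hypothetical in-memory model of the two tables.
base = {"user1": 24}        # base table: user -> age
index = {(24, "user1")}     # index table: (age, user) entries


def read_age(user):
    """Step 1: read the existing age."""
    return base[user]


def apply_update(user, old_age, new_age):
    """Steps 2 and 3: delete the old index entry, then write the new state."""
    index.discard((old_age, user))
    base[user] = new_age
    index.add((new_age, user))


# Both threads complete step 1 before either reaches step 2.
seen_by_a = read_age("user1")
seen_by_b = read_age("user1")

apply_update("user1", seen_by_a, 25)
apply_update("user1", seen_by_b, 26)

print(sorted(index))  # both the 25 and the 26 entry remain
print(base)           # only one age is actually stored
```

The second writer deletes the {{24}} entry, which is already gone, so nothing ever removes the {{25}} entry.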
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854548#comment-13854548 ]
Jonathan Ellis commented on CASSANDRA-6477:
The counterpoint is that we shouldn't require ~12 client codebases (if done by the driver) or 1000s (if done by app code) to invent this instead of the server.
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854537#comment-13854537 ]
Aleksey Yeschenko commented on CASSANDRA-6477:
For the record, I think we should leave it to people's client code. We don't need more complexity on our read/write paths when this can be done client-side.
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845819#comment-13845819 ]
Jonathan Ellis commented on CASSANDRA-6477:
The most straightforward approach is to take a similar approach to our local indexes:
# At insert/update time, add a new index entry (as part of an atomic batch with the original update), with the timestamp of the data cell
# At read time, fetch the rows indicated by the index and remove stale index entries. Since we delete with the same timestamp as the index entry, this is safe wrt concurrent updates
# We can still use compaction of the base table to clean out stale records, but this will now generate updates or hints to the index partition
The big drawback is that reads require an O(N) multiget in the coordinator: reading the index entries is a single request, but then each row to fetch may be on a different replica.
Put another way, this will give us indexes that are good at very high cardinality -- ideally a single row for each indexed value -- to go with our existing low-cardinality indexes, but we still have a hole for "medium cardinality" data.
[jira] [Commented] (CASSANDRA-6477) Partitioned indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845818#comment-13845818 ]
Jonathan Ellis commented on CASSANDRA-6477:
Most application-maintained indexes solve this problem by denormalizing the base table row into the index entry.
The problem is that this means we can't do lazy updates of the index; we need to keep the index perfectly (or, "eventually perfectly") in sync with the base table. Which in turn means we need to linearize updates to an indexed table. That was a performance hit but otherwise reasonable when we did that for local indexes; for partitioned indexes it's not feasible.
I suppose we could punt and say "we'll give you a denormalized index but you have to swear that only one client will update any given row in that table at a time" which is actually a fairly common use case... but it does seem like the sort of thing that will bite the incautious user. Worse, it will appear to work but give subtly incorrect results.