Re: Question about PartitionUpdate.singleRowUpdate()

2018-12-20 Thread Sam Klock
Thanks for the feedback.  I've opened:

https://issues.apache.org/jira/browse/CASSANDRA-14941

SK

On 2018-12-19 17:03, Jeff Jirsa wrote:
> Definitely worth a JIRA. Suspect it may be slow to get a response this
> close to the holidays, but a JIRA will be a bit more durable than the
> mailing list post.
> 
> 
> On Wed, Dec 19, 2018 at 1:58 PM Sam Klock  wrote:
> 
>> Cassandra devs,
>>
>> I have a question about the implementation of
>> PartitionUpdate.singleRowUpdate(), in particular the choice to use
>> EncodingStats.NO_STATS when building the resulting PartitionUpdate.  Is
>> there a functional reason for that -- i.e., is it safe to modify it to
>> use an EncodingStats built from deletionInfo, row, and staticRow?
>>
>> Context: under 3.0.17, we have a table using TWCS and a secondary index.
>> We've been having a problem with the sstables for the index lingering
>> essentially forever, despite the corresponding sstables for the parent
>> table being removed pretty much when we expect them to.  We traced the
>> problem to the use of EncodingStats.NO_STATS in singleRowUpdate(), which
>> is being used to create the index updates when we write to the parent
>> table.  It appears that NO_STATS is making Cassandra think the memtables
>> for the index have data from September 2015 in them, which in turn
>> prevents it from dropping expired sstables (all of which are much more
>> recent than that) for the index.
>>
>> Experimentally, modifying singleRowUpdate() to build an EncodingStats
>> from its inputs (plus the MutableDeletionInfo it creates) seems to fix
>> the problem.  We don't have any insight into why the existing logic uses
>> NO_STATS, however, so we don't know if this change is really safe.  Does
>> it sound like we're on the right track?  (Also: I'm sure we'd be happy
>> to open an issue and submit a patch if this sounds like it would be
>> useful generally.)
>>
>> Thanks,
>> SK
>>
> 




Question about PartitionUpdate.singleRowUpdate()

2018-12-19 Thread Sam Klock
Cassandra devs,

I have a question about the implementation of
PartitionUpdate.singleRowUpdate(), in particular the choice to use
EncodingStats.NO_STATS when building the resulting PartitionUpdate.  Is
there a functional reason for that -- i.e., is it safe to modify it to
use an EncodingStats built from deletionInfo, row, and staticRow?

Context: under 3.0.17, we have a table using TWCS and a secondary index.
We've been having a problem with the sstables for the index lingering
essentially forever, despite the corresponding sstables for the parent
table being removed pretty much when we expect them to.  We traced the
problem to the use of EncodingStats.NO_STATS in singleRowUpdate(), which
is being used to create the index updates when we write to the parent
table.  It appears that NO_STATS is making Cassandra think the memtables
for the index have data from September 2015 in them, which in turn
prevents it from dropping expired sstables (all of which are much more
recent than that) for the index.

Experimentally, modifying singleRowUpdate() to build an EncodingStats
from its inputs (plus the MutableDeletionInfo it creates) seems to fix
the problem.  We don't have any insight into why the existing logic uses
NO_STATS, however, so we don't know if this change is really safe.  Does
it sound like we're on the right track?  (Also: I'm sure we'd be happy
to open an issue and submit a patch if this sounds like it would be
useful generally.)
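
For concreteness, the experiment amounts to something like the following
sketch inside singleRowUpdate() (written from memory; the exact
EncodingStats.Collector entry point and surrounding call may differ slightly
in the 3.0.17 source, and Iterators is Guava's):

    // Sketch only: derive stats from the update's own contents instead of
    // passing EncodingStats.NO_STATS when building the PartitionUpdate.
    EncodingStats stats = EncodingStats.Collector.collect(staticRow,
                                                          Iterators.singletonIterator(row),
                                                          deletionInfo);
    // ...then use `stats` wherever NO_STATS is currently supplied.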

Thanks,
SK




Re: Optimizing queries for partition keys

2018-05-08 Thread Sam Klock
Can someone please take a look at CASSANDRA-14415 when you have a chance?
Getting a fix into a Cassandra release is not especially urgent for us,
but in lieu of that we would like to know whether it's safe to include
in our local build of Cassandra before attempting to deploy it.

Thanks,
SK

On 2018-04-24 14:16, Sam Klock wrote:
> Thanks.  For those interested: opened CASSANDRA-14415.
> 
> SK
> 
> On 2018-04-19 06:04, Benjamin Lerer wrote:
>> Hi Sam,
>>
>> Your finding is interesting. Indeed, if the number of bytes to skip is
>> larger than the remaining bytes in the buffer plus the buffer size, it
>> could be faster to use seek.
>> Feel free to open a JIRA ticket and attach your patch. It would be great
>> if you could also add your table schema to the ticket, as well as some
>> information about your environment (e.g., disk type).
>>
>> On Tue, Apr 17, 2018 at 8:53 PM, Sam Klock <skl...@akamai.com> wrote:
>>
>>> Thanks (and apologies for the delayed response); that was the kind of
>>> feedback we were looking for.
>>>
>>> We backported the fix for CASSANDRA-10657 to 3.0.16, and it partially
>>> addresses our problem in the sense that it does limit the data sent on
>>> the wire.  The performance is still extremely poor, however, due to the
>>> fact that Cassandra continues to read large volumes of data from disk.
>>> (We've also confirmed this behavior in 3.11.2.)
>>>
>>> With a bit more investigation, we now believe the problem (after
>>> CASSANDRA-10657 is applied) is in RebufferingInputStream.skipBytes(),
>>> which appears to read bytes in order to skip them.  The subclass used in
>>> our case, RandomAccessReader, exposes a seek(), so we overrode
>>> skipBytes() in it to make use of seek(), and that seems to resolve the
>>> problem.
>>>
>>> This change is intuitively much safer than the one we'd originally
>>> identified, but we'd still like to confirm with you folks whether it's
>>> likely safe and, if so, whether it's also potentially worth contributing.
>>>
>>> Thanks,
>>> SK
>>>
>>>
>>> On 2018-03-22 18:16, Benjamin Lerer wrote:
>>>
>>>> You should check the 3.x release. CASSANDRA-10657 could have fixed your
>>>> problem.
>>>>
>>>>
>>>> On Thu, Mar 22, 2018 at 9:15 PM, Benjamin Lerer <benjamin.le...@datastax.com> wrote:
>>>>
>>>> Sylvain explained the problem in CASSANDRA-4536:
>>>>> " Let me note that in CQL3 a row that have no live column don't exist, so
>>>>> we can't really implement this with a range slice having an empty columns
>>>>> list. Instead we should do a range slice with a full-row slice predicate
>>>>> with a count of 1, to make sure we do have a live column before including
>>>>> the partition key. "
>>>>>
>>>>> By using ColumnFilter.selectionBuilder() you do not select all the
>>>>> columns. As a consequence, some partitions might be returned when they
>>>>> should not be.
>>>>>
>>>>> On Thu, Mar 22, 2018 at 6:24 PM, Sam Klock <skl...@akamai.com> wrote:
>>>>>
>>>>> Cassandra devs,
>>>>>>
>>>>>> We use workflows in some of our clusters (running 3.0.15) that involve
>>>>>> "SELECT DISTINCT key FROM..."-style queries.  For some tables, we
>>>>>> observed extremely poor performance under light load (i.e., a small
>>>>>> number of rows per second and frequent timeouts), which we eventually
>>>>>> traced to replicas shipping entire rows (which in some cases could store
>>>>>> on the order of MBs of data) to service the query.  That surprised us
>>>>>> (partly because 2.1 doesn't seem to behave this way), so we did some
>>>>>> digging, and we eventually came up with a patch that modifies
>>>>>> SelectStatement.java in the following way: if the selection in the query
>>>>>> only includes the partition key, then when building a ColumnFilter for
>>>>>> the query, use:
>>>>>>
>>>>>>  builder = ColumnFilter.selectionBuilder();
>>>>>>
>>>>>> instead of:
>>>>>>
>>>>>>  builder = ColumnFilter.allColumnsBuilder();
>>>>>>
>>>>>> to initialize the ColumnFilter.Builder in gatherQueriedColumns().

Re: Optimizing queries for partition keys

2018-04-24 Thread Sam Klock
Thanks.  For those interested: opened CASSANDRA-14415.

SK

On 2018-04-19 06:04, Benjamin Lerer wrote:
> Hi Sam,
> 
> Your finding is interesting. Indeed, if the number of bytes to skip is
> larger than the remaining bytes in the buffer plus the buffer size, it
> could be faster to use seek.
> Feel free to open a JIRA ticket and attach your patch. It would be great
> if you could also add your table schema to the ticket, as well as some
> information about your environment (e.g., disk type).
> 
> On Tue, Apr 17, 2018 at 8:53 PM, Sam Klock <skl...@akamai.com> wrote:
> 
>> Thanks (and apologies for the delayed response); that was the kind of
>> feedback we were looking for.
>>
>> We backported the fix for CASSANDRA-10657 to 3.0.16, and it partially
>> addresses our problem in the sense that it does limit the data sent on
>> the wire.  The performance is still extremely poor, however, due to the
>> fact that Cassandra continues to read large volumes of data from disk.
>> (We've also confirmed this behavior in 3.11.2.)
>>
>> With a bit more investigation, we now believe the problem (after
>> CASSANDRA-10657 is applied) is in RebufferingInputStream.skipBytes(),
>> which appears to read bytes in order to skip them.  The subclass used in
>> our case, RandomAccessReader, exposes a seek(), so we overrode
>> skipBytes() in it to make use of seek(), and that seems to resolve the
>> problem.
>>
>> This change is intuitively much safer than the one we'd originally
>> identified, but we'd still like to confirm with you folks whether it's
>> likely safe and, if so, whether it's also potentially worth contributing.
>>
>> Thanks,
>> SK
>>
>>
>> On 2018-03-22 18:16, Benjamin Lerer wrote:
>>
>>> You should check the 3.x release. CASSANDRA-10657 could have fixed your
>>> problem.
>>>
>>>
>>> On Thu, Mar 22, 2018 at 9:15 PM, Benjamin Lerer <benjamin.le...@datastax.com> wrote:
>>>
>>> Sylvain explained the problem in CASSANDRA-4536:
>>>> " Let me note that in CQL3 a row that have no live column don't exist, so
>>>> we can't really implement this with a range slice having an empty columns
>>>> list. Instead we should do a range slice with a full-row slice predicate
>>>> with a count of 1, to make sure we do have a live column before including
>>>> the partition key. "
>>>>
>>>> By using ColumnFilter.selectionBuilder() you do not select all the
>>>> columns. As a consequence, some partitions might be returned when they
>>>> should not be.
>>>>
>>>> On Thu, Mar 22, 2018 at 6:24 PM, Sam Klock <skl...@akamai.com> wrote:
>>>>
>>>> Cassandra devs,
>>>>>
>>>>> We use workflows in some of our clusters (running 3.0.15) that involve
>>>>> "SELECT DISTINCT key FROM..."-style queries.  For some tables, we
>>>>> observed extremely poor performance under light load (i.e., a small
>>>>> number of rows per second and frequent timeouts), which we eventually
>>>>> traced to replicas shipping entire rows (which in some cases could store
>>>>> on the order of MBs of data) to service the query.  That surprised us
>>>>> (partly because 2.1 doesn't seem to behave this way), so we did some
>>>>> digging, and we eventually came up with a patch that modifies
>>>>> SelectStatement.java in the following way: if the selection in the query
>>>>> only includes the partition key, then when building a ColumnFilter for
>>>>> the query, use:
>>>>>
>>>>>  builder = ColumnFilter.selectionBuilder();
>>>>>
>>>>> instead of:
>>>>>
>>>>>  builder = ColumnFilter.allColumnsBuilder();
>>>>>
>>>>> to initialize the ColumnFilter.Builder in gatherQueriedColumns().  That
>>>>> seems to repair the performance regression, and it doesn't appear to
>>>>> break any functionality (based on the unit tests and some smoke tests we
>>>>> ran involving insertions and deletions).
>>>>>
>>>>> We'd like to contribute this patch back to the project, but we're not
>>>>> convinced that there aren't subtle correctness issues we're missing,
>>>>> judging both from comments in the code and the existence of
>>>>> CASSANDRA-5912, which suggests optimizing this kind of query is
>>>>&

Re: Optimizing queries for partition keys

2018-04-17 Thread Sam Klock

Thanks (and apologies for the delayed response); that was the kind of
feedback we were looking for.

We backported the fix for CASSANDRA-10657 to 3.0.16, and it partially
addresses our problem in the sense that it does limit the data sent on
the wire.  The performance is still extremely poor, however, due to the
fact that Cassandra continues to read large volumes of data from disk.
(We've also confirmed this behavior in 3.11.2.)

With a bit more investigation, we now believe the problem (after
CASSANDRA-10657 is applied) is in RebufferingInputStream.skipBytes(),
which appears to read bytes in order to skip them.  The subclass used in
our case, RandomAccessReader, exposes a seek(), so we overrode
skipBytes() in it to make use of seek(), and that seems to resolve the
problem.
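
Concretely, the override we tried is roughly the following (a sketch, not the
exact patch; the getFilePointer()/length() accessors and the bounds handling
are assumptions on our part):

    // In RandomAccessReader: skip by seeking instead of reading and
    // discarding buffered bytes.
    @Override
    public int skipBytes(int n) throws IOException
    {
        if (n <= 0)
            return 0;
        long current = getFilePointer();                 // assumed position accessor
        long target = Math.min(current + n, length());   // don't seek past the end
        seek(target);
        return (int) (target - current);
    }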

This change is intuitively much safer than the one we'd originally
identified, but we'd still like to confirm with you folks whether it's
likely safe and, if so, whether it's also potentially worth contributing.

Thanks,
SK

On 2018-03-22 18:16, Benjamin Lerer wrote:

You should check the 3.x release. CASSANDRA-10657 could have fixed your
problem.


On Thu, Mar 22, 2018 at 9:15 PM, Benjamin Lerer <benjamin.le...@datastax.com> wrote:



Sylvain explained the problem in CASSANDRA-4536:
" Let me note that in CQL3 a row that have no live column don't exist, so
we can't really implement this with a range slice having an empty columns
list. Instead we should do a range slice with a full-row slice predicate
with a count of 1, to make sure we do have a live column before including
the partition key. "

By using ColumnFilter.selectionBuilder() you do not select all the
columns. As a consequence, some partitions might be returned when they
should not be.

On Thu, Mar 22, 2018 at 6:24 PM, Sam Klock <skl...@akamai.com> wrote:


Cassandra devs,

We use workflows in some of our clusters (running 3.0.15) that involve
"SELECT DISTINCT key FROM..."-style queries.  For some tables, we
observed extremely poor performance under light load (i.e., a small
number of rows per second and frequent timeouts), which we eventually
traced to replicas shipping entire rows (which in some cases could store
on the order of MBs of data) to service the query.  That surprised us
(partly because 2.1 doesn't seem to behave this way), so we did some
digging, and we eventually came up with a patch that modifies
SelectStatement.java in the following way: if the selection in the query
only includes the partition key, then when building a ColumnFilter for
the query, use:

 builder = ColumnFilter.selectionBuilder();

instead of:

 builder = ColumnFilter.allColumnsBuilder();

to initialize the ColumnFilter.Builder in gatherQueriedColumns().  That
seems to repair the performance regression, and it doesn't appear to
break any functionality (based on the unit tests and some smoke tests we
ran involving insertions and deletions).

We'd like to contribute this patch back to the project, but we're not
convinced that there aren't subtle correctness issues we're missing,
judging both from comments in the code and the existence of
CASSANDRA-5912, which suggests optimizing this kind of query is
nontrivial.

So: does this change sound safe to make, or are there corner cases we
need to account for?  If there are corner cases, are there plausibly
ways of addressing them at the SelectStatement level, or will we need to
look deeper?

Thanks,
SK












Optimizing queries for partition keys

2018-03-22 Thread Sam Klock
Cassandra devs,

We use workflows in some of our clusters (running 3.0.15) that involve
"SELECT DISTINCT key FROM..."-style queries.  For some tables, we
observed extremely poor performance under light load (i.e., a small
number of rows per second and frequent timeouts), which we eventually
traced to replicas shipping entire rows (which in some cases could store
on the order of MBs of data) to service the query.  That surprised us
(partly because 2.1 doesn't seem to behave this way), so we did some
digging, and we eventually came up with a patch that modifies
SelectStatement.java in the following way: if the selection in the query
only includes the partition key, then when building a ColumnFilter for
the query, use:

builder = ColumnFilter.selectionBuilder();

instead of:

builder = ColumnFilter.allColumnsBuilder();

to initialize the ColumnFilter.Builder in gatherQueriedColumns().  That
seems to repair the performance regression, and it doesn't appear to
break any functionality (based on the unit tests and some smoke tests we
ran involving insertions and deletions).
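
In other words, the shape of the change in gatherQueriedColumns() is roughly
the following (a sketch; selectsOnlyPartitionKey() is hypothetical shorthand
for the check we perform, and builder arguments are elided as in the snippets
above):

    ColumnFilter.Builder builder = selectsOnlyPartitionKey(selection)
                                 ? ColumnFilter.selectionBuilder()    // fetch only what was selected
                                 : ColumnFilter.allColumnsBuilder();  // existing behavior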

We'd like to contribute this patch back to the project, but we're not
convinced that there aren't subtle correctness issues we're missing,
judging both from comments in the code and the existence of
CASSANDRA-5912, which suggests optimizing this kind of query is nontrivial.

So: does this change sound safe to make, or are there corner cases we
need to account for?  If there are corner cases, are there plausibly
ways of addressing them at the SelectStatement level, or will we need to
look deeper?

Thanks,
SK




Re: Reconciling expiring cells and tombstones

2015-06-18 Thread Sam Klock
Thanks, Tyler.  Sounds like the risk of making this change is tolerable
for us.

Josef: thanks for the links.  We're thinking carefully about our NTP
configuration, but we also would like to account for unusual failure
modes (e.g., nodes with fast clocks experiencing partitions from NTP
servers/peers).  Making sure Cassandra's internals are arranged to
minimize the damage when these things happen is part of our strategy.

Thanks,
SK

On 06/17/2015 11:17 AM, Tyler Hobbs wrote:

> Why does Cassandra consistently prefer tombstones to other kinds of cells?

 
 It's primarily to have deterministic conflict resolution.  I don't recall
 any specific conversations about preferring tombstones over expiring cells,
 and the original ticket (https://issues.apache.org/jira/browse/CASSANDRA-699)
 doesn't mention anything.
 
> By modifying this behavior in this particular case, do we risk hitting
> bizarre corner cases?

 
 I think this would be safe.  The only problem I can think of would happen
 while your cluster has some patched nodes and some unpatched nodes.  If you
 had any nodes that had both an expiring cell and a tombstone with the same
 timestamp, then patched replicas would return different results/digests
 than the unpatched nodes.  However, unless you're mixing TTLs with deletes,
 that's not too likely to happen.  Maybe repair combined with clock skew
 could result in that, but not much else.
 
 
 On Wed, Jun 17, 2015 at 10:05 AM, Josef Lindman Hörnlund <jo...@appdata.biz> wrote:
 

 Hello Sam,

 This is not answering your direct question but if you worry about clock
 skew take a look at this great two-part blogpost:


 https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/

 https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/



 Josef Lindman Hörnlund
 Chief Data Scientist
 AppData
 jo...@appdata.biz




 On 16 Jun 2015, at 20:45, Sam Klock skl...@akamai.com wrote:

 Hi folks,

 I have a question about a design choice on how expiring cells are
 reconciled with tombstones.  For two cells with the same timestamp, if
 one is expiring and one is a tombstone, Cassandra *always* prefers the
 tombstone.  This matches its behavior for normal/non-expiring cells, but
 the folks in my organization worry about what it may imply for nodes
 experiencing clock skew.  Specifically, we're concerned about scenarios
 like the following:

 1) An expiring cell is committed via some node with a non-skewed clock.
 2) Another replica for that cell experiences forward clock skew and
 decides that the cell is expired.  It eventually runs a compaction that
 converts the cell to a tombstone.
 3) The tombstone propagates to other nodes via, e.g., node repair.
 4) The other nodes all eventually run their own compactions.  Because of
 the reconciliation logic, the expiring cell is purged on all of the
 replicas, leaving behind only the tombstone.

 If the cell should have still been live at (4), the reconciliation logic
 will result in it being prematurely purged.  We have confirmed this
 behavior experimentally.

 My organization may be more concerned about clock skew than the larger
 community, so I don't think we're inclined to propose a patch at this
 time.  But to account for this kind of scenario we would like to patch
 our internal version of Cassandra to conditionally prefer expiring cells
 to tombstones if the node believes they should still be live; i.e., in
 reconcile() in *ExpiringCell.java, instead of:

if (cell instanceof DeletedCell)
return cell;

 use:

if (cell instanceof DeletedCell)
return isLive() ? this : cell;

 Before we do so, however, we'd like to understand the rationale for the
 existing behavior and the risks of making changes to it.  Why does
 Cassandra consistently prefer tombstones to other kinds of cells?  By
 modifying this behavior in this particular case, do we risk hitting
 bizarre corner cases?

 Thanks,
 SK


 
 


Reconciling expiring cells and tombstones

2015-06-16 Thread Sam Klock
Hi folks,

I have a question about a design choice on how expiring cells are
reconciled with tombstones.  For two cells with the same timestamp, if
one is expiring and one is a tombstone, Cassandra *always* prefers the
tombstone.  This matches its behavior for normal/non-expiring cells, but
the folks in my organization worry about what it may imply for nodes
experiencing clock skew.  Specifically, we're concerned about scenarios
like the following:

1) An expiring cell is committed via some node with a non-skewed clock.
2) Another replica for that cell experiences forward clock skew and
decides that the cell is expired.  It eventually runs a compaction that
converts the cell to a tombstone.
3) The tombstone propagates to other nodes via, e.g., node repair.
4) The other nodes all eventually run their own compactions.  Because of
the reconciliation logic, the expiring cell is purged on all of the
replicas, leaving behind only the tombstone.

If the cell should have still been live at (4), the reconciliation logic
will result in it being prematurely purged.  We have confirmed this
behavior experimentally.

My organization may be more concerned about clock skew than the larger
community, so I don't think we're inclined to propose a patch at this
time.  But to account for this kind of scenario we would like to patch
our internal version of Cassandra to conditionally prefer expiring cells
to tombstones if the node believes they should still be live; i.e., in
reconcile() in *ExpiringCell.java, instead of:

if (cell instanceof DeletedCell)
return cell;

use:

if (cell instanceof DeletedCell)
return isLive() ? this : cell;
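
For readers less familiar with the 2.x cell hierarchy, the semantics of the
proposed change can be illustrated with the following standalone sketch
(hypothetical, simplified method; nothing below is actual Cassandra code --
the real change is the one-line edit to reconcile() shown above):

    // Given an expiring cell and a tombstone with the SAME timestamp, decide
    // whether the expiring cell survives reconciliation on this node.
    static boolean expiringCellWins(int expirationTimeInSec, int nowInSec, boolean patched)
    {
        boolean stillLive = nowInSec < expirationTimeInSec;
        // Stock behavior: the tombstone always wins on a timestamp tie.
        // Patched behavior: the expiring cell wins while this node still
        // considers it live.
        return patched && stillLive;
    }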

Before we do so, however, we'd like to understand the rationale for the
existing behavior and the risks of making changes to it.  Why does
Cassandra consistently prefer tombstones to other kinds of cells?  By
modifying this behavior in this particular case, do we risk hitting
bizarre corner cases?

Thanks,
SK