Re: CQL Map vs clustering keys

2017-11-16 Thread eugene miretsky
Thanks!

So, assuming C* 3.0 and that my table stores only one collection, will
using clustering keys be more performant?

Extending this to sets - would doing something like this make sense?

(
  id UUID,
  val text,
  PRIMARY KEY (id, val)
);

SELECT count(*) FROM TABLE WHERE id = 123 AND val = 'test' // Key exists if count != 0
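
For reference, a runnable sketch of that set-as-clustering-rows idea (the table
name tag_set and the UUID literal are purely illustrative, not from this
thread):

CREATE TABLE tag_set (
    id  UUID,
    val text,
    PRIMARY KEY (id, val)
);

-- Membership check: the key exists if any row comes back.
SELECT val FROM tag_set
WHERE id = 123e4567-e89b-12d3-a456-426655440000 AND val = 'test'
LIMIT 1;

-- Count-based check, as in the query above.
SELECT count(*) FROM tag_set
WHERE id = 123e4567-e89b-12d3-a456-426655440000 AND val = 'test';

Both queries touch a single partition and a single clustering row, so either
should work as an existence test.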

On Wed, Nov 15, 2017 at 12:48 PM, Jon Haddad <j...@jonhaddad.com> wrote:

> In 3.0, clustering columns are not actually part of the column name
> anymore.  Yay.  Aaron Morton wrote a detailed analysis of the 3.x storage
> engine here: http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html
>
> The advantage of maps is a single table that can contain a very flexible
> data model, of maps and sets all in the same table.  Fun times.
>
> The advantage of using clustering keys is performance and you can use WAY
> more K/V pairs.
>
> Jon
>
>
> On Nov 15, 2017, at 8:12 AM, eugene miretsky <eugene.miret...@gmail.com>
> wrote:
>
> Hi,
>
> What would be the tradeoffs between using
>
> 1) Map
>
> (
>
> id UUID PRIMARY KEY,
>
> myMap map<int,text>
>
> );
>
> 2) Clustering key
>
> (
>   id UUID,
>   key int,
>   val text,
>   PRIMARY KEY (id, key)
> );
>
> My understanding is that maps are stored very similarly to clustering
> columns, where the map key is part of the SSTable's column name. The main
> difference seems to be that with maps all the key/value pairs get retrieved
> together, while with clustering keys we can retrieve individual rows, or a
> range of keys.
>
> Cheers,
> Eugene
>
>
>


CQL Map vs clustering keys

2017-11-15 Thread eugene miretsky
Hi,

What would be the tradeoffs between using

1) Map

(

id UUID PRIMARY KEY,

myMap map<int,text>

);

2) Clustering key

(
  id UUID,
  key int,
  val text,
  PRIMARY KEY (id, key)
);

My understanding is that maps are stored very similarly to clustering
columns, where the map key is part of the SSTable's column name. The main
difference seems to be that with maps all the key/value pairs get retrieved
together, while with clustering keys we can retrieve individual rows, or a
range of keys.
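
To make the retrieval difference concrete, a sketch (the table names map_table
and ck_table and the UUID are illustrative assumptions matching the two options
above):

-- Option 1: the whole map column comes back with the partition.
SELECT myMap FROM map_table
WHERE id = 123e4567-e89b-12d3-a456-426655440000;

-- Option 2: clustering rows allow a point read or a slice of keys.
SELECT val FROM ck_table
WHERE id = 123e4567-e89b-12d3-a456-426655440000 AND key = 7;

SELECT key, val FROM ck_table
WHERE id = 123e4567-e89b-12d3-a456-426655440000 AND key >= 10 AND key < 20;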

Cheers,
Eugene


Re: How do TTLs generate tombstones

2017-10-31 Thread eugene miretsky
Thanks,

We have turned off read repair, and read with consistency = one. This
leaves repairs and old timestamps (generated by the client) as possible
causes for the overlap. We are writing from Spark, and don't have NTP set
up on the cluster - I think that was causing some of the issues, but we
have fixed it, and the problem remains.

It is hard for me to believe that C* repair has a bug, so before creating a
JIRA, I would appreciate it if you could take a look at the attached
sstablemetadata output from two different points in time over the last 2
weeks (we ran compaction in between).

In both cases, there are sstables generated around 8 pm that span over very
long time periods (sometimes over a day). We run repair daily at 8 pm.

Cheers,
Eugene

On Wed, Oct 11, 2017 at 12:53 PM, Jeff Jirsa <jji...@gmail.com> wrote:

> Anti-entropy repairs ("nodetool repair") and bootstrap/decom/removenode
> should stream sections of (and/or possibly entire) sstables from one
> replica to another. Assuming the original sstable was entirely contained in
> a single time window, the resulting sstable fragment streamed to the
> neighbor node will similarly be entirely contained within a single time
> window, and will be joined with the sstables in that window. If you find
> this isn't the case, open a JIRA, that's a bug (it was explicitly a design
> goal of TWCS, as it was one of my biggest gripes with early versions of
> DTCS).
>
> Read repairs, however, will pollute the memtable and cause overlaps. There
> are two types of read repairs:
> - Blocking read repair due to consistency level (read at quorum, and one
> of the replicas is missing data, the coordinator will issue mutations to
> the missing replica, which will go into the memtable and flush into the
> newest time window). This can not be disabled (period), and is probably the
> reason most people have overlaps (because people tend to read their writes
> pretty quickly after writes in time series use cases, often before hints or
> normal repair can be successful, especially in environments where nodes are
> bounced often).
> - Background read repair (tunable with the read_repair_chance and
> dclocal_read_repair_chance table options), which is like blocking read
> repair, but happens probabilistically (ie: there's a 1% chance on any read
> that the coordinator will scan the partition and copy any missing data to
> the replicas missing that data. Again, this goes to the memtable, and will
> flush into the newest time window).
>
> There's a pretty good argument to be made against manual repairs if (and
> only if) you only use TTLs, never explicitly delete data, and can tolerate
> the business risk of losing two machines at a time (that is: in the very
> very rare case that you somehow lose 2 machines before you can rebuild,
> you'll lose some subset of data that never made it to the sole remaining
> replica; is your business going to lose millions of dollars, or will you
> just have a gap in an analytics dashboard somewhere that nobody's going to
> worry about).
>
> - Jeff
>
>
> On Wed, Oct 11, 2017 at 9:24 AM, Sumanth Pasupuleti <
> spasupul...@netflix.com.invalid> wrote:
>
>> Hi Eugene,
>>
>> Common contributors to overlapping SSTables are
>> 1. Hints
>> 2. Repairs
>> 3. New writes with old timestamps (should be rare but technically
>> possible)
>>
>> I would not run repairs with TWCS - as you indicated, it is going to
>> result in overlapping SSTables which impacts disk space and read latency
>> since reads now have to encompass multiple SSTables.
>>
>> As for https://issues.apache.org/jira/browse/CASSANDRA-13418, I would
>> not worry about data resurrection as long as all the writes carry TTL with
>> them.
>>
>> We faced similar overlapping issues with TWCS (it was due to
>> dclocal_read_repair_chance) - we developed an SSTable tool that would give
>> topN or bottomN keys in an SSTable based on writetime/deletion time - we
>> used this to identify the specific keys responsible for overlap between
>> SSTables.
>>
>> Thanks,
>> Sumanth
>>
>>
>> On Mon, Oct 9, 2017 at 6:36 PM, eugene miretsky <
>> eugene.miret...@gmail.com> wrote:
>>
>>> Thanks Alain!
>>>
>>> We are using TWCS compaction, and I read your blog multiple times - it
>>> was very useful, thanks!
>>>
>>> We are seeing a lot of overlapping SSTables, leading to a lot of
>>> problems: (a) large number of tombstones read in queries, (b) high CPU
>>> usage, (c) fairly long Young Gen GC collection (300ms)
>>>
>>> We have read_repair_chance = 0, and unchecked_tombstone_compaction =
>

Re: How do TTLs generate tombstones

2017-10-09 Thread eugene miretsky
Thanks Alain!

We are using TWCS compaction, and I read your blog multiple times - it was
very useful, thanks!

We are seeing a lot of overlapping SSTables, leading to a lot of problems:
(a) large number of tombstones read in queries, (b) high CPU usage, (c)
fairly long Young Gen GC collection (300ms)

We have read_repair_chance = 0, unchecked_tombstone_compaction = true, and
gc_grace_seconds = 3h, but we read and write with consistency = 1.
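
For context, those table-level settings would be applied roughly like this (a
sketch only; the keyspace/table name ks.events and the 1-hour TWCS window are
assumptions on my part, and the consistency level is set client-side per query,
not here):

ALTER TABLE ks.events
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'HOURS',
        'compaction_window_size': '1',
        'unchecked_tombstone_compaction': 'true'
    }
    AND gc_grace_seconds = 10800          -- 3 hours
    AND read_repair_chance = 0.0
    AND dclocal_read_repair_chance = 0.0;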

I suspect the overlap is coming from either hinted handoff or a repair
job we run nightly.

1) Is running repair with TWCS recommended? It seems like it will always
create a never-ending overlap (the repair SSTable will have data from all 24
hours), an effect that seems to get amplified with anti-compaction.
2) TWCS seems to introduce a tradeoff between eventual consistency and
write/read availability. If all repairs are turned off, then the choice is
either (a) use a strong consistency level, and pay the price of lower
availability and slower reads or writes, or (b) use a lower consistency
level, and risk inconsistent data (data is never repaired).

I will try your last link but reappearing data sounds a bit scary :)

Any advice on how to debug this further would be greatly appreciated.

Cheers,
Eugene

On Fri, Oct 6, 2017 at 11:02 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Hi Eugene,
>
> If we never use updates (time series data), is it safe to set
>> gc_grace_seconds=0.
>
>
> As Kurt pointed out, you never want 'gc_grace_seconds' to be lower than
> 'max_hint_window_in_ms', as the min of these 2 values is used as the hint
> storage window size in Apache Cassandra.
>
> Yet time series data with fixed TTLs allows a very efficient use of
> Cassandra, especially when using Time Window Compaction Strategy (TWCS).
> Fun fact: Jeff brought it to Apache Cassandra :-). I would
> definitely give it a try.
>
> Here is a post from my colleague Alex that I believe could be useful in
> your case: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>
> Using TWCS and lowering 'gc_grace_seconds' to the value of
> 'max_hint_window_in_ms' should be really effective. Make sure to use a
> strong consistency level (generally RF = 3, CL.Read = CL.Write =
> LOCAL_QUORUM) to prevent inconsistencies, depending on how much you care
> about consistency.
>
> This way you could expire entire SSTables without compaction. If
> overlaps in SSTables become a problem, you could even consider giving a
> more aggressive SSTable expiration a try:
> https://issues.apache.org/jira/browse/CASSANDRA-13418
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
> 2017-10-05 23:44 GMT+01:00 kurt greaves <k...@instaclustr.com>:
>
>> No, it's never safe to set it to 0, as you'll disable hinted handoff for
>> the table. If you never do updates or manual deletes and you always
>> insert with a TTL, you can get away with setting it to the hinted handoff
>> period.
>>
>> On 6 Oct. 2017 1:28 am, "eugene miretsky" <eugene.miret...@gmail.com>
>> wrote:
>>
>>> Thanks Jeff,
>>>
>>> Makes sense.
>>> If we never use updates (time series data), is it safe to set
>>> gc_grace_seconds=0?
>>>
>>>
>>>
>>> On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>>
>>>> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
>>>> TTL'd cells, because even though the data is TTL'd, it may have been
>>>> written on top of another live cell that wasn't ttl'd:
>>>>
>>>> Imagine a test table, simple key->value (k, v).
>>>>
>>>> INSERT INTO table(k,v) values(1,1);
>>>> Kill 1 of the 3 nodes
>>>> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
>>>> 60 seconds later, the live nodes will see that data as deleted, but
>>>> when that dead node comes back to life, it needs to learn of the deletion.
>>>>
>>>>
>>>>
>>>> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <
>>>> eugene.miret...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> The following link says that TTLs generate tombstones -
>>>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>>>>
>>>>> What exactly is the process that converts the TTL into a tombstone?
>>>>>
>>>>>1. Is an actual new tombstone cell created when the TTL expires?
>>>>>2. Or, is the TTLed cell treated as a tombstone?
>>>>>
>>>>>
>>>>> Also, does gc_grace_period have an effect on TTLed cells?
>>>>> gc_grace_period is meant to protect from deleted data re-appearing if the
>>>>> tombstone is compacted away before all nodes have reached a consistent
>>>>> state. However, since the ttl is stored in the cell (in liveness_info),
>>>>> there is no way for the cell to re-appear (the ttl will still be there)
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>>
>>>>
>>>
>


DataStax Spark driver performance for analytics workload

2017-10-06 Thread eugene miretsky
Hello,

When doing analytics in Spark, a common pattern is to load either the whole
table into memory or filter on some columns. This is a good pattern for
column-oriented files (Parquet) but seems to be a huge anti-pattern in C*.
Most common Spark operations will result in either (a) a query without a
partition key (full table scan) or (b) a filter on a non-clustering column.
A naive implementation of the above will result in all SSTables being read
from disk multiple times in random order (for different keys), resulting in
horrible cache performance.

Does the DataStax driver do any smart tricks to optimize for this?

Cheers,
Eugene


Re: How do TTLs generate tombstones

2017-10-05 Thread eugene miretsky
Thanks Jeff,

Makes sense.
If we never use updates (time series data), is it safe to set
gc_grace_seconds=0?



On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jji...@gmail.com> wrote:

>
> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
> TTL'd cells, because even though the data is TTL'd, it may have been
> written on top of another live cell that wasn't ttl'd:
>
> Imagine a test table, simple key->value (k, v).
>
> INSERT INTO table(k,v) values(1,1);
> Kill 1 of the 3 nodes
> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
> 60 seconds later, the live nodes will see that data as deleted, but when
> that dead node comes back to life, it needs to learn of the deletion.
>
>
>
> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <eugene.miret...@gmail.com
> > wrote:
>
>> Hello,
>>
>> The following link says that TTLs generate tombstones -
>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>
>> What exactly is the process that converts the TTL into a tombstone?
>>
>>1. Is an actual new tombstone cell created when the TTL expires?
>>2. Or, is the TTLed cell treated as a tombstone?
>>
>>
>> Also, does gc_grace_period have an effect on TTLed cells? gc_grace_period
>> is meant to protect from deleted data re-appearing if the tombstone is
>> compacted away before all nodes have reached a consistent state. However,
>> since the ttl is stored in the cell (in liveness_info), there is no way for
>> the cell to re-appear (the ttl will still be there)
>>
>> Cheers,
>> Eugene
>>
>>
>


How do TTLs generate tombstones

2017-10-04 Thread eugene miretsky
Hello,

The following link says that TTLs generate tombstones -
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.

What exactly is the process that converts the TTL into a tombstone?

   1. Is an actual new tombstone cell created when the TTL expires?
   2. Or, is the TTLed cell treated as a tombstone?


Also, does gc_grace_seconds have an effect on TTLed cells? gc_grace_seconds
is meant to protect from deleted data re-appearing if the tombstone is
compacted away before all nodes have reached a consistent state. However,
since the TTL is stored in the cell (in liveness_info), there is no way for
the cell to re-appear (the TTL will still be there).
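
A small illustration of that bookkeeping (the table name ttl_demo is
hypothetical, just a sketch):

CREATE TABLE ttl_demo (k int PRIMARY KEY, v text);

INSERT INTO ttl_demo (k, v) VALUES (1, 'x') USING TTL 60;

-- The remaining lifetime travels with the cell and can be read back;
-- once it hits zero the cell is treated as expired data.
SELECT v, TTL(v), WRITETIME(v) FROM ttl_demo WHERE k = 1;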

Cheers,
Eugene


What is performance gain of clustering columns

2017-10-03 Thread eugene miretsky
Hi,

Clustering columns are used to order the data in a partition. However,
since data is split into SSTables, the rows are ordered by clustering key
only within each SSTable. Cassandra still needs to check all SSTables, and
merge the data if it is found in several SSTables. The only scenario where
I can imagine a big performance gain is super wide partitions, where each
partition sits within a single SSTable (time series data, where partition
keys are time buckets).
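
As a concrete sketch of that time-bucketed layout (table and column names are
my own illustration, not from any benchmark):

CREATE TABLE events_by_hour (
    bucket  text,       -- e.g. '2017-10-03T14', one partition per hour
    ts      timestamp,
    payload text,
    PRIMARY KEY (bucket, ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- A clustering-key slice within one partition; ideally served from one
-- (or very few) SSTables when the partition is written in a narrow window.
SELECT ts, payload FROM events_by_hour
WHERE bucket = '2017-10-03T14'
  AND ts >= '2017-10-03 14:00:00' AND ts < '2017-10-03 14:05:00';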

Has anybody done benchmarks on that and can share the data model they have
used?

Cheers,
Eugene


Re: Downside to running multiple nodetool repairs at the same time?

2017-04-21 Thread eugene miretsky
The Spotify repo (https://github.com/spotify/cassandra-reaper) seems to not
be maintained anymore. I'm not sure if they even support Cassandra 3.0 (
https://github.com/spotify/cassandra-reaper/issues/140).

Regardless, in Cassandra 3.0 repairs are
1) Incremental, which means that the same SSTables will not be repaired
twice.
2) Parallel, which means that when you call repair, all nodes repair at the
same time.

I suppose that in the worst case, calling repair from X nodes could trigger
X repair processes (each of which will trigger a Merkle tree build on each
node). But I would assume that Cassandra prevents this by making sure that
there is only one repair process running per node.



On Fri, Apr 21, 2017 at 2:43 AM, Oskar Kjellin <oskar.kjel...@gmail.com>
wrote:

> It will create more overhead on your cluster. Consider using something
> like reaper to manage.
>
> > On 21 Apr 2017, at 00:57, eugene miretsky <eugene.miret...@gmail.com>
> wrote:
> >
> > In Cassandra 3.0 the default nodetool repair behaviour is incremental
> and parallel.
> > Is there a downside to triggering repair from multiple nodes at the same
> time?
> >
> > Basically, instead of scheduling a cron job on one node to run repair, I
> want to schedule the job on every node (this way, I don't have to worry
> about repair if that one node goes down). Alternatively, I could build a
> smarter solution for HA repair jobs, but that seems like overkill.
>


Downside to running multiple nodetool repairs at the same time?

2017-04-20 Thread eugene miretsky
In Cassandra 3.0 the default nodetool repair behaviour is incremental and
parallel.
Is there a downside to triggering repair from multiple nodes at the same
time?

Basically, instead of scheduling a cron job on one node to run repair, I
want to schedule the job on every node (this way, I don't have to worry
about repair if that one node goes down). Alternatively, I could build a
smarter solution for HA repair jobs, but that seems like overkill.


Re: Why are automatic anti-entropy repairs required when hinted hand-off is enabled?

2017-04-20 Thread eugene miretsky
Thanks Jayesh,

Watched all of those.

Still not sure I fully get the theory behind it.

Aside from the 2 failure cases I mentioned earlier, the only other way
data can become inconsistent is an error when replicating the data in the
background. Does Cassandra have a retry policy for internal replication? Is
there a setting to change it?





On Thu, Apr 6, 2017 at 10:54 PM, Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> I had asked a similar/related question - on how to carry out repair, etc
> and got some useful pointers.
>
> I would highly recommend the youtube video or the slideshare link below
> (both are for the same presentation).
>
>
>
> https://www.youtube.com/watch?v=1Sz_K8UID6E
>
>
>
> http://www.slideshare.net/DataStax/real-world-repairs-vinay-chella-netflix-cassandra-summit-2016
>
>
>
> https://www.pythian.com/blog/effective-anti-entropy-repair-cassandra/
>
>
>
> https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsRepair.html
>
>
>
> https://www.datastax.com/dev/blog/repair-in-cassandra
>
>
>
>
>
>
>
>
>
> *From: *eugene miretsky <eugene.miret...@gmail.com>
> *Date: *Thursday, April 6, 2017 at 3:35 PM
> *To: *<user@cassandra.apache.org>
> *Subject: *Why are automatic anti-entropy repairs required when hinted
> hand-off is enabled?
>
>
>
> Hi,
>
>
>
> As I see it, if hinted handoff is enabled, the only time data can be
> inconsistent is when:
>
>1. A node is down for longer than the max_hint_window
>2. The coordinator node crushes before all the hints have been replayed
>
> Why is it still recommended to perform frequent automatic repairs, as well
> as enable read repair? Can't I just run a repair after one of the nodes is
> down? The only problem I see with this approach is a long repair job
> (instead of small incremental repairs). But other than that, are there any
> other issues/corner-cases?
>
>
>
> Cheers,
>
> Eugene
>


How to stress test collections in Cassandra Stress

2017-04-13 Thread eugene miretsky
Hi,

I'm trying to do a stress test on a table with a collection column, but
cannot figure out how to do that.

I tried

table_definition: |
  CREATE TABLE list (
customer_id bigint,
items list,
PRIMARY KEY (customer_id));

columnspec:
  - name: customer_id
size: fixed(64)
population: norm(0..40M)
  - name: items
cluster: fixed(40)

When running the benchmark, I get: java.io.IOException: Operation x10 on
key(s) [27056313]: Error executing: (NoSuchElementException)


Why are automatic anti-entropy repairs required when hinted hand-off is enabled?

2017-04-06 Thread eugene miretsky
Hi,

As I see it, if hinted handoff is enabled, the only time data can be
inconsistent is when:

   1. A node is down for longer than the max_hint_window
   2. The coordinator node crashes before all the hints have been replayed

Why is it still recommended to perform frequent automatic repairs, as well
as enable read repair? Can't I just run a repair after one of the nodes is
down? The only problem I see with this approach is a long repair job
(instead of small incremental repairs). But other than that, are there any
other issues/corner-cases?

Cheers,
Eugene


Issues while using TWCS compaction and Bulkloader

2017-03-27 Thread eugene miretsky
Hi,

We have a Cassandra 3.0.8 cluster, and we use the Bulkloader to upload time
series data nightly. The data has a 3-day TTL, and the compaction window
unit is 1h.

Generally the data fits into memory, all reads are served from OS page
cache, and the cluster works fine. However, we had a few unexplained
incidents:

   1. High page fault ratio: This happened once, lasted for 3-4 days, and was
   resolved after we restarted the cluster. We have not been able to reproduce
   it since.
   2. High number of bloom filter false positives: Same as above.

Several questions:

   1. What could have caused the page fault, and/or bloom filter false
   positives?
   2. What's the right strategy for running repairs?
  1. Are repairs even required? We don't generate any tombstones.
   2. The following article suggests that incremental repairs should not
   be used with Date Tiered compaction; does this also apply to TWCS?
  
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesManualRepair.html

Cheers,
Eugene


When is anti-entropy repair required?

2017-03-27 Thread eugene miretsky
Hi,

Trying to get some clarifications on this post:
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesWhen.html

As far as I understand it, repairs account for the fact that nodes can go
down (for short or long periods of time).

The 2 main reasons for repairing are:

   1. To make sure data is consistent
   2. To make sure deleted data doesn't creep back

If I have a time series data model with TWCS compaction, where I never
update rows and hence don't care about either of the above (the whole
SSTable just expires after a few days), do I even need to run repairs?