[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-07-23 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072241#comment-14072241
 ] 

Benedict commented on CASSANDRA-6696:
-

Linking CASSANDRA-7551 to remember to roll it back when we eventually merge 
this, as the prior default memtable_cleanup_threshold of 0.4 should be 
approximately optimal again.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. The same is true for 
> corrupt sstables, where we delete the corrupt sstable and run repair. 
> Here is an example:
> Say we have 3 nodes A, B and C with RF=3 and GC grace=10 days. 
> row=sankalp col=sankalp was written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is older than gc grace, it was purged on nodes A and B 
> when it was compacted together with the actual data. So there is no trace of 
> this row column on nodes A and B.
> Now on node C, say the original data is on drive1 and the tombstone is on 
> drive2. Compaction has not yet reclaimed the data and tombstone. 
> Drive2 becomes corrupt and is replaced with a new empty drive. 
> Due to the replacement, the tombstone is now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now, after replacing the drive, we run repair. This data will be propagated 
> to all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  
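For concreteness, a tiny standalone sketch of the timeline above (plain Java, 
not Cassandra code; the node/drive model is invented purely for illustration). 
It walks the same steps: the tombstone ages past gc_grace and is purged along 
with the data on A and B, node C then loses only the drive holding the 
tombstone, and repair streams the bare cell back to every replica.

{code:java}
// Toy model of the scenario above: RF=3, gc_grace = 10 days, the cell was
// written 20 days ago and the tombstone 15 days ago.
public class TombstoneResurrection {
    static final int GC_GRACE_DAYS = 10;

    public static void main(String[] args) {
        int tombstoneAgeDays = 15;

        // A and B compacted the cell and the tombstone together; because the
        // tombstone is older than gc_grace, both were purged there.
        boolean purgedOnAandB = tombstoneAgeDays > GC_GRACE_DAYS;   // true

        // C still has the cell on drive1 and the tombstone on drive2
        // (they have not been compacted together yet).
        boolean cHasCellOnDrive1 = true;
        boolean cHasTombstoneOnDrive2 = true;

        // drive2 dies and is replaced with an empty disk: the tombstone is gone.
        cHasTombstoneOnDrive2 = false;

        // Repair now sees a live cell on C and nothing on A/B, so the deleted
        // cell is streamed back to all replicas.
        boolean resurrected = purgedOnAandB && cHasCellOnDrive1 && !cHasTombstoneOnDrive2;
        System.out.println("deleted cell resurrected by repair: " + resurrected);
    }
}
{code}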



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-06-12 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029735#comment-14029735
 ] 

Brandon Williams commented on CASSANDRA-6696:
-

No, but I guess it wouldn't be too hard to add if we just advertised a list of 
affected vnodes only.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-06-12 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029724#comment-14029724
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

Do we have per-vnode severity?  We want to blacklist just the affected vnodes.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-06-12 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029121#comment-14029121
 ] 

Brandon Williams commented on CASSANDRA-6696:
-

A simple approach would be for the node to increase its severity when it has 
blacklisted a disk.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-06-11 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028694#comment-14028694
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

Not serving stale data is good, but warning the other nodes when we blacklist a 
disk, so they read those vnodes' data from other replicas, would be even 
better.  New ticket for that?



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-23 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007060#comment-14007060
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


Just pushed a version to 
https://github.com/krummas/cassandra/commits/marcuse/6696-4 - I'll spend some 
more time writing tests, but I figure it is ready for feedback now at least.

* Flush to one sstable per disk (sketched below):
** Split the total range into #disks parts.
** Flush whole vnodes: if a vnode starts on a disk, it stays there. Note though 
that if a vnode wraps around the token space, it will be split in 2 parts and 
end up on different disks.
* SSTables flushed during startup will not get placed correctly since we don't 
yet know the local ranges.
* LeveledCompaction needs to know what ranges we have, so calling startup() on 
the CompactionStrategy has been moved out of the CFS constructor.
* LCS:
** One manifest per vnode, with a global L0.
** L1 now aims to contain one sstable.
** Same priorities as before: first STCS in L0, then compactions in L1+, and 
last L0 -> L1.
** STCS in L0 will create big per-disk files, not per-vnode ones.
* STCS:
** We now have L0 and L1; L1 contains per-vnode sstables, but within a vnode's 
sstables we give no non-overlap guarantees.
** Compactions in L0 only include L0 sstables, and L1 compactions only include 
L1 sstables; all compactions end up as per-vnode sstables in L1.
** When we get 4 sstables of similar size in L0, we will compact those and 
create num_tokens L1 sstables.
** When one L1 vnode gets 4 sstables of similar size, it will compact those 
together.
** L0 -> L1 compactions are prioritized over L1 -> L1 ones (though these will 
run in parallel).
* Introduces originalFirst to keep track of the original first key of the 
sstable; we need this when figuring out which manifest the sstable belongs to 
during replace(..).
* If we get a new ring version (i.e. we gain a new token or lose one), we only 
reinitialize the LeveledManifestWrapper; this means we might have sstables 
that start in one vnode but do not end in it.
* "nodetool rebalancedata" will iterate over all sstables and make sure they 
are in the correct places.
* If a disk breaks or runs out of space, we will flush/compact to the remaining 
disks.
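As a rough illustration of the per-disk flush placement above (a standalone 
sketch, not the actual patch; tokens are simplified to plain longs and the 
helper names are made up): split the token space into #disks contiguous parts 
and route each vnode to the part its start token falls in, so the whole vnode's 
flushed data stays on one disk.

{code:java}
// Standalone illustration of the flush placement rule: tokens are longs in
// [0, TOKEN_SPACE), each vnode is identified by its start token, and the token
// space is split into one contiguous part per data directory.
public class FlushDiskAssignment {
    static final long TOKEN_SPACE = 1_000_000L;

    // A vnode is routed to the disk whose part contains its *start* token, so
    // everything flushed for that vnode lands on a single disk. A vnode that
    // wraps around the end of the token space is still routed by its start
    // token; its tail would land on another disk, as noted above.
    static int diskFor(long vnodeStart, int disks) {
        long partSize = TOKEN_SPACE / disks;
        return (int) Math.min(vnodeStart / partSize, disks - 1);
    }

    public static void main(String[] args) {
        int disks = 3;
        long[] vnodeStarts = {10_000L, 400_000L, 950_000L};
        for (long start : vnodeStarts)
            System.out.println("vnode starting at " + start
                               + " flushes to disk " + diskFor(start, disks));
    }
}
{code}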




[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-05 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989908#comment-13989908
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


bq.  flush to one L0 file per disk
yep, will do that first, then tweak and benchmark if more files are better



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-05 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989634#comment-13989634
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

So maybe, flush to one L0 file per disk?



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-05 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989553#comment-13989553
 ] 

Tupshin Harper commented on CASSANDRA-6696:
---

It does, thanks.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-05 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989549#comment-13989549
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


Not really; note that we take the memtable, split it into X parts and write 
those parts to disk.

For example, say we have 100 vnodes, meaning we have 100 non-intersecting L1 
sstables. If we then flush one file that intersects with all those sstables, we 
would have to include all those files when we compact L0 -> L1. If we instead 
flush to 10 non-intersecting sstables in L0, we can do those L0 -> L1 
compactions independently, but the mutations are recompacted just as many times 
either way.

does that make sense?
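A toy back-of-the-envelope version of that argument, using the numbers from the 
comment (plain Java, nothing Cassandra-specific):

{code:java}
// Toy numbers from the comment: 100 vnodes, i.e. 100 non-intersecting L1 sstables.
public class L0FlushGranularity {
    public static void main(String[] args) {
        int vnodes = 100;
        int flushParts = 10;

        // One L0 file spanning the whole range overlaps every L1 sstable, so a
        // single L0 -> L1 compaction has to include all of them.
        int l1FilesForSingleFlushFile = vnodes;

        // Ten non-intersecting L0 parts each span ~10 vnode ranges, so ten
        // independent (parallelisable) compactions of ~10 L1 files each cover
        // the same data, and each mutation is still recompacted only once.
        int l1FilesPerPart = vnodes / flushParts;

        System.out.println("1 L0 file   -> 1 compaction over " + l1FilesForSingleFlushFile + " L1 sstables");
        System.out.println(flushParts + " L0 parts -> " + flushParts
                           + " compactions over " + l1FilesPerPart + " L1 sstables each");
    }
}
{code}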



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-05 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989543#comment-13989543
 ] 

Tupshin Harper commented on CASSANDRA-6696:
---

I may be misunderstanding, but this seems to be optimizing for compaction 
throughput/parallelization at the expense of doing more total compaction 
activity (number of compactions per mutation over the life of that mutation, a 
form of write amplification) by starting with smaller sstables. 

If that's not the case, then please ignore, but it is important to note that 
for the largest-scale, highest-velocity, longest-retained use cases, it's the 
number of recompactions/write amplification that really hurts.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-05 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989537#comment-13989537
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


bq.  we can gain a lot by not flushing too big L0 files.
If we flush to one big file, we would have to involve all L1 sstables when 
compacting L0 -> L1; if we flush smaller files, we can do more compactions in 
parallel, and we don't have to wait for all ongoing L1 -> L2 compactions to 
finish before starting L0 -> L1.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-05 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989534#comment-13989534
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

bq. major compaction: include all files in compaction, write #vnodes files

Minor note: Doing it per-vnode would mean you don't have to wait for the entire 
dataset to finish before promoting some tmp to finished, and it gets you 
"partial credit" if interrupted.

bq. here we can gain a lot by not flushing too big L0 files.

I'm not sure I follow that point.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-05 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989356#comment-13989356
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


Summing up the discussion:

* one "stripe" is one vnode
* we flush to big files in L0, one file per disk, or perhaps group a bunch of 
vnodes together to increase the number of parallel L0 -> L1 compactions we can 
do

for STCS:
* we introduce L0 for STCS
* when we end up with a given number of overlapping L0 files (4), we compact 
those together and create per-vnode L1 files (sketched below)
* major compaction: include all files in the compaction, write #vnodes files

for LCS:
* we introduce a leveled manifest per vnode
* L0 is "global"
* when doing L0 -> L1 compactions, we end up with one file per involved 
vnode-stripe in L1; here we can gain a lot by not flushing overly big L0 files
* we still do STCS within L0 if we get too much data there, making sure we only 
compact overlapping files

Anything I missed/misunderstood?
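A hedged sketch of the proposed STCS L0 -> L1 step above (standalone Java with 
made-up helper names and toy token values, not the real compaction code): once 
the threshold of overlapping L0 sstables is reached, merge them and write the 
result back out split per vnode.

{code:java}
import java.util.*;

// Sketch of the proposed STCS L0 behaviour: when enough overlapping L0
// sstables pile up, compact them together and write the merged output as one
// file per vnode. Sstables are simplified to lists of tokens (longs).
public class StcsL0ToL1Sketch {
    static final int L0_COMPACTION_THRESHOLD = 4;

    // Merge the candidate L0 sstables and bucket the rows by vnode; each map
    // entry stands for one per-vnode L1 sstable.
    static Map<Integer, List<Long>> compactToPerVnodeFiles(List<List<Long>> l0, long[] vnodeStarts) {
        Map<Integer, List<Long>> l1 = new TreeMap<>();
        for (List<Long> sstable : l0)
            for (long token : sstable)
                l1.computeIfAbsent(vnodeFor(token, vnodeStarts), k -> new ArrayList<>()).add(token);
        return l1;
    }

    // Index of the last vnode whose start token is <= the given token.
    static int vnodeFor(long token, long[] vnodeStarts) {
        int idx = 0;
        for (int i = 0; i < vnodeStarts.length; i++)
            if (vnodeStarts[i] <= token) idx = i;
        return idx;
    }

    public static void main(String[] args) {
        long[] vnodeStarts = {0, 250, 500, 750};            // 4 toy vnode ranges
        List<List<Long>> l0 = List.of(
                List.of(10L, 300L, 820L), List.of(40L, 510L),
                List.of(260L, 770L), List.of(90L, 600L, 999L));
        if (l0.size() >= L0_COMPACTION_THRESHOLD)
            compactToPerVnodeFiles(l0, vnodeStarts)
                .forEach((vnode, rows) -> System.out.println("L1 sstable for vnode " + vnode + ": " + rows));
    }
}
{code}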



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-02 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987869#comment-13987869
 ] 

Benedict commented on CASSANDRA-6696:
-

bq. Right, I assumed we're going to be doing STCS/LCS w/in vnode boundaries. No?

Yes, just checking I wasn't misunderstanding what you were saying. I think this 
sounds roughly in line with what I was suggesting in that case.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-02 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987865#comment-13987865
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

Right, I assumed we're going to be doing STCS/LCS w/in vnode boundaries.  No?



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-02 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987860#comment-13987860
 ] 

Benedict commented on CASSANDRA-6696:
-

bq. That doesn't apply to us

It might apply more with the proliferation of composite keys. I would like to 
see our compaction strategies make more use of this information eventually, and 
these are ordered.

bq. That doesn't apply to us, but the range-based stripes are basically exactly 
the same as our vnodes here.

So you mean to apply it only as far as the vnode boundaries and then switch to 
STCS/LCS?



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-02 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987835#comment-13987835
 ] 

Tupshin Harper commented on CASSANDRA-6696:
---

They are basically splittable and resizable vnodes, if you were to use shuffled 
vnodes with a byte-ordered partitioner. That makes them have more in common 
with CQL partitions than with vnodes, from a "range of data" point of view, 
except that the sizes of the ranges don't vary with the data model like they do 
in Cassandra. 



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-02 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987837#comment-13987837
 ] 

Tupshin Harper commented on CASSANDRA-6696:
---

HBase actually has pluggable compaction strategies these days. 



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-02 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987829#comment-13987829
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

HBase doesn't have compaction strategies per se, but you can think of this as 
an extension of their STCS strategy.  Still, I don't see any reason why it 
can't apply across the board for us.

bq. It's worth noting that their stripes are not based on vnodes, but on the 
distribution of the data present, with merging/splitting as a given range gets 
too small/big.

Sort of.  They have a special case where you can do "size based stripes" for 
workloads where you have mostly-increasing keys.  (Remember that hbase uses an 
ordered partitioner.)  That doesn't apply to us, but the range-based stripes 
are basically exactly the same as our vnodes here.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-02 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987817#comment-13987817
 ] 

Benedict commented on CASSANDRA-6696:
-

I'd note that it appears what they're doing is really a different compaction 
strategy - the approach is not dissimilar to what I'm suggesting here for our 
L0 only, and it may be that we could/should implement it generally, but I think 
the two are slightly orthogonal tasks (since we're no doubt going to be keeping 
LCS around). It's worth noting that their stripes are not based on vnodes, but 
on the distribution of the data present, with merging/splitting as a given 
range gets too small/big.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-02 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987794#comment-13987794
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

It looks like what HBase does (as of fairly recently) is flush to a single 
file, then break it up into sub-regions/stripes/vnodes when compacting 
(multiple L0 files) with L1+.

https://issues.apache.org/jira/secure/attachment/12576005/Stripe%20compactions.pdf

https://issues.apache.org/jira/browse/HBASE-7667




[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-02 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987591#comment-13987591
 ] 

Benedict commented on CASSANDRA-6696:
-

bq. only merge with the individual L1s once the density of the relevant portion 
of L0 is > ~0.5 per vnode

I mean once the amount of data we would flush into the next level would, on 
average, equal 50% of the size limit of the lower level. But that is too high 
(see below).

bq. current default size is 160M

I was reading stale docs that set it at 5MB. Somewhere in between seems 
sensible: 20MB? That way we'd get 1.6GB into 80 files; if we have 768 vnodes 
and we set the ratio for flushing down into the lower level at 0.1, we'd _on 
average_ merge straight into L1, but in reality this would only happen for 
those vnodes with sufficient density, and those without would wait until 
sufficient data density appeared.
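Worked out explicitly, assuming the numbers above (768 vnodes, 20MB L1-sized 
flush files, a 1.6GB flush, and the 0.1 ratio):

{code:java}
// The arithmetic behind the numbers above.
public class FlushSizingMath {
    public static void main(String[] args) {
        double flushMB    = 1600;   // a 1.6 GB memtable flush
        double l0FileMB   = 20;     // proposed L1-sized flush files
        int    vnodes     = 768;
        double mergeRatio = 0.1;    // merge down once avg L0 density per vnode exceeds this

        double filesPerFlush = flushMB / l0FileMB;       // = 80 files
        double mbPerVnode    = flushMB / vnodes;         // ~2.08 MB of new data per vnode
        double density       = mbPerVnode / l0FileMB;    // ~0.104 of an L1-sized file

        System.out.printf("files per flush: %.0f%n", filesPerFlush);
        System.out.printf("avg per vnode: %.2f MB = %.3f of a 20 MB L1 file%n", mbPerVnode, density);
        System.out.println("average density exceeds the 0.1 threshold: " + (density > mergeRatio));
    }
}
{code}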

The only slight complication is what we do if some files end up containing 
enough data to be merged into one L1, while another portion is much too small 
to be efficient to merge down. In that case I'd suggest simply re-merging the 
data that would be inefficient to merge down back out into L0, until it hits 
our merge threshold (or is >= in size to the data already present in L1, if L1 
is not very full). Alternatively we could, for simplicity, always merge as soon 
as the average for any file exceeds our threshold, but I'm not convinced this 
is a great strategy.





[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-01 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987427#comment-13987427
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


bq. flush to L0 we write files equivalent in size to one L1 file, grouping 
however many vnodes fit in the flush
Just checking that I get it: one sstable in L1 is one vnode, the current 
default size is 160M, so we would flush a 1.6G memtable into 10 L0 files?

bq. only merge with the individual L1s once the density of the relevant portion 
of L0 is > ~0.5 per vnode
could you elaborate?



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986845#comment-13986845
 ] 

Benedict commented on CASSANDRA-6696:
-

Or (somewhat handwavy, just to give a basic outline of the idea): we could say 
each vnode has its own LCS hierarchy (this is optimal from a read perspective), 
and perhaps have L1 switch to 1 file in size by default (L2 being 10, etc). 
Then, for our flush to L0, we write files equivalent in size to one L1 file, 
grouping however many vnodes fit in the flush, and only merge with the 
individual L1s once the density of the relevant portion of L0 is > ~0.5 per 
vnode.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-01 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986799#comment-13986799
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


This special-cases compaction a bit though: we could end up with sstables that 
overlap other sstables of similar size but that we can't really compact together 
(and probably shouldn't, since they overlap too little; see CASSANDRA-6474).

For LCS I guess we could align the vnode start/end to the sstable start/end. 
That is, in level 1 (10 sstables) each sstable would contain ~100 vnodes, in 
level 2 (100 sstables) ~10, and in level 3 (1000 sstables) 1 vnode. Then we 
could flush sstables mapping to the sstables in level 1 so that we only compact 
those together.


> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986700#comment-13986700
 ] 

Benedict commented on CASSANDRA-6696:
-

I think we may be able to get a good half-way house by setting a minimum sstable 
size below which we aggregate vnodes into a single sstable, while ensuring we 
always keep a whole vnode in one table (unless that vnode is larger than the 
maximum sstable size, in which case we split it, and it alone). This should be 
cost-free and tend rapidly towards separate sstables per vnode for all but the 
most recent data, which could simply ALL be copied over to any nodes we want to 
duplicate data to, since the overhead would be approximately constant regardless 
of the amount of data the node is managing. We could introduce a tool to split 
out a single token range from those files for users who want to avoid this fixed 
overhead cost.
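
(A minimal sketch of the aggregation rule above, assuming per-vnode sizes are 
already known; the thresholds and names are hypothetical, not the proposed 
implementation.)

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch of the "half-way house": aggregate small vnodes into one sstable until a
// minimum size is reached, keep a whole vnode in one sstable where possible, and
// only split vnodes that exceed the maximum sstable size (and only those).
public class VnodeAggregationSketch {
    static final long MIN_SSTABLE_BYTES = 64L << 20;    // assumed minimum size
    static final long MAX_SSTABLE_BYTES = 1024L << 20;  // assumed maximum size

    // Returns, for each output sstable, the vnode ids it contains. An oversized
    // vnode is emitted on its own so that it alone gets split on write.
    static List<List<Integer>> assign(long[] bytesPerVnode) {
        List<List<Integer>> sstables = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int v = 0; v < bytesPerVnode.length; v++) {
            if (bytesPerVnode[v] > MAX_SSTABLE_BYTES) {
                if (!current.isEmpty()) {
                    sstables.add(current);
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                sstables.add(List.of(v));
                continue;
            }
            current.add(v);
            currentBytes += bytesPerVnode[v];
            if (currentBytes >= MIN_SSTABLE_BYTES) {
                sstables.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
        }
        if (!current.isEmpty()) {
            sstables.add(current);
        }
        return sstables;
    }
}
{code}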

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-05-01 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986688#comment-13986688
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


Pushed a semi-working sstable-per-vnode version here: 
https://github.com/krummas/cassandra/commits/marcuse/6696-3 (by no means 
review-ready). It:

* flushes to vnode-separate sstables, spread out over the available disks
* keeps the sstables separate during compaction: for STCS by grouping the 
compaction buckets by overlapping sstables, and for LCS by keeping a separate 
manifest for every vnode

Still quite broken, but I think good enough to evaluate whether we want to go 
this way. The drawback is mainly that it takes a very long time to flush to 768 
sstables instead of one (768 = num_tokens=256 and rf=3), and doing 768 parallel 
compactions is also quite heavy.

Unless anyone has a brilliant idea for making flushing and compaction less 
heavy, I think we need some sort of balance here: maybe grouping the vnodes (8 
or 16 vnodes per sstable perhaps?) so that we flush a more reasonable amount of 
sstables, or even just going with the per-disk approach?
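
(For a sense of the numbers being discussed: a small illustration of how 
grouping vnode-local ranges per sstable would cut the flush count, using the 
768 figure from the comment and the 8/16 group sizes suggested above.)

{code:java}
// Illustration only: sstables produced per flush if local ranges are written
// g-at-a-time instead of one sstable per range.
public class FlushCountIllustration {
    public static void main(String[] args) {
        int localRanges = 256 * 3;  // num_tokens=256, rf=3 -> 768 ranges
        for (int groupSize : new int[] {1, 8, 16}) {
            int sstablesPerFlush = (localRanges + groupSize - 1) / groupSize;
            System.out.printf("group size %2d -> %d sstables per flush%n",
                              groupSize, sstablesPerFlush);
        }
    }
}
{code}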

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977165#comment-13977165
 ] 

Benedict commented on CASSANDRA-6696:
-

bq. Okay, but I think that's clearly a different ticket. In the meantime, 
sstable-per-vnode has a lot of advantages.

Agreed, it's CASSANDRA-7032 :-)

But I guess what I'm saying is let's hold off knapsacking and rebalancing, as 
that's a lot of added complexity to this ticket, and we can probably fix it 
more easily with CASSANDRA-7032.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977158#comment-13977158
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

Okay, but I think that's clearly a different ticket.  In the meantime, 
sstable-per-vnode has a lot of advantages.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977156#comment-13977156
 ] 

Benedict commented on CASSANDRA-6696:
-

Adjacent N is another way of saying all vnodes assigned to a given real node 
but within a contiguous range of the total token range.

I only care about it because I think it is tractable to create a vnode 
allocation algorithm that fits this bill (I already got a naive approach working 
almost well enough when I hacked around for a couple of hours; I'm sure a much 
better algorithm is within our grasp if we put a bit of thought into it).

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977154#comment-13977154
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

Why do we care about adjacent N?  When new nodes join they will choose random 
tokens.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977125#comment-13977125
 ] 

Benedict commented on CASSANDRA-6696:
-

I may be misunderstanding your proposal: I assume you mean assign the vnodes 
_to each disk_ via knapsack? In which case your balance per disk is based 
solely on the knapsack. If the _cluster wide_ vnode allocation is designed 
specifically to ensure that any given range will maintain the property I gave 
(i.e. that any N will be within some proportion of the ideal ownership 
proportion) then the balance is based on that and will continue to be true no 
matter how many nodes are added to the cluster, whereas you will have to 
re-knapsack each time the ownership range changes.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977118#comment-13977118
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

I think if you draw it out on paper you'll see that "assign vnodes via 
knapsack, then steal some token ranges" and "divide into M token ranges, then 
steal some" work out to about the same imbalance post-bootstrap.  Am I 
misunderstanding what you are proposing?

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977117#comment-13977117
 ] 

Tupshin Harper commented on CASSANDRA-6696:
---

True. I merely mean to say that the problem doesn't get horrible at extreme 
scale. You could also optimize for rapid additions by deferring rebalancing 
until all nodes are added (nodetool disablebalancing, nodetool 
enablebalancing), or some such. Still not arguing for it, though.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977112#comment-13977112
 ] 

Benedict commented on CASSANDRA-6696:
-

bq. if a new node "steals" from a range that intersects disk X but not disk Y, 
you're going to end up with more imbalance post-bootstrap than you had before.

Sure, it will steal an amount, but if the allocation of new vnodes ensures that 
any stealing is distributed equally across the cluster, then while any single 
node addition will cause some imbalance, the total imbalance of the cluster is 
kept bounded throughout an arbitrary number of node additions. So you never get 
perfection, but you're never far from it either. The basic idea is that while 
you cannot easily guarantee the size of any single vnode, you _can_ guarantee 
that if you collect any N _adjacent_ vnodes together, their total owned range is 
within some proportion of the ideal. As N grows, the proximity to perfect 
increases.

bq. there is a finite cap on the amount of work needed to be performed per node 
addition

Sure, but that's a reasonably large cap - for all clusters with fewer than 256 
nodes my statement holds true.
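
(A minimal sketch of the property being claimed, i.e. that the total ownership 
of any N adjacent vnodes stays within a multiplicative factor of the ideal 
N/total share; the tolerance value and the ownership inputs are hypothetical.)

{code:java}
// Sketch: verify that every window of N adjacent vnodes (in ring order) owns a
// share of the token range within a factor `tolerance` of the ideal N/total share.
public class AdjacentVnodeBalanceCheck {
    // ownedFraction[i] = fraction of the total token range owned by vnode i, in
    // ring order; the fractions are assumed to sum to 1.0.
    static boolean withinTolerance(double[] ownedFraction, int n, double tolerance) {
        int total = ownedFraction.length;
        double ideal = (double) n / total;
        for (int start = 0; start < total; start++) {
            double window = 0;
            for (int k = 0; k < n; k++) {
                window += ownedFraction[(start + k) % total]; // wrap around the ring
            }
            if (window < ideal / tolerance || window > ideal * tolerance) {
                return false;
            }
        }
        return true;
    }
}
{code}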



> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977091#comment-13977091
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

But I don't think yours accomplishes that either.  No matter how you allocate 
token ranges across disk, if a new node "steals" from a range that intersects 
disk X but not disk Y, you're going to end up with more imbalance 
post-bootstrap than you had before.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977089#comment-13977089
 ] 

Tupshin Harper commented on CASSANDRA-6696:
---

bq. or we probably need to have a dynamic allocation strategy, and the problem 
with that is that when the token range gets redistributed by node 
additions/removals, the whole cluster suddenly needs to start kicking off 
rebalancing of their local disks.
A node addition will add 256 vnodes to the ring. Unless I misunderstand, this 
will be DC-local resizing of vnodes, and even if the cluster is huge, there will 
still only be 256 (times RF?) different resize operations that have to take 
place in that DC. So there is a finite cap on the amount of work that needs to 
be performed per node addition (and presumably removal), and that cap is bounded 
by vnodes per node, not by cluster size.
If true, then Jonathan's solution feels good enough, since the upper bound is 
reasonably constrained. Not saying I wouldn't prefer doing less overall work, 
though.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977049#comment-13977049
 ] 

Benedict commented on CASSANDRA-6696:
-

You have to rerun your knapsack algorithm after each change of cluster token 
allocation to avoid getting a very skewed distribution across the disks. What 
I'm suggesting is allocating vnodes in a way that is designed to 
deterministically allow groupings that never need rebalancing.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977040#comment-13977040
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

How's that?

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977037#comment-13977037
 ] 

Benedict commented on CASSANDRA-6696:
-

My issue with that is that adding or removing a node becomes an operation 
proportional in size to the number of nodes in the cluster.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977029#comment-13977029
 ] 

Tupshin Harper commented on CASSANDRA-6696:
---

Agreed. Seems quite sufficient for this problem.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977011#comment-13977011
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

To clarify: vnodes are not equal in size but they are proportional to token 
distance, again with the exception of BOP.  So we can easily do a knapsack 
problem across the local disks on first startup.
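
(As a sketch of what a first-startup knapsack across the local disks could look 
like: a simple greedy largest-first packing of vnode ranges onto disks, weighted 
by token distance. This is an illustrative heuristic over assumed inputs, not 
the actual implementation.)

{code:java}
import java.util.Arrays;
import java.util.Comparator;

// Greedy balancing sketch: sort ranges by token-distance weight (largest first)
// and always place the next range on the currently lightest disk.
public class DiskKnapsackSketch {
    // rangeWeights[i] = token distance (ownership share) of vnode range i.
    // Returns assignment[i] = disk index for range i.
    static int[] assign(double[] rangeWeights, int disks) {
        Integer[] order = new Integer[rangeWeights.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> rangeWeights[i]).reversed());

        double[] diskLoad = new double[disks];
        int[] assignment = new int[rangeWeights.length];
        for (int idx : order) {
            int lightest = 0;
            for (int d = 1; d < disks; d++) {
                if (diskLoad[d] < diskLoad[lightest]) lightest = d;
            }
            assignment[idx] = lightest;
            diskLoad[lightest] += rangeWeights[idx];
        }
        return assignment;
    }
}
{code}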

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976751#comment-13976751
 ] 

Tupshin Harper commented on CASSANDRA-6696:
---

bq. Well the assumption is broken then.
Yes, very true, and I've been thinking for a while now that, while we don't 
need a strategy to keep all vnodes the exact same size, we would benefit from a 
background process that gradually splits and combines the largest and smallest 
outliers to have vnodes *tend* to converge on the same size.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976744#comment-13976744
 ] 

Tupshin Harper commented on CASSANDRA-6696:
---

bq. We can add subdirectory-per-vnode if necessary, but aren't modern FS 
capable of dealing with hundreds of thousands of files per directory?

Exactly my thinking.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976724#comment-13976724
 ] 

Benedict commented on CASSANDRA-6696:
-

bq. assuming that every vnode is of roughly equal side is a core part of 
consistent hashing.

Well, the assumption is broken then. I can assure you vnodes are not of equal 
size, especially not with our current allocation strategy, and getting them to 
be of equal size is kind of tough. We may be able to improve that, though.

I'm not sure why what I'm suggesting can't also provide most of these other 
benefits. However, we can bring the two approaches closer by simply saying that 
all vnodes starting within the first 1/DISK portion of the token range are 
allocated to the first disk, and so on - and then they're pretty similar. But 
the unequal size of vnodes means any compaction "tuning" will have limited 
impact, and will probably induce more random IO.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976718#comment-13976718
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

bq. With good OS tuning, I'm not scared of too many sstables

We can add subdirectory-per-vnode if necessary, but aren't modern FS capable of 
dealing with hundreds of thousands of files per directory?

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976714#comment-13976714
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

bq. either we need to ensure that all vnodes are of roughly equal size (very 
difficult), or we probably need to have a dynamic allocation strategy

Why is the first option "very difficult"?  BOP aside (and the consensus was that 
we can continue supporting it because its users are willing to live with its 
limitations), assuming that every vnode is of roughly equal size is a core part 
of consistent hashing.

"M distinct chunks" gives you the worst of both worlds.

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976712#comment-13976712
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

bq. doing per-vnode sstables could enable some nice benefits, like turning off 
the exact vnodes that are affected by a disk failure or a mini auto-repair on 
corrupt sstables perhaps?

CASSANDRA-4784 lists some other benefits, the strongest of which I think are

# on disk failure, we can invalidate the affected vnodes and repair them, 
rather than continuing to serve incomplete data or halting the entire node 
[similar to what you are saying here]
# we can deduplicate ranges for bulk load into another cluster (CASSANDRA-4756)

> Drive replacement in JBOD can cause data to reappear. 
> --
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: sankalp kohli
>Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. Also this is true for 
> corrupt stables in which we delete the corrupt stable and run repair. 
> Here is an example:
> Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
> row=sankalp col=sankalp is written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
> since it got compacted with the actual data. So there is no trace of this row 
> column in node A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and was replaced with new empty drive. 
> Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976688#comment-13976688
 ] 

Benedict commented on CASSANDRA-6696:
-

The problem here is packing vnodes fairly across the disks: either we need to 
ensure that all vnodes are of roughly equal size (very difficult), or we 
probably need a dynamic allocation strategy, and the problem with 
_that_ is that when the token range gets redistributed by node 
additions/removals, the whole cluster suddenly needs to start kicking off 
rebalancing of its local disks.

We could support splitting the total token range into M distinct chunks, where 
M is preferably some multiple of the number of disks, then allocate each chunk 
to a disk in round-robin fashion. This remains deterministic, and I think it is 
easier to guarantee an even distribution within a given token range than it is 
to guarantee all vnodes are of equal size, whilst still supporting a dynamic 
cluster size. Even here, though, realistically I think we need the number of 
chunks to be quite a bit smaller than the number of vnodes to guarantee 
anything approaching balance of these chunks.
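
A minimal sketch of this chunking scheme, with made-up class and method names 
(this is not the patch): it assumes Murmur3-style long tokens covering the full 
long range, splits that range into M chunks where M is a multiple of the disk 
count, and assigns chunks to disks round-robin, so the token-to-disk mapping is 
deterministic and independent of cluster membership.

{code:java}
import java.math.BigInteger;

public class RoundRobinChunks
{
    private final long[] chunkUpperBounds; // inclusive upper bound of each chunk, ascending
    private final int numDisks;

    public RoundRobinChunks(int numDisks, int chunksPerDisk)
    {
        this.numDisks = numDisks;
        int m = numDisks * chunksPerDisk;
        chunkUpperBounds = new long[m];
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger width = BigInteger.valueOf(Long.MAX_VALUE).subtract(min).add(BigInteger.ONE);
        for (int i = 0; i < m; i++)
        {
            // boundary i = min + width * (i + 1) / m - 1, done in BigInteger to avoid overflow
            BigInteger bound = min.add(width.multiply(BigInteger.valueOf(i + 1))
                                            .divide(BigInteger.valueOf(m)))
                                  .subtract(BigInteger.ONE);
            chunkUpperBounds[i] = bound.longValue();
        }
        chunkUpperBounds[m - 1] = Long.MAX_VALUE; // make sure the last chunk closes the range
    }

    /** Deterministic disk index for a token: find its chunk, then round-robin over disks. */
    public int diskForToken(long token)
    {
        for (int i = 0; i < chunkUpperBounds.length; i++)
            if (token <= chunkUpperBounds[i])
                return i % numDisks;
        return numDisks - 1; // unreachable, the last bound is Long.MAX_VALUE
    }

    public static void main(String[] args)
    {
        RoundRobinChunks chunks = new RoundRobinChunks(3, 4); // 3 disks, 12 chunks
        System.out.println(chunks.diskForToken(Long.MIN_VALUE));
        System.out.println(chunks.diskForToken(0));
        System.out.println(chunks.diskForToken(Long.MAX_VALUE));
    }
}
{code}

Adding or removing nodes never changes this mapping; only changing the number 
of disks would.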



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976672#comment-13976672
 ] 

Tupshin Harper commented on CASSANDRA-6696:
---

+1. To the extent that we can do sstables per vnode without introducing other 
performance costs, I am hugely in favor of it. With good OS tuning, I'm not 
scared of too many sstables. If it is a pain for backup, or other things, you 
could have an offline sstable consolidator script that would take a batch of 
sstables and stream them out as a single sstable to a remote location.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-22 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976513#comment-13976513
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


I don't think it would simplify things much (this is quite simple already), but 
doing per-vnode sstables could enable some nice benefits, like turning off the 
exact vnodes that are affected by a disk failure or a mini auto-repair on 
corrupt sstables perhaps?

The drawback I see is that we would end up with very many sstables, making it a 
real pita to do backups etc.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-21 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976394#comment-13976394
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

would it simplify things to make it per-vnode?

thinking we'd also get more compaction benefit that way...  at the expense of 
doing more random-ish io on flush, but (1) this will be mitigated by larger 
off-heap memtables in 2.1 and (2) we could tune compaction vs io by adjusting 
number of vnodes, instead of being stuck w/ the disk count.

/cc [~tupshin]



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-19 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974766#comment-13974766
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


8



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-19 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974764#comment-13974764
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

If I have 256 vnodes and 8 disks, will a flush write 256 sstables or 8?



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-15 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969333#comment-13969333
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


pushed a new version to 
https://github.com/krummas/cassandra/commits/marcuse/6696-3 which:

* adds a nodetool command to rebalance data over disks so that users can do this 
whenever they want (like after manually adding sstables to the data directories)
* removes diskawarewriter from everything but streams and the rebalancing 
command
* makes the flush executor an array of executors
* splits ranges based on the total partitioner range and makes this feature 
vnodes-only
* supports the old way of doing things for non-vnodes setups (and ordered 
partitioners)

there are still some of my config-changes left in, as I bet there will be more 
comments on this



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-13 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968117#comment-13968117
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


In those cases I think it would be better for the user to just create a JBOD 
configuration over those drives



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-13 Thread Jeremy Hanna (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968110#comment-13968110
 ] 

Jeremy Hanna commented on CASSANDRA-6696:
-

Do you mean dropping support for a separate flush directory for JBOD 
configurations or generally?  Wouldn't it still have significant performance 
benefits in non-JBOD environments?



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-10 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965418#comment-13965418
 ] 

Benedict commented on CASSANDRA-6696:
-

+1 on dropping separate flush dir. This is a better solution IMO - get full 
parallelism of the disks available.

bq. do you mean having a background job move data around after upgrade

Yes, I think this would be preferable. Blocking at startup would make a rolling 
upgrade much too painful. If we mark all old sstables as compacting at startup, 
we can safely rewrite them in the background, and not worry about them 
violating our assumptions/constraints, since they're not eligible for regular 
compaction.
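
A rough sketch of that background-migration idea, with entirely hypothetical 
names (nothing here is the real upgrade code): pre-upgrade sstables are marked 
as compacting up front so the regular compaction path skips them, and a 
low-priority executor rewrites them one at a time and clears the mark when done.

{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LegacySSTableMigrator
{
    private final Set<Path> compacting = ConcurrentHashMap.newKeySet();
    private final ExecutorService migrationExecutor = Executors.newSingleThreadExecutor();

    /** Regular compaction would consult this and skip any marked file. */
    public boolean isCompacting(Path sstable)
    {
        return compacting.contains(sstable);
    }

    public void migrateInBackground(Iterable<Path> legacySSTables)
    {
        for (Path sstable : legacySSTables)
        {
            if (!compacting.add(sstable))
                continue; // already claimed by another migration task
            migrationExecutor.submit(() -> {
                try
                {
                    rewriteToPerDiskLayout(sstable); // placeholder for the real rewrite
                }
                finally
                {
                    compacting.remove(sstable);
                }
            });
        }
    }

    private void rewriteToPerDiskLayout(Path sstable)
    {
        // the real patch would split the file along the per-disk token boundaries;
        // this sketch only records the intent
        System.out.println("would rewrite " + sstable + " into per-disk sstables");
    }

    public void shutdown()
    {
        migrationExecutor.shutdown(); // queued migrations still finish
    }

    public static void main(String[] args)
    {
        LegacySSTableMigrator migrator = new LegacySSTableMigrator();
        migrator.migrateInBackground(List.of(Paths.get("/data1/ks-cf-old-1-Data.db")));
        migrator.shutdown();
    }
}
{code}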




[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-10 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965376#comment-13965376
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


Btw, being able to not care about locations while compacting means we can't 
really keep having a separate flush directory, since the data flushed to a 
directory would stay there forever. Wdyt, is it worth keeping flush directories 
and DiskAwareWriter everywhere, or should we drop support for a separate flush 
dir? With flushing being spread out over all disks, the advantages of having a 
separate flush dir are not as big.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-04 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959757#comment-13959757
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


[~benedict] do you mean having a background job move data around after upgrade? 
Or hanging on startup and rewriting everything?

Current version would end up with data on the correct disks eventually with 
compactions, but I agree it would be nice to be able to just care about the 
disks when flushing and streaming. Manually copying sstables into the datadirs 
and calling 'nodetool refresh' would also need some care.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958664#comment-13958664
 ] 

Benedict commented on CASSANDRA-6696:
-

A further suggestion: whilst we know vnodes don't currently distribute 
perfectly, this would be much simpler and more robust if we said that each disk 
simply gets assigned 1/#disks contiguous portion of the total (global) token 
range. This way, once we migrate to the new layout we _never have to worry 
about it again_. As things stand, any addition or removal of a single node, or 
change in RF, triggers a need to rewrite _the entire cluster_. Whilst that does 
ensure even distribution across the disks, it seems like we leave some major 
holes in the protection we're offering, and filling them may be error-prone 
(and certainly costly).

So, my suggestion is that we permit this feature only for vnodes. We can, at 
the same time, perhaps visit the question of more deterministically allocating 
vnode ranges so that the cluster is evenly distributed.

[~kohlisankalp], what do you think?




[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958647#comment-13958647
 ] 

Benedict commented on CASSANDRA-6696:
-

Just a suggestion (not 100% certain it is better, but it seems cleaner to me):

Once this feature is activated by the user, it might be easier to have an 
upgrade period during which sstables are migrated using DiskAwareWriter, but 
after which we know that the constraints hold. This would allow us to mostly 
leave the code unchanged in a few places (e.g. scrubber, compactiontask) which 
are already (prior to this ticket) a little on the complex side. It also seems 
like it would be easier to reason about behaviour in the future if we know 
these constraints are safely imposed, whereas using DiskAwareWriter leaves you 
with the impression we're never quite sure if the files obey our constraints or 
not.

Really it's not a major issue, but worth considering.

One other minor thing (more certain about this one though): perDiskExecutor 
should be an array of executors, one per disk; any configurable parallelism 
then should affect the number of threads each executor is given. Otherwise we 
could get an uneven distribution of work to the disks (especially as we add 
tasks in disk order, so if multiple tasks get queued at once, we'll get 
clumping of tasks by disk, over-utilising some disks and under-utilising the 
others).



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-01 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956657#comment-13956657
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


bq. Why is it partitioner-dependent?
If we own all tokens, we need to know the partitioner's min and max tokens to 
be able to split them over the disks.

javadoc added for splitRanges in the repo above



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956568#comment-13956568
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

Can you add javadoc for splitranges?  Why is it partitioner-dependent?



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-04-01 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956205#comment-13956205
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


pushed a new version to 
https://github.com/krummas/cassandra/commits/marcuse/6696-2

* removed SSTWInterface, instead created a helper class that is reused in most 
places
* multithreaded flush, one thread per disk
* support multiple flush dirs
* sort compaction/flush dirs lexicographically to make sure we always put the 
same tokens on the same disks, even if you rearrange dirs in the config etc 
(see the sketch below)
* avoids compaction loops by making sure we never start STCS compactions with 
any sstables that don't intersect (which the sstables on different disks won't)
* RandomP and Murmur3P supported, the rest will dump data on the first disk for 
now

TODO:
* ask user@ for remove-OPP/BOP feedback, otherwise make them work with JBOD, in 
the old way
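
A tiny illustration of why the lexicographic sort matters (made-up names, not 
the patch): once the directories are sorted, the disk-index-to-directory 
mapping no longer depends on the order they happen to appear in cassandra.yaml.

{code:java}
import java.util.Arrays;

public class DeterministicDataDirectories
{
    private final String[] sortedDirs;

    public DeterministicDataDirectories(String... configuredDirs)
    {
        sortedDirs = configuredDirs.clone();
        Arrays.sort(sortedDirs); // lexicographic, independent of config order
    }

    /** The same disk index always resolves to the same directory, however the config is written. */
    public String directoryForDiskIndex(int diskIndex)
    {
        return sortedDirs[diskIndex];
    }

    public static void main(String[] args)
    {
        DeterministicDataDirectories a = new DeterministicDataDirectories("/data2", "/data1");
        DeterministicDataDirectories b = new DeterministicDataDirectories("/data1", "/data2");
        System.out.println(a.directoryForDiskIndex(0).equals(b.directoryForDiskIndex(0))); // true
    }
}
{code}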



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-03-25 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946436#comment-13946436
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


bq. It seems to me it might also be simpler, once this change is made, to just 
split the range of the memtable and call subMap(lb, ub) and spawn a separate 
flush writer for each range, which might avoid the need for an 
SSTableWriterInterface

Hmm yeah, it might be better to not have the SSTWI and handle that outside to 
get more flexibility, I'll try to do that.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-03-25 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946434#comment-13946434
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


bq. Can we drop BOP/OPP in 3.0?
Hmm, that would be nice. A big PITA would be to rewrite all the unit tests that 
depend on order; created CASSANDRA-6922.




[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-03-24 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945817#comment-13945817
 ] 

Benedict commented on CASSANDRA-6696:
-

bq. Let's state for the record that solving this problem is a non-goal.

Fair enough. Just throwing it out there :)

bq. overall we will do better even with large and fragmented partitions since 
the fragmentation and merging required will be less.

Possibly. It would depend on data distribution and size of ranges. If you had 
large-ish dense ranges per-disk, this shouldn't be a problem. But I think 
either way it's a major complication so at the _very least_ not worth doing 
now, and since the distribution and split size are not tunable, maybe not ever.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-03-24 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945805#comment-13945805
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

bq. So if there are particularly large and fragmented partitions, they could 
see read performance decline notably

Let's state for the record that solving this problem is a non-goal. :)

(A single query might see higher latency, but overall we will do better even 
with large and fragmented partitions since the fragmentation and merging 
required will be less.)



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-03-24 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945631#comment-13945631
 ] 

Benedict commented on CASSANDRA-6696:
-

Last thoughts for the day: only major downside to this approach is that we are 
now guaranteeing no better than single disk performance for all operations on a 
given partition. So if there are particularly large and fragmented partitions, 
they could see read performance decline notably. One possible solution to this 
would be to split by clustering part (if any), instead of by partition key, but 
determine the clustering part range split as a function of the partition hash, 
so that the distribution of data as a whole is still random (i.e. each 
partition has a different clustering distribution across the disks). This would 
make the initial flush more complex, and might require more merging on reads, 
but compaction could still be easily constrained to one disk. This is just a 
poorly formed thought I'm throwing out there for consideration, and possibly 
outside of scope for this ticket.

Either way, I'm not certain that splitting ranges based on disk size is such a 
great idea. As a follow-on ticket it might be sensible to permit two categories 
of disks: archive disks for slow and cold data, and live disks for faster data. 
Splitting by capacity seems likely to create undesirable performance 
characteristics, as two similarly performant disks with different capacities 
would lead to worse performance for the data residing on the larger disks.

On the whole I'm +1 this change anyway, the more I think about it. I had been 
vaguely considering something along these lines to optimise flush performance, 
but it seems we can get this for free along with improving correctness, which 
is great.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-03-24 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945452#comment-13945452
 ] 

Benedict commented on CASSANDRA-6696:
-

Had a quick glance, and have one initial thought: Might be worth forcing 
compaction to always work on one disk (i.e. always select files from one disk 
for compaction). Would simplify it slightly, and it seems likely to be the most 
efficient use of IO, but also, as it stands, you could have a scenario where 
one file each is selected from several different disks, which would result in a 
perpetual compaction loop.
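
A sketch of what restricting compaction to a single disk could look like, using 
stand-in types rather than the real compaction-strategy code (the record needs 
Java 16+): candidates are grouped by the disk they live on before any buckets 
are formed, so a single compaction can never pull in files from two disks and 
ping-pong data between them.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerDiskCompactionCandidates
{
    /** Stand-in for an sstable reference: a name plus the index of the disk it lives on. */
    record SSTable(String name, int diskIndex) {}

    /** Group candidates by disk; each group is then bucketed and compacted independently. */
    static Map<Integer, List<SSTable>> groupByDisk(List<SSTable> candidates)
    {
        Map<Integer, List<SSTable>> byDisk = new HashMap<>();
        for (SSTable sstable : candidates)
            byDisk.computeIfAbsent(sstable.diskIndex(), d -> new ArrayList<>()).add(sstable);
        return byDisk;
    }

    public static void main(String[] args)
    {
        List<SSTable> candidates = List.of(
                new SSTable("ks-cf-1", 0), new SSTable("ks-cf-2", 1), new SSTable("ks-cf-3", 0));
        groupByDisk(candidates).forEach((disk, tables) ->
                System.out.println("disk " + disk + " -> " + tables));
    }
}
{code}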





[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-03-24 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945471#comment-13945471
 ] 

Benedict commented on CASSANDRA-6696:
-

It seems to me it _might_ also be simpler, once this change is made, to just 
split the range of the memtable and call subMap(lb, ub) and spawn a separate 
flush writer for each range, which might avoid the need for an 
SSTableWriterInterface... Might also be a good time to introduce a separate 
flush executor for each disk.
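
A small sketch of the subMap idea, assuming the memtable is (or can be viewed 
as) a sorted map keyed by token; everything here is illustrative rather than 
the real memtable code. Each disk's flush task walks its own subMap slice and 
drives its own writer, so no shared writer interface is needed.

{code:java}
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class MemtableRangeFlush
{
    public static void main(String[] args)
    {
        // token -> serialized partition, standing in for the real memtable contents
        ConcurrentSkipListMap<Long, String> memtable = new ConcurrentSkipListMap<>();
        memtable.put(-5_000_000_000L, "partition-a");
        memtable.put(10L, "partition-b");
        memtable.put(7_000_000_000L, "partition-c");

        int numDisks = 2;
        long[] lowerBounds = { Long.MIN_VALUE, 0L };   // per-disk range starts
        long[] upperBounds = { -1L, Long.MAX_VALUE };  // inclusive per-disk range ends

        for (int disk = 0; disk < numDisks; disk++)
        {
            // each slice is an independent view that a per-disk flush task can
            // iterate in token order while writing one sstable for that disk
            ConcurrentNavigableMap<Long, String> slice =
                    memtable.subMap(lowerBounds[disk], true, upperBounds[disk], true);
            System.out.println("disk " + disk + " flushes " + slice.keySet());
        }
    }
}
{code}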



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-03-24 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945194#comment-13945194
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

Can we drop BOP/OPP in 3.0?



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-03-24 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945173#comment-13945173
 ] 

Marcus Eriksson commented on CASSANDRA-6696:


Been poking at this, wip-patch pushed here: 
https://github.com/krummas/cassandra/commits/marcuse/6696

It does the following:
* Extract an interface out of SSTableWriter (imaginatively called 
SSTableWriterInterface) and start using this interface everywhere
* Create a DiskAwareSSTableWriter which knows about the disk layout and start 
using it instead of the standard SSTW
* Ranges of tokens are assigned to the disks; this way we only need to check 
"is the key we are appending larger than the boundary token for the current 
disk?" If so, create a new SSTableWriter for that disk (see the sketch below)
* Breaks unit tests

Todo:
* Fix unit tests, general cleanups
* I kind of want to name the interface SSTableWriter and call the old SSTW 
class something else, but I guess SSTW is the class that most external people 
depend on, so maybe not
* Take disk size into consideration when splitting the ranges over disks; this 
needs to be deterministic, so we have to use total disk size instead of free 
disk space
* Make partitioners other than M3P work
* Fix the key cache

Rebalancing of data is simply a matter of running upgradesstables or scrub; if 
we lose a disk, we will take writes on the other disks.

Comments on this approach?
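
To make the boundary-token bullet above concrete, here is a hedged, 
stand-alone sketch (not the actual patch): a wrapper keeps one plain writer 
open at a time and, because keys arrive in token order, simply rolls over to 
the next data directory whenever an appended key passes the current disk's 
boundary. Token, PlainWriter and the directory names are invented for the 
example.

{code:java}
import java.util.List;

// Hedged sketch of the "switch writer at the boundary token" idea described
// above. Token, PlainWriter and the directories are stand-ins, not Cassandra's
// actual API.
public class DiskAwareWriterSketch
{
    record Token(long value) implements Comparable<Token>
    {
        public int compareTo(Token o) { return Long.compare(value, o.value); }
    }

    static class PlainWriter
    {
        final String directory;
        PlainWriter(String directory) { this.directory = directory; }
        void append(Token key, byte[] row) { System.out.printf("%s <- %d (%d bytes)%n", directory, key.value(), row.length); }
        void close() { System.out.println("closed " + directory); }
    }

    /** Wraps one plain writer per data directory; keys must arrive in token order. */
    static class DiskAwareWriter
    {
        private final List<String> directories;
        private final List<Token> boundaries; // inclusive upper bound per directory
        private int currentDisk = 0;
        private PlainWriter current;

        DiskAwareWriter(List<String> directories, List<Token> boundaries)
        {
            this.directories = directories;
            this.boundaries = boundaries;
            this.current = new PlainWriter(directories.get(0));
        }

        void append(Token key, byte[] row)
        {
            // If the key is past the current disk's boundary token, close the
            // current writer and open a fresh one on the next data directory.
            while (currentDisk < boundaries.size() - 1 && key.compareTo(boundaries.get(currentDisk)) > 0)
            {
                current.close();
                currentDisk++;
                current = new PlainWriter(directories.get(currentDisk));
            }
            current.append(key, row);
        }

        void close() { current.close(); }
    }

    public static void main(String[] args)
    {
        DiskAwareWriter writer = new DiskAwareWriter(List.of("/data1", "/data2"),
                                                     List.of(new Token(100), new Token(Long.MAX_VALUE)));
        writer.append(new Token(5), new byte[8]);   // lands on /data1
        writer.append(new Token(150), new byte[8]); // crosses the boundary -> /data2
        writer.close();
    }
}
{code}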



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-02-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899519#comment-13899519
 ] 

Benedict commented on CASSANDRA-6696:
-

bq. if you lose "data" then you scrub/repair; if you lose metadata you rebuild 
it from data.

You'd always have to do both with any single disk failure. I agree it isn't 
optimal, but it is cost-free to maintain. Simply writing the metadata out 
redundantly would change it into a more uniform process and make it tolerant 
of more than one failure, but at increased cost; at which point you might as 
well redundantly write out the tombstones too - either as a bloom filter or as 
an extra sstable. The latter could be complicated to maintain cheaply and 
safely, though.




[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-02-12 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899425#comment-13899425
 ] 

Jonathan Ellis commented on CASSANDRA-6696:
---

bq. the whole disk_failure_policy stuff is broken

I would say rather that disk_failure_policy works brilliantly: if you're using 
tombstones, you can set it to stop the server and rebuild it. :)

bq. If we divide each drive with ranges, then we are sure that the data along 
with the tombstone will get blacklisted.

That will probably work well enough as long as the vnode count >> the disk 
count. It would have the added benefit of reducing fragmentation for STCS.

Less than zero interest in trying to add sub-vnode "regions" though.

bq. One possibility here is that we could split the bloom filter and metadata 
onto a separate disk from their data files

Not really a fan; it complicates moving data around significantly without 
generalizing well beyond a single disk failure. Even for single disk failures 
it bifurcates the recovery process: if you lose "data" then you scrub/repair; 
if you lose metadata you rebuild it from data.



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-02-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899398#comment-13899398
 ] 

Benedict commented on CASSANDRA-6696:
-

One possibility here is that we could split the bloom filter and metadata onto 
a separate disk from their data files, so that if/when a disk fails we have 
the option of scrubbing any records on the remaining disks that we think were 
present on the lost disk in a file whose min_timestamp is more than 
gc_grace_seconds ago.

Once we've done the scrub (in fact it could probably be "done" instantly by 
just setting up some filter for compaction + reads until we're fully repaired 
and have compacted the old data) we can start serving reads again, and can 
start a repair from the other nodes to receive data for all of the records 
we're now missing (either through the missing disk or because we're forcefully 
trashing them).
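
Purely to illustrate the kind of filter described above (this is not an 
existing Cassandra API), the sketch below keeps off-disk metadata for the lost 
drive's sstables and suppresses, at read/compaction time, any key that one of 
those old-enough files might have contained, until repair has caught up. 
LostFileMetadata, the bloom-filter stand-in and the thresholds are all 
assumptions.

{code:java}
import java.util.List;
import java.util.Set;

// Hypothetical read/compaction filter: until repair finishes, hide any record
// whose key might have lived in a lost-disk sstable old enough that its
// tombstones could already have been purged. All types are invented stand-ins.
public class LostDiskFilterSketch
{
    record Token(long value) {}

    /** Metadata kept off-disk for an sstable that lived on the failed drive. */
    record LostFileMetadata(long minTimestampMillis, Set<Long> bloomStandIn)
    {
        // A real implementation would consult the sstable's bloom filter; a set
        // of token values stands in for it and has the same "might contain" shape.
        boolean mightContain(Token t) { return bloomStandIn.contains(t.value()); }
    }

    private final List<LostFileMetadata> lostFiles;
    private final long gcGraceMillis;

    LostDiskFilterSketch(List<LostFileMetadata> lostFiles, long gcGraceSeconds)
    {
        this.lostFiles = lostFiles;
        this.gcGraceMillis = gcGraceSeconds * 1000L;
    }

    /** True if reads/compaction should drop this key until repair has completed. */
    boolean shouldSuppress(Token key, long nowMillis)
    {
        for (LostFileMetadata file : lostFiles)
        {
            boolean oldEnoughToHavePurgedTombstones = file.minTimestampMillis() < nowMillis - gcGraceMillis;
            if (oldEnoughToHavePurgedTombstones && file.mightContain(key))
                return true;
        }
        return false;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis();
        LostFileMetadata oldFile = new LostFileMetadata(now - 30L * 24 * 3600 * 1000, Set.of(42L));
        LostDiskFilterSketch filter = new LostDiskFilterSketch(List.of(oldFile), 10 * 24 * 3600);
        System.out.println(filter.shouldSuppress(new Token(42), now)); // true: hidden until repaired
        System.out.println(filter.shouldSuppress(new Token(7), now));  // false: unaffected key
    }
}
{code}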



[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

2014-02-12 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899297#comment-13899297
 ] 

sankalp kohli commented on CASSANDRA-6696:
--

With this, the whole disk_failure_policy stuff is broken. If you blacklist a 
drive, you can potentially bring data back to life. 

One of the fixes for this is an old JIRA of mine, CASSANDRA-4784. 
If we divide each drive with ranges, then we are sure that the data along with 
the tombstone will get blacklisted (see the sketch below). 
Example: say a node is handling the ranges 1-10 and 11-20. We can have drive A 
handle 1-10 and drive B handle 11-20. 
Though this might have problems with load balancing. 
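
As a toy illustration of that 1-10 / 11-20 example, the sketch below pins one 
contiguous token range to each drive, so blacklisting a drive blacklists a 
well-defined range - data and tombstones together. Range, the drive names and 
the lookup methods are invented for the sketch, not part of CASSANDRA-4784.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of per-drive token ranges: data and its tombstones for a range
// live on the same drive, so blacklisting the drive blacklists the whole range.
// Range and the drive names are invented for illustration.
public class RangePerDriveSketch
{
    record Range(long start, long end) // inclusive bounds, as in the 1-10 / 11-20 example
    {
        boolean contains(long token) { return token >= start && token <= end; }
    }

    private final Map<String, Range> driveToRange = new LinkedHashMap<>();

    void assign(String drive, Range range) { driveToRange.put(drive, range); }

    /** Which drive holds (and would blacklist) this token's data and tombstones. */
    String driveForToken(long token)
    {
        for (Map.Entry<String, Range> e : driveToRange.entrySet())
            if (e.getValue().contains(token))
                return e.getKey();
        throw new IllegalArgumentException("token " + token + " not owned by this node");
    }

    /** The exact range that must be treated as lost (and repaired) if a drive dies. */
    Range blacklistedRange(String failedDrive) { return driveToRange.get(failedDrive); }

    public static void main(String[] args)
    {
        RangePerDriveSketch node = new RangePerDriveSketch();
        node.assign("driveA", new Range(1, 10));
        node.assign("driveB", new Range(11, 20));
        System.out.println(node.driveForToken(7));            // driveA
        System.out.println(node.blacklistedRange("driveB"));  // Range[start=11, end=20]
    }
}
{code}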
