I investigated the source code and the logs of the killed node.
My guess is that unexpected writes are applied while the truncation is running.

Some writes were applied after the flush (the first flush) in the truncation,
and those writes could later be read.
They were requested as MUTATION messages by another node for hinted handoff.
Their data was stored in a new memtable and flushed (the second flush) to a
new SSTable before the snapshot in the truncation.
As a result, the truncation discarded only the old SSTables, not the new
SSTable, because the ReplayPosition used for discarding SSTables was the one
recorded by the first flush.
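
To make the ordering concrete, here is a toy, self-contained Java simulation of
what I think happens. This is not the actual Cassandra code; the class and
method names are made up, and only the two commitlog positions are taken from
the log below.

== toy simulation (not Cassandra code) ==
import java.util.ArrayList;
import java.util.List;

// Toy model of the ordering I suspect: SSTables are discarded against the
// ReplayPosition recorded by the FIRST flush, so anything flushed by a later
// flush during the truncation survives.
public class TruncationRaceSketch {
    static long commitlogPosition = 0;               // current commitlog offset
    static List<Long> sstables = new ArrayList<>();  // flush position of each SSTable

    static void write(long bytes) { commitlogPosition += bytes; }

    static long flush() {                            // returns the position this flush covers
        sstables.add(commitlogPosition);
        return commitlogPosition;
    }

    public static void main(String[] args) {
        write(315867);
        long truncatedAt = flush();                  // first flush  -> lb-1 at 315867

        write(486627 - 315867);                      // hinted-handoff MUTATIONs arrive here
        flush();                                     // second flush -> lb-2 at 486627

        // the truncation discards only SSTables flushed at or before truncatedAt
        sstables.removeIf(pos -> pos <= truncatedAt);

        System.out.println("surviving SSTables (flush positions): " + sstables); // [486627]
    }
}
====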

I copied the relevant parts of the log below; the lines starting with "##" are
my comments.
The point is that the ReplayPosition is moved forward by the second flush,
which means some writes were applied after the first flush.

== log ==
## started truncation
TRACE [SharedPool-Worker-16] 2016-11-17 08:36:04,612
ColumnFamilyStore.java:2790 - truncating testtbl
## the first flush, issued at the start of the truncation
DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:04,612
ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 591360 (0%)
on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:1] 2016-11-17 08:36:04,613 Memtable.java:352 -
Writing Memtable-testtbl@1863835308(42.625KiB serialized bytes, 2816 ops,
0%/0% of on/off-heap limit)
...
DEBUG [MemtableFlushWriter:1] 2016-11-17 08:36:04,973 Memtable.java:386 -
Completed flushing /var/lib/cassandra/data/testdb/testtbl-
562848f0a55611e68b1451065d58fdfb/tmp-lb-1-big-Data.db (17.651KiB) for
commitlog position ReplayPosition(segmentId=1479371760395, position=315867)
## this ReplayPosition was used for discarding SSTables
...
TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,022 CommitLog.java:298 -
discard completed log segments for ReplayPosition(segmentId=1479371760395,
position=315867), table 562848f0-a556-11e6-8b14-51065d58fdfb
## end of the first flush
DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028
ColumnFamilyStore.java:2823 - Discarding sstable data for truncated CF +
indexes
## the second flush before snapshot
DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028
ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 698880 (0%)
on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:2] 2016-11-17 08:36:05,029 Memtable.java:352 -
Writing Memtable-testtbl@1186728207(50.375KiB serialized bytes, 3328 ops,
0%/0% of on/off-heap limit)
...
DEBUG [MemtableFlushWriter:2] 2016-11-17 08:36:05,258 Memtable.java:386 -
Completed flushing /var/lib/cassandra/data/testdb/testtbl-
562848f0a55611e68b1451065d58fdfb/tmp-lb-2-big-Data.db (17.696KiB) for
commitlog position ReplayPosition(segmentId=1479371760395, position=486627)
...
TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,289 CommitLog.java:298 -
discard completed log segments for ReplayPosition(segmentId=1479371760395,
position=486627), table 562848f0-a556-11e6-8b14-51065d58fdfb
## end of the second flush: position was moved
...
## only the old SSTable was deleted, because it was older than
ReplayPosition(segmentId=1479371760395, position=315867); the new SSTable survived
TRACE [NonPeriodicTasks:1] 2016-11-17 08:36:05,303 SSTable.java:118 -
Deleted /var/lib/cassandra/data/testdb/testtbl-
562848f0a55611e68b1451065d58fdfb/lb-1-big
...
TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320
ColumnFamilyStore.java:2841 - truncate complete
TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320
TruncateVerbHandler.java:53 - Truncation(keyspace='testdb', cf='testtbl')
applied.  Enqueuing response to 36512@/10.91.145.7
TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320
MessagingService.java:728 - /10.91.145.27 sending REQUEST_RESPONSE to 36512@
/10.91.145.7
## end of truncation
====

Actually, "truncated_at" of the table on the system.local after running the
script was 0x00000158716da30b0004d1db00000158716db524.
It means segmentId=1479371760395, position=315867
truncated_at=1479371765028 (2016-11-17 08:36:05,028)
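
For reference, here is a minimal decoding sketch for that blob, assuming the
layout implied by the values above (an 8-byte segmentId, a 4-byte position, and
an 8-byte truncation timestamp in milliseconds); the class name is mine.

== decoding truncated_at (sketch) ==
import java.nio.ByteBuffer;

public class DecodeTruncatedAt {
    public static void main(String[] args) {
        // the truncated_at blob copied from system.local above
        String hex = "00000158716da30b0004d1db00000158716db524";
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);  // big-endian by default
        long segmentId   = buf.getLong();         // 1479371760395
        int  position    = buf.getInt();          // 315867 (the first flush's position)
        long truncatedAt = buf.getLong();         // 1479371765028 = 2016-11-17 08:36:05,028
        System.out.println("segmentId=" + segmentId
                + " position=" + position + " truncatedAt=" + truncatedAt);
    }
}
====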

thanks,
yuji


On Wed, Nov 16, 2016 at 5:25 PM, Yuji Ito <y...@imagine-orb.com> wrote:

> Hi,
>
> I found stale data after truncating a table.
> It seems that the truncation starts while recovery is still being executed
> just after a node restarts.
> Does the recovery still continue after the truncation finishes?
> Is that expected?
>
> I use C* 2.2.8 and can reproduce it as below.
>
> ==== [create table] ====
> cqlsh $ip -e "drop keyspace testdb;"
> cqlsh $ip -e "CREATE KEYSPACE testdb WITH replication = {'class':
> 'SimpleStrategy', 'replication_factor': '3'};"
> cqlsh $ip -e "CREATE TABLE testdb.testtbl (key int PRIMARY KEY, val int);"
>
> ==== [script] ====
> #!/bin/sh
>
> node1_ip=<node1 IP address>
> node2_ip=<node2 IP address>
> node3_ip=<node3 IP address>
> node3_user=<user name>
> rows=10000
>
> echo "consistency quorum;" > init_data.cql
> for key in $(seq 0 $(expr $rows - 1))
> do
>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT
> EXISTS;" >> init_data.cql
> done
>
> while true
> do
> echo "truncate the table"
> cqlsh $node1_ip -e "truncate table testdb.testtbl"
> if [ $? -ne 0 ]; then
>     echo "truncating failed"
>     continue
> else
>     break
> fi
> done
>
> echo "kill C* process on node3"
> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon |
> awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>
> echo "insert $rows rows"
> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
>
> echo "restart C* process on node3"
> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"
>
> while true
> do
> echo "truncate the table again"
> cqlsh $node1_ip -e "truncate table testdb.testtbl"
> if [ $? -ne 0 ]; then
>     echo "truncating failed"
>     continue
> else
>     break
> fi
> done
>
> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select
> count(*) from testdb.testtbl;"
> sleep 10
> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select
> count(*) from testdb.testtbl;"
>
>
> ==== [result] ====
> truncate the table
> kill C* process on node3
> insert 10000 rows
> restart C* process on node3
> 10.91.145.27: Starting Cassandra: OK
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> Consistency level set to SERIAL.
>
>  count
> -------
>    300
>
> (1 rows)
>
> Warnings :
> Aggregation query used without partition key
>
> Consistency level set to SERIAL.
>
>  count
> -------
>   2304
>
> (1 rows)
>
> Warnings :
> Aggregation query used without partition key
> ====
>
> I found this when I was investigating a data loss problem (see the "failure
> node rejoin" thread).
> I'm not sure whether this problem is related to that data loss.
>
> Thanks,
> yuji
>
