[jira] [Commented] (CASSANDRA-7582) 2.1 multi-dc upgrade errors

Benedict (JIRA) Mon, 28 Jul 2014 11:15:17 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076509#comment-14076509
 ]


Benedict commented on CASSANDRA-7582:
-------------------------------------

I'm -1000 on encountering an error and silently swallowing it on something as 
core to correctness as the commit log - failing at least gives the user a big 
red flag that they may want to seek expert help. I think there are two distinct 
problems here: the 'unexpected' errors, which should almost certainly involve 
the user seeking help from an expert to diagnose (or perhaps filing a JIRA, 
since it possibly means a bug), and the unknown table exceptions. The latter 
are debatably more ok to ignore, but I would much rather we simply retain 
information about dropped tables, much as we do for truncated tables, so that 
we can suppress errors for those known to have been dropped (with knowledge of 
exactly _when_ they were dropped, so if we see CL records past that time we can 
still fail and ask the user to at least file a bug report). 
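
A minimal sketch of the bookkeeping proposed above, to make the three outcomes 
concrete. Everything here is hypothetical and is not Cassandra's actual API: 
the class name DroppedTableRegistry, the Decision enum, and the idea that each 
CL record carries a comparable timestamp (recordTime) are assumptions made 
purely for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Hypothetical sketch only; names and timestamps are illustrative assumptions. */
public class DroppedTableRegistry {

    public enum Decision { REPLAY, SKIP_DROPPED, FAIL }

    // Tables that exist in the current schema.
    private final Set<UUID> liveTables = new HashSet<>();
    // Tables known to have been dropped, with the time of the drop.
    private final Map<UUID, Long> droppedAt = new HashMap<>();

    public void recordCreate(UUID cfId) {
        liveTables.add(cfId);
    }

    public void recordDrop(UUID cfId, long dropTime) {
        liveTables.remove(cfId);
        droppedAt.put(cfId, dropTime);
    }

    /** Decide what replay should do with a CL record for cfId written at recordTime. */
    public Decision decide(UUID cfId, long recordTime) {
        if (liveTables.contains(cfId)) {
            return Decision.REPLAY;
        }
        Long dropTime = droppedAt.get(cfId);
        if (dropTime == null) {
            // Never heard of this table: unexpected, so fail loudly.
            return Decision.FAIL;
        }
        // Records up to the drop are safe to suppress; records written after
        // the drop suggest a schema/CL mismatch or a bug, so fail.
        return recordTime <= dropTime ? Decision.SKIP_DROPPED : Decision.FAIL;
    }
}
```

The point of the design is that only records provably older than a known drop 
are suppressed; anything else still stops replay and is surfaced to the user.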

Consider the following (pretty plausible) scenario:

* User turns on CL saving
* User creates table X, populates it with some data (let's say it's a fairly 
static dataset) 
* User uses the database for a period, mostly changing other tables
* At time T, user drops table X, recreates it (instead of, e.g., truncating it, 
which is separately also subtly dangerous in this scenario), and repopulates it 
with subtly but, business-wise, importantly different data
* Some time after T, user has to restore the cluster, and restores the schema 
from prior to T by mistake (let's say the team member restoring doesn't realise 
the table was recreated since then), then performs a PIT restore

The user now has no idea they have stale business data in their tables. Now, 
assuming we have saved the ids of all dropped tables we could report to the 
user that they are likely restoring data from a future schema, and they could 
then decide if this was safe or not; in this case they would be able to restore 
a newer schema (assuming they had saved it) and a major business error would 
have been averted.

In general this fail-fast behaviour is likely to result in an increase in JIRA 
filing, possibly for relatively benign bugs, but on the whole I would prefer 
that scenario to leaving subtle bugs in the CL. We've already caught at least 
one as a result of this, and we've had long-standing bugs with respect to drain 
that still affect 2.0 that would have been caught a long time ago with better 
reporting.



> 2.1 multi-dc upgrade errors
> ---------------------------
>
>                 Key: CASSANDRA-7582
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7582
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Ryan McGuire
>            Assignee: Benedict
>            Priority: Critical
>             Fix For: 2.1.1
>
>
> Multi-dc upgrade [was working from 2.0 -> 2.1 fairly recently|http://cassci.datastax.com/job/cassandra_upgrade_dtest/55/testReport/upgrade_through_versions_test/TestUpgrade_from_cassandra_2_0_latest_tag_to_cassandra_2_1_HEAD/], but is currently failing.
> Running upgrade_through_versions_test.py:TestUpgrade_from_cassandra_2_0_HEAD_to_cassandra_2_1_HEAD.bootstrap_multidc_test I get the following errors when starting 2.1 upgraded from 2.0:
> {code}
> ERROR [main] 2014-07-21 23:54:20,862 CommitLog.java:143 - Commit log replay failed due to replaying a mutation for a missing table. This error can be ignored by providing -Dcassandra.commitlog.stop_on_missing_tables=false on the command line
> ERROR [main] 2014-07-21 23:54:20,869 CassandraDaemon.java:474 - Exception encountered during startup
> java.lang.RuntimeException: org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=a1b676f3-0c5d-3276-bfd5-07cf43397004
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300) [main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:457) [main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:546) [main/:na]
> Caused by: org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=a1b676f3-0c5d-3276-bfd5-07cf43397004
>         at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[main/:na]
>         at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[main/:na]
>         at org.apache.cassandra.db.Mutation$MutationSerializer.deserializeOneCf(Mutation.java:353) ~[main/:na]
>         at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:333) ~[main/:na]
>         at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:365) ~[main/:na]
>         at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:98) ~[main/:na]
>         at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:137) ~[main/:na]
>         at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:115) ~[main/:na]
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
