[Cassandra Wiki] Update of "ArchitectureInternals" by JonathanEllis

2013-11-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "ArchitectureInternals" page has been changed by JonathanEllis:
https://wiki.apache.org/cassandra/ArchitectureInternals?action=diff&rev1=32&rev2=33

Comment:
link SR post

   * Configuration file is parsed by !DatabaseDescriptor (which also has all 
the default values, if any)
   * Thrift generates an API interface in Cassandra.java; the implementation is 
!CassandraServer, and !CassandraDaemon ties it together (mostly: handling 
commitlog replay, and setting up the Thrift plumbing)
   * !CassandraServer turns thrift requests into the internal equivalents, then 
!StorageProxy does the actual work, then !CassandraServer turns the results 
back into thrift again
-* CQL requests are compiled and executed through QueryProcessor.  Note 
that as of 1.2 we still support both the old cql2 dialect and the cql3, in 
different packages.
+   * CQL requests are compiled and executed through QueryProcessor.  Note that 
as of 1.2 we still support both the old cql2 dialect and the cql3, in different 
packages.
   * !StorageService is kind of the internal counterpart to !CassandraDaemon.  
It handles turning raw gossip into the right internal state and dealing with 
ring changes, i.e., transferring data to new replicas.  !TokenMetadata tracks 
which nodes own what arcs of the ring.  Starting in 1.2, each node may have 
multiple Tokens.
   * !AbstractReplicationStrategy controls what nodes get secondary, tertiary, 
etc. replicas of each key range.  Primary replica is always determined by the 
token ring (in !TokenMetadata) but you can do a lot of variation with the 
others.  !SimpleStrategy just puts replicas on the next N-1 nodes in the ring.  
!NetworkTopologyStrategy allows the user to define how many replicas to place 
in each datacenter, and then takes rack locality into account for each DC -- we 
want to avoid multiple replicas on the same rack, if possible.
   * !MessagingService handles connection pooling and running internal commands 
on the appropriate stage (basically, a threaded executorservice).  Stages are 
set up in !StageManager; currently there are read, write, and stream stages.  
(Streaming is for when one node copies large sections of its SSTables to 
another, for bootstrap or relocation on the ring.)  The internal commands are 
defined in !StorageService; look for `registerVerbHandlers`.
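
To make the stage idea in the last bullet a bit more concrete, here is a minimal Java sketch (the Verb enum, handler interface, and pool sizes are invented stand-ins for illustration, not the real !MessagingService / !StageManager code): each verb maps to a handler, and an incoming command is simply submitted to the executor backing the matching stage.

{{{
// Minimal sketch of stage-based verb dispatch, loosely modeled on the
// description above. Names (Verb, VerbHandler, toy stages) are illustrative,
// not Cassandra's real API.
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StageDispatchSketch {
    enum Verb { READ, MUTATION, STREAM }

    interface VerbHandler { void doVerb(String payload); }

    // One thread pool ("stage") per kind of work, in the spirit of StageManager.
    private final Map<Verb, ExecutorService> stages = new EnumMap<>(Verb.class);
    private final Map<Verb, VerbHandler> handlers = new EnumMap<>(Verb.class);

    public StageDispatchSketch() {
        for (Verb v : Verb.values())
            stages.put(v, Executors.newFixedThreadPool(2));
    }

    // Analogous in spirit to registerVerbHandlers in StorageService.
    public void register(Verb verb, VerbHandler handler) {
        handlers.put(verb, handler);
    }

    // An incoming internal command runs on the stage that matches its verb.
    public void receive(Verb verb, String payload) {
        stages.get(verb).submit(() -> handlers.get(verb).doVerb(payload));
    }

    public static void main(String[] args) {
        StageDispatchSketch ms = new StageDispatchSketch();
        ms.register(Verb.MUTATION, p -> System.out.println("applied mutation: " + p));
        ms.receive(Verb.MUTATION, "row-update");
        ms.stages.values().forEach(ExecutorService::shutdown);
    }
}
}}}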
@@ -15, +15 @@

  
  = Write path =
   * !StorageProxy gets the nodes responsible for replicas of the keys from the 
!ReplicationStrategy, then sends !RowMutation messages to them.
-* If nodes are changing position on the ring, "pending ranges" are 
associated with their destinations in !TokenMetadata and these are also written 
to.
+   * If nodes are changing position on the ring, "pending ranges" are 
associated with their destinations in !TokenMetadata and these are also written 
to.
-* ConsistencyLevel determines how many replies to wait for.  See 
!WriteResponseHandler.determineBlockFor.  Interaction with pending ranges is a 
bit tricky; see https://issues.apache.org/jira/browse/CASSANDRA-833
+   * ConsistencyLevel determines how many replies to wait for.  See 
!WriteResponseHandler.determineBlockFor.  Interaction with pending ranges is a 
bit tricky; see https://issues.apache.org/jira/browse/CASSANDRA-833
-* If the FailureDetector says that we don't have enough nodes alive to 
satisfy the ConsistencyLevel, we fail the request with !UnavailableException
+   * If the FailureDetector says that we don't have enough nodes alive to 
satisfy the ConsistencyLevel, we fail the request with !UnavailableException
-* When performing atomic batches, the mutations are written to the 
batchlog on the two closest nodes in the local datacenter that are alive. If 
only one other node is alive, it alone will be used, but if no other nodes are 
alive, an UnavailableException will be returned.  If the cluster has only one 
node, it will write the batchlog entry itself.  The batchlog is contained in 
the system.batchlog table.
+   * When performing atomic batches, the mutations are written to the batchlog 
on the two closest nodes in the local datacenter that are alive. If only one 
other node is alive, it alone will be used, but if no other nodes are alive, an 
UnavailableException will be returned.  If the cluster has only one node, it 
will write the batchlog entry itself.  The batchlog is contained in the 
system.batchlog table.
-* If the FD gives us the okay but writes time out anyway because of a 
failure after the request is sent or because of an overload scenario, 
!StorageProxy will write a "hint" locally to replay the write when the 
replica(s) timing out recover.  This is called HintedHandoff.  Note that HH 
does not prevent inconsistency entirely; either unclean shutdown or hardware 
failure can prevent the coordinating node from writing or replaying the hint. 
ArchitectureAntiEntropy is responsible for restoring consistency more 
completely.
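
A minimal sketch of the hinted-handoff idea just described (the class and method names below are invented for illustration, and the real code persists hints in a system table rather than holding them in memory): the coordinator records the missed mutation keyed by the replica that timed out, and replays it once the failure detector reports that replica alive again.

{{{
// Editorial sketch of hinted handoff: store the mutation locally, keyed by the
// replica that missed it, and replay when that replica is reported alive again.
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HintedHandoffSketch {
    // Hints keyed by the replica that missed the write.
    private final Map<InetAddress, List<String>> hints = new ConcurrentHashMap<>();

    // Called by the coordinator when a write to a replica times out.
    public void storeHint(InetAddress replica, String mutation) {
        hints.computeIfAbsent(replica, r -> Collections.synchronizedList(new ArrayList<>()))
             .add(mutation);
    }

    // Called when the failure detector reports the replica alive again.
    public void deliverHints(InetAddress replica) {
        List<String> pending = hints.remove(replica);
        if (pending == null) return;
        for (String mutation : pending)
            System.out.println("replaying " + mutation + " to " + replica); // re-send the write
    }

    public static void main(String[] args) throws Exception {
        HintedHandoffSketch hh = new HintedHandoffSketch();
        InetAddress replica = InetAddress.getByName("127.0.0.2");
        hh.storeHint(replica, "RowMutation{k=42}");
        hh.deliverHints(replica); // pretend the replica just came back up
    }
}
}}}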

[Cassandra Wiki] Update of "ArchitectureInternals" by JonathanEllis

2013-11-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "ArchitectureInternals" page has been changed by JonathanEllis:
https://wiki.apache.org/cassandra/ArchitectureInternals?action=diff&rev1=31&rev2=32

Comment:
link aaron's internals talk and ladis annotations

   * Crash-only design is another broadly applied principle.  
[[http://lwn.net/Articles/191059/|Valerie Henson's LWN article]] is a good 
introduction
   * Cassandra's distribution is closely related to the one presented in 
Amazon's Dynamo paper.  Read repair, adjustable consistency levels, hinted 
handoff, and other concepts are discussed there.  This is required background 
material: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html.  The 
related 
[[http://www.allthingsdistributed.com/2008/12/eventually_consistent.html|article
 on eventual consistency]] is also relevant.  Jeff Darcy's article on 
[[http://pl.atyp.us/wordpress/?p=2521|Availability and Partition Tolerance]] 
explains the underlying principle of CAP better than most.
   * Cassandra's on-disk storage model is loosely based on sections 5.3 and 5.4 
of [[http://labs.google.com/papers/bigtable.html|the Bigtable paper]].
-  * Facebook's Cassandra team authored a paper on Cassandra for LADIS 09: 
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf. 
Most of the information there is applicable to Apache Cassandra (the main 
exception is the integration of !ZooKeeper).
+  * Aaron Morton gave a [[http://www.youtube.com/watch?v=W6e8_IcgJM4|talk on 
Cassandra Internals]] at the 2013 Cassandra Summit.
+  * Facebook's Cassandra team authored a paper on Cassandra for LADIS 09, 
which has now been 
[[http://www.datastax.com/documentation/articles/cassandra/cassandrathenandnow.html|annotated
 and compared to Apache Cassandra 2.0]].
  
  


[Cassandra Wiki] Update of "ArchitectureInternals" by JonathanEllis

2013-03-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "ArchitectureInternals" page has been changed by JonathanEllis:
http://wiki.apache.org/cassandra/ArchitectureInternals?action=diff&rev1=28&rev2=29

Comment:
add note on names

   * !AbstractReplicationStrategy controls what nodes get secondary, tertiary, 
etc. replicas of each key range.  Primary replica is always determined by the 
token ring (in !TokenMetadata) but you can do a lot of variation with the 
others.  !SimpleStrategy just puts replicas on the next N-1 nodes in the ring.  
!NetworkTopologyStrategy allows the user to define how many replicas to place 
in each datacenter, and then takes rack locality into account for each DC -- we 
want to avoid multiple replicas on the same rack, if possible.
   * !MessagingService handles connection pooling and running internal commands 
on the appropriate stage (basically, a threaded executorservice).  Stages are 
set up in !StageManager; currently there are read, write, and stream stages.  
(Streaming is for when one node copies large sections of its SSTables to 
another, for bootstrap or relocation on the ring.)  The internal commands are 
defined in !StorageService; look for `registerVerbHandlers`.
   * Configuration for the node (administrative stuff, such as which 
directories to store data in, as well as global configuration, such as which 
global partitioner to use) is held by !DatabaseDescriptor. Per-KS, per-CF, and 
per-Column metadata are all stored as parts of the Schema: !KSMetadata, 
!CFMetadata, !ColumnDefinition. See also ConfigurationNotes.
+ 
+ = Some historical baggage =
+  * Some classes have misleading names, notably !ColumnFamily (which 
represents a single row, not a table of data) and !Table (which represents a 
keyspace).
  
  = Write path =
   * !StorageProxy gets the nodes responsible for replicas of the keys from the 
!ReplicationStrategy, then sends !RowMutation messages to them.
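
A simplified sketch of that coordinator-side fan-out (the ReplicationStrategy and Messaging interfaces below are illustrative stand-ins, not Cassandra's actual classes): ask the strategy for the replicas of the key, then send each one the mutation.

{{{
// Simplified coordinator write path: ask the replication strategy for the
// replicas of a key, then send each one a mutation message. Sketch only.
import java.util.List;

public class WriteCoordinatorSketch {
    interface ReplicationStrategy {
        List<String> getNaturalEndpoints(String key); // node addresses
    }
    interface Messaging {
        void send(String endpoint, String message);
    }

    private final ReplicationStrategy strategy;
    private final Messaging messaging;

    public WriteCoordinatorSketch(ReplicationStrategy strategy, Messaging messaging) {
        this.strategy = strategy;
        this.messaging = messaging;
    }

    // Roughly what "StorageProxy gets the nodes ... then sends RowMutation
    // messages to them" means at this level of detail.
    public void mutate(String key, String rowMutation) {
        for (String endpoint : strategy.getNaturalEndpoints(key))
            messaging.send(endpoint, rowMutation);
    }

    public static void main(String[] args) {
        WriteCoordinatorSketch proxy = new WriteCoordinatorSketch(
            key -> List.of("10.0.0.1", "10.0.0.2", "10.0.0.3"),
            (node, msg) -> System.out.println("send " + msg + " to " + node));
        proxy.mutate("some-key", "RowMutation{...}");
    }
}
}}}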


[Cassandra Wiki] Update of "ArchitectureInternals" by JonathanEllis

2012-09-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "ArchitectureInternals" page has been changed by JonathanEllis:
http://wiki.apache.org/cassandra/ArchitectureInternals?action=diff&rev1=27&rev2=28

Comment:
update read path

 * ConsistencyLevel determines how many replies to wait for.  See 
!WriteResponseHandler.determineBlockFor.  Interaction with pending ranges is a 
bit tricky; see https://issues.apache.org/jira/browse/CASSANDRA-833
 * If the FailureDetector says that we don't have enough nodes alive to 
satisfy the ConsistencyLevel, we fail the request with !UnavailableException
 * If the FD gives us the okay but writes time out anyway because of a 
failure after the request is sent or because of an overload scenario, 
!StorageProxy will write a "hint" locally to replay the write when the 
replica(s) timing out recover.  This is called HintedHandoff.  Note that HH 
does not prevent inconsistency entirely; either unclean shutdown or hardware 
failure can prevent the coordinating node from writing or replaying the hint. 
ArchitectureAntiEntropy is responsible for restoring consistency more 
completely.
+* Cross-datacenter writes are not sent directly to each replica; instead, 
they are sent to a single replica, with a Header in !MessageOut telling that 
replica to forward to the other ones in that datacenter
   * on the destination node, !RowMutationVerbHandler uses Table.Apply to hand 
the write first to the !CommitLog, then to the Memtable for the appropriate 
!ColumnFamily.
   * When a Memtable is full, it gets sorted and written out as an SSTable 
asynchronously by !ColumnFamilyStore.maybeSwitchMemtable (so named because 
multiple concurrent calls to it will only flush once)
 * "Fullness" is monitored by !MeteredFlusher; the goal is to flush quickly 
enough that we don't OOM as new writes arrive while we still have to hang on to 
the memory of the old memtable during flush
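
The "only flush once" behavior described in the two bullets above can be pictured as a compare-and-swap on the active memtable; the following is a rough sketch under that assumption (toy Memtable class and size threshold, not the real !ColumnFamilyStore or !MeteredFlusher code).

{{{
// Sketch of "multiple concurrent callers only flush once": swap the active
// memtable with an atomic compare-and-set, and only the caller that wins the
// swap schedules the asynchronous flush to an SSTable.
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

public class MemtableSwitchSketch {
    // A memtable is just a sorted, concurrent in-memory map in this sketch.
    static class Memtable {
        final Map<String, String> rows = new ConcurrentSkipListMap<>();
    }

    private static final int FLUSH_THRESHOLD = 4; // stand-in for the real "fullness" check

    private final AtomicReference<Memtable> active = new AtomicReference<>(new Memtable());
    private final ExecutorService flushExecutor = Executors.newSingleThreadExecutor();

    public void write(String key, String value) {
        Memtable current = active.get();
        current.rows.put(key, value);
        if (current.rows.size() >= FLUSH_THRESHOLD)
            maybeSwitchMemtable(current);
    }

    // Concurrent callers may all notice fullness, but only one compare-and-set
    // succeeds, so the old memtable is flushed exactly once.
    void maybeSwitchMemtable(Memtable full) {
        if (active.compareAndSet(full, new Memtable()))
            flushExecutor.submit(() -> flushToSSTable(full));
    }

    private void flushToSSTable(Memtable m) {
        // The real code writes the sorted contents out to disk as an SSTable.
        System.out.println("flushed " + m.rows.size() + " rows");
    }

    public static void main(String[] args) {
        MemtableSwitchSketch cfs = new MemtableSwitchSketch();
        for (int i = 0; i < 10; i++)
            cfs.write("key" + i, "value" + i);
        cfs.flushExecutor.shutdown();
    }
}
}}}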
@@ -26, +27 @@

  
  = Read path =
   * !StorageProxy gets the endpoints (nodes) responsible for replicas of the 
keys from the !ReplicationStrategy as a function of the row key (the key of the 
row being read)
-* This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a 
!RangeSliceReadCommand, depending
+* This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a 
!RangeSliceCommand, depending on the query type.  Secondary index queries are 
also part of !RangeSliceCommand.
   * !StorageProxy filters the endpoints to contain only those that are 
currently up/alive
   * !StorageProxy then sorts, by asking the endpoint snitch, the responsible 
nodes by "proximity".
 * The definition of "proximity" is up to the endpoint snitch
   * With a SimpleSnitch, proximity directly corresponds to proximity on 
the token ring.
   * With implementations based on AbstractNetworkTopologySnitch (such as 
PropertyFileSnitch), endpoints that are in the same rack are always considered 
"closer" than those that are not. Failing that, endpoints in the same data 
center are always considered "closer" than those that are not.
-  * The DynamicSnitch, typically enabled in the configuration, wraps 
whatever underlying snitch (such as SimpleSnitch and NetworkTopologySnitch) so 
as to dynamically adjust the perceived "closeness" of endpoints based on their 
recent performance. This is in an effort to try to avoid routing traffic to 
endpoints that are slow to respond.
+  * The DynamicSnitch, typically enabled in the configuration, wraps 
whatever underlying snitch (such as SimpleSnitch and PropertyFileSnitch) so as 
to dynamically adjust the perceived "closeness" of endpoints based on their 
recent performance. This is an effort to try to avoid routing more traffic to 
endpoints that are slow to respond.
   * !StorageProxy then arranges for messages to be sent to nodes as required:
 * The closest node (as determined by proximity sorting as described above) 
will be sent a command to perform an actual data read (i.e., return data to the 
co-ordinating node). 
 * As required by consistency level, additional nodes may be sent digest 
commands, asking them to perform the read locally but send back the digest only.
   * For example, at replication factor 3 a read at consistency level 
QUORUM would require one digest read in addition to the data read sent to the 
closest node. (See ReadCallback, instantiated by StorageProxy)
 * If read repair is enabled (probabilistically if read repair chance is 
somewhere between 0% and 100%), remaining nodes responsible for the row will be 
sent messages to compute the digest of the response. (Again, see ReadCallback, 
instantiated by StorageProxy)
-  * On the data node, !ReadVerbHandler gets the data from CFS.getColumnFamily 
or CFS.getRangeSlice and sends it back as a !ReadResponse
+  * On the data node, !ReadVerbHandler gets the data from CFS.getColumnFamily, 
CFS.getRangeSlice, ...
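
Pulling the read-path bullets above together, a rough sketch of how a coordinator might choose data versus digest targets could look like the following (the proximity comparator stands in for the snitch, and blockFor for the consistency-level calculation; none of these names are the real API).

{{{
// Sketch of the coordinator read fan-out described above: sort live replicas
// by snitch "proximity", send a data read to the closest, digest reads to the
// rest needed for the consistency level, and (with read_repair_chance) include
// all remaining replicas so their digests can be compared and repaired.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class ReadFanoutSketch {
    private final Random random = new Random();

    public void read(String key, List<String> liveReplicas,
                     Comparator<String> proximity,   // stand-in for the endpoint snitch
                     int blockFor,                    // e.g. 2 for QUORUM at RF=3
                     double readRepairChance) {
        List<String> sorted = new ArrayList<>(liveReplicas);
        sorted.sort(proximity);

        boolean readRepair = random.nextDouble() < readRepairChance;
        int targets = readRepair ? sorted.size() : Math.min(blockFor, sorted.size());

        for (int i = 0; i < targets; i++) {
            if (i == 0)
                System.out.println("data read   -> " + sorted.get(i));
            else
                System.out.println("digest read -> " + sorted.get(i));
        }
        // A ReadCallback-like object would then wait for blockFor responses and
        // compare digests, triggering repair on a mismatch.
    }

    public static void main(String[] args) {
        new ReadFanoutSketch().read("k", List.of("rack1-a", "rack1-b", "rack2-a"),
                                    Comparator.naturalOrder(), 2, 0.1);
    }
}
}}}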

[Cassandra Wiki] Update of "ArchitectureInternals" by JonathanEllis

2012-09-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "ArchitectureInternals" page has been changed by JonathanEllis:
http://wiki.apache.org/cassandra/ArchitectureInternals?action=diff&rev1=26&rev2=27

Comment:
update general and write sections

  = General =
   * Configuration file is parsed by !DatabaseDescriptor (which also has all 
the default values, if any)
-  * Thrift generates an API interface in Cassandra.java; the implementation is 
!CassandraServer, and !CassandraDaemon ties it together.
+  * Thrift generates an API interface in Cassandra.java; the implementation is 
!CassandraServer, and !CassandraDaemon ties it together (mostly: handling 
commitlog replay, and setting up the Thrift plumbing)
-  * !CassandraServer turns thrift requests into the internal equivalents, then 
!StorageProxy does the actual work, then !CassandraServer turns it back into 
thrift again
+  * !CassandraServer turns thrift requests into the internal equivalents, then 
!StorageProxy does the actual work, then !CassandraServer turns the results 
back into thrift again
-  * !StorageService is kind of the internal counterpart to !CassandraDaemon.  
It handles turning raw gossip into the right internal state.
-  * !AbstractReplicationStrategy controls what nodes get secondary, tertiary, 
etc. replicas of each key range.  Primary replica is always determined by the 
token ring (in !TokenMetadata) but you can do a lot of variation with the 
others.  !RackUnaware just puts replicas on the next N-1 nodes in the ring.  
!RackAware puts the first non-primary replica in the next node in the ring in 
ANOTHER data center than the primary; then the remaining replicas in the same 
as the primary.
+* CQL requests are compiled and executed through QueryProcessor.  Note 
that as of 1.2 we still support both the old cql2 dialect and the cql3, in 
different packages.
+  * !StorageService is kind of the internal counterpart to !CassandraDaemon.  
It handles turning raw gossip into the right internal state and dealing with 
ring changes, i.e., transferring data to new replicas.  !TokenMetadata tracks 
which nodes own what arcs of the ring.  Starting in 1.2, each node may have 
multiple Tokens.
+  * !AbstractReplicationStrategy controls what nodes get secondary, tertiary, 
etc. replicas of each key range.  Primary replica is always determined by the 
token ring (in !TokenMetadata) but you can do a lot of variation with the 
others.  !SimpleStrategy just puts replicas on the next N-1 nodes in the ring.  
!NetworkTopologyStrategy allows the user to define how many replicas to place 
in each datacenter, and then takes rack locality into account for each DC -- we 
want to avoid multiple replicas on the same rack, if possible.
   * !MessagingService handles connection pooling and running internal commands 
on the appropriate stage (basically, a threaded executorservice).  Stages are 
set up in !StageManager; currently there are read, write, and stream stages.  
(Streaming is for when one node copies large sections of its SSTables to 
another, for bootstrap or relocation on the ring.)  The internal commands are 
defined in !StorageService; look for `registerVerbHandlers`.
-  * Configuration for the node (administrative stuff, such as which 
directories to store data in, as well as global configuration, such as which 
global partitioner to use) is held by !DatabaseDescriptor. Per-KS, per-CF, and 
per-Column metadata are all stored as migrations across the database and can be 
updated by calls to system_update/add_* thrift calls, or can be changed locally 
and temporarily at runtime. See ConfigurationNotes.
+  * Configuration for the node (administrative stuff, such as which 
directories to store data in, as well as global configuration, such as which 
global partitioner to use) is held by !DatabaseDescriptor. Per-KS, per-CF, and 
per-Column metadata are all stored as parts of the Schema: !KSMetadata, 
!CFMetadata, !ColumnDefinition. See also ConfigurationNotes.
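
To make the !SimpleStrategy placement rule described in the list above concrete, here is a toy sketch that uses a sorted map as the ring (this is not !TokenMetadata or the real strategy code): the primary replica is the first node at or after the key's token, and the remaining replicas are the next N-1 distinct nodes walking the ring clockwise.

{{{
// Sketch of SimpleStrategy-style placement on a token ring.
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class SimpleStrategySketch {
    // token -> node address; a toy stand-in for TokenMetadata's ring view.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(long token, String node) { ring.put(token, node); }

    public List<String> getReplicas(long keyToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        // Start at the first token >= the key's token, then wrap around the ring.
        SortedMap<Long, String> tail = ring.tailMap(keyToken);
        for (String node : tail.values()) {
            if (replicas.size() == replicationFactor) break;
            if (!replicas.contains(node)) replicas.add(node);
        }
        for (String node : ring.values()) {
            if (replicas.size() == replicationFactor) break;
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }

    public static void main(String[] args) {
        SimpleStrategySketch s = new SimpleStrategySketch();
        s.addNode(0L, "A"); s.addNode(100L, "B"); s.addNode(200L, "C"); s.addNode(300L, "D");
        // Token 150 falls between B and C, so C is primary; D and A follow.
        System.out.println(s.getReplicas(150L, 3)); // [C, D, A]
    }
}
}}}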
  
  = Write path =
   * !StorageProxy gets the nodes responsible for replicas of the keys from the 
!ReplicationStrategy, then sends !RowMutation messages to them.
 * If nodes are changing position on the ring, "pending ranges" are 
associated with their destinations in !TokenMetadata and these are also written 
to.
-* If nodes that should accept the write are down, but the remaining nodes 
can fulfill the requested !ConsistencyLevel, the writes for the down nodes will 
be sent to another node instead, with a header (a "hint") saying that data 
associated with that key should be sent to the replica node when it comes back 
up.  This is called HintedHandoff and reduces the "eventual" in "eventual 
consistency."  Note that HintedHandoff is only an '''optimization'''; 
ArchitectureAntiEntropy is responsible for restoring consistency more 
completely.
+* ConsistencyLevel determines how many replies to wait for.  See 
!WriteResponseHandler.determineBlockFor.  Interaction with pending ranges is a 
bit tricky; see https://issues.apache.org/jira/browse/CASSANDRA-833
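
A rough sketch of the determineBlockFor calculation referenced above, together with the availability check that produces !UnavailableException (simplified: the real !WriteResponseHandler also accounts for pending ranges, LOCAL_/EACH_QUORUM, and datacenters).

{{{
// Sketch: how many replica acks a write waits for at each consistency level,
// and the up-front availability check. Simplified for illustration.
public class BlockForSketch {
    enum ConsistencyLevel { ANY, ONE, TWO, QUORUM, ALL }

    static int determineBlockFor(ConsistencyLevel cl, int replicationFactor) {
        switch (cl) {
            case ANY:
            case ONE:    return 1;
            case TWO:    return 2;
            case QUORUM: return replicationFactor / 2 + 1;
            case ALL:    return replicationFactor;
            default:     throw new IllegalArgumentException("unknown level " + cl);
        }
    }

    static void assureSufficientLiveNodes(int liveReplicas, int blockFor) {
        if (liveReplicas < blockFor)
            throw new RuntimeException(   // stands in for UnavailableException
                "need " + blockFor + " live replicas, only " + liveReplicas + " are up");
    }

    public static void main(String[] args) {
        int rf = 3;
        int blockFor = determineBlockFor(ConsistencyLevel.QUORUM, rf); // 2
        assureSufficientLiveNodes(2, blockFor); // ok: 2 live nodes satisfy QUORUM at RF=3
        System.out.println("waiting for " + blockFor + " acks");
    }
}
}}}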

[Cassandra Wiki] Update of "ArchitectureInternals" by JonathanEllis

2011-03-21 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "ArchitectureInternals" page has been changed by JonathanEllis.
The comment on this change is: fix regression in compaction description.
http://wiki.apache.org/cassandra/ArchitectureInternals?action=diff&rev1=18&rev2=19

--

 * If nodes that should accept the write are down, but the remaining nodes 
can fulfill the requested !ConsistencyLevel, the writes for the down nodes will 
be sent to another node instead, with a header (a "hint") saying that data 
associated with that key should be sent to the replica node when it comes back 
up.  This is called HintedHandoff and reduces the "eventual" in "eventual 
consistency."  Note that HintedHandoff is only an '''optimization'''; 
ArchitectureAntiEntropy is responsible for restoring consistency more 
completely.
   * on the destination node, !RowMutationVerbHandler uses Table.Apply to hand 
the write first to !CommitLog.java, then to the Memtable for the appropriate 
!ColumnFamily.
   * When a Memtable is full, it gets sorted and written out as an SSTable 
asynchronously by !ColumnFamilyStore.switchMemtable
-* When enough SSTables exist, they are merged by 
!ColumnFamilyStore.forceMajorCompaction
+* When enough SSTables exist, they are merged by 
!CompactionManager.doCompaction
   * Making this concurrency-safe without blocking writes or reads while we 
remove the old SSTables from the list and add the new one is tricky, because 
naive approaches require waiting for all readers of the old sstables to finish 
before deleting them (since we can't know if they have actually started opening 
the file yet; if they have not and we delete the file first, they will error 
out).  The approach we have settled on is to not actually delete old SSTables 
synchronously; instead we register a phantom reference with the garbage 
collector, so when no references to the SSTable exist it will be deleted.  (We 
also write a compaction marker to the file system so if the server is restarted 
before that happens, we clean out the old SSTables at startup time.)
+  * A "major" compaction of merging _all_ sstables may be manually 
initiated by the user; this results in submitMajor calling doCompaction with 
all the sstables in the ColumnFamily, rather than just sstables of similar size.
   * See [[ArchitectureSSTable]] and ArchitectureCommitLog for more details
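
The phantom-reference trick described a couple of bullets above is a standard JDK mechanism; the following sketch shows its general shape with a toy SSTableReader (this is not Cassandra's actual deletion code): stop tracking the in-memory object, and only delete the file once the garbage collector proves no reader still references it.

{{{
// Sketch of deferred file deletion via phantom references.
import java.io.File;
import java.io.IOException;
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class PhantomCleanupSketch {
    static class SSTableReader {              // toy stand-in for the real reader
        final File dataFile;
        SSTableReader(File f) { this.dataFile = f; }
    }

    static class DeletingReference extends PhantomReference<SSTableReader> {
        final File dataFile;                  // kept separately; the referent is gone by cleanup time
        DeletingReference(SSTableReader r, ReferenceQueue<SSTableReader> q) {
            super(r, q);
            this.dataFile = r.dataFile;
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        ReferenceQueue<SSTableReader> queue = new ReferenceQueue<>();
        File f = File.createTempFile("sstable", ".db");
        SSTableReader reader = new SSTableReader(f);
        DeletingReference ref = new DeletingReference(reader, queue);

        reader = null;                        // no more readers hold the SSTable
        System.gc();                          // a hint; in practice GC runs on its own schedule

        Reference<? extends SSTableReader> enqueued = queue.remove(5000);
        if (enqueued == ref)
            System.out.println("deleting " + ref.dataFile + ": " + ref.dataFile.delete());
        else
            System.out.println("referent not collected yet; would retry later");
    }
}
}}}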
  
  = Read path =