[Cassandra Wiki] Update of "DataModel_JP" by shot6

Apache Wiki Tue, 13 Apr 2010 02:37:52 -0700


The "DataModel_JP" page has been changed by shot6.
http://wiki.apache.org/cassandra/DataModel_JP?action=diff&rev1=10&rev2=11

--------------------------------------------------

  ## page was copied from DataModel
- = Introduction =
+ = Introduction =
  
- Cassandra has a data model that can most easily be thought of as a four or 
five dimensional hash.
+ Cassandra has a data model that is most easily thought of as a four- or five-dimensional hash.
  
- The basic concepts are:
-  * Cluster: the machines (nodes) in a logical Cassandra instance.  Clusters 
can contain multiple keyspaces.
-  * Keyspace: a namespace for !ColumnFamilies, typically one per application.
-  * !ColumnFamilies contain multiple columns, each of which has a name, value, 
and a timestamp, and which are referenced by row keys.
-  * !SuperColumns can be thought of as columns that themselves have subcolumns.
  
- We'll start from the bottom up, moving from the leaves of Cassandra's data 
structure (columns) up to the root of the tree (the cluster).
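+ The basic concepts are:
+  * Cluster: a logical Cassandra instance (its machines, or nodes). A cluster can contain multiple keyspaces.
+  * Keyspace: a namespace for !ColumnFamilies, typically one per application.
+  * !ColumnFamilies contain multiple columns, each of which has a name, a value, and a timestamp, and which are referenced by row keys.
+  * !SuperColumns are columns that themselves have subcolumns.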
  
- = Columns =
  
- The column is the lowest/smallest increment of data. It's a tuple (triplet) 
that contains a name, a value and a timestamp.
  
- Here's the thrift interface definition of a Column
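+ We'll work bottom-up, starting with the column, the smallest-grained data structure.
+ 
+ = Columns =
+ 
+ The column is the smallest unit of data in Cassandra. It is a tuple (a triplet) that contains a name, a value, and a timestamp.
+ 
+ 
+ Here is the Thrift interface definition of a Column: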
  {{{
  struct Column {
    1: binary                        name,
@@ -25, +29 @@

    3: i64                           timestamp,
  }
  }}}
- And here's a column represented in JSON-ish notation:
+ 
+ 
+ A column in JSON-ish notation looks like this:
  {{{
  {
    "name": "emailAddress",
@@ -34, +40 @@

  }
  }}}
  
- All values are supplied by the client, including the 'timestamp'.  This means 
that clocks on the clients should be synchronized (in the Cassandra server 
environment is useful also), as these timestamps are used for conflict 
resolution.  In many cases the 'timestamp' is not used in client applications, 
and it becomes convenient to think of a column as a name/value pair. For the 
remainder of this document, 'timestamps' will be elided for readability.  It is 
also worth noting the name and value are binary values, although in many 
applications they are UTF8 serialized strings.
  
- Timestamps can be anything you like, but microseconds since 1970 is a 
convention. Whatever you use, it must be consistent across the application 
otherwise earlier changes may overwrite newer ones.
  
- = Column Families =
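+ All values of a column, including the timestamp, are supplied by the client. This means that clocks on the clients should be synchronized (synchronizing clocks in the Cassandra server environment is useful too), because these timestamps are used for conflict resolution. In most cases the timestamp is not used by client applications, so it is convenient to think of a column as a name/value pair. For the remainder of this document, timestamps are omitted for readability. Note also that column names and values are stored as binary values, although in most applications they are UTF-8 serialized strings, and this document treats them that way too.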
  
- A column family is a container for columns, analogous to the table in a 
relational system.  You define column families in your storage-conf.xml file, 
and cannot modify them (or add new column families) without restarting your 
Cassandra process.  A column family holds an ordered list of columns, which you 
can reference by the column name.
  
- Column families have a configurable ordering applied to the columns within 
each row, which affects the behavior of the get_slice call in the thrift API.  
Out of the box ordering implementations include ASCII, UTF-8, Long, and UUID 
(lexical or time).
+ Timestamps can be anything you like, but microseconds since 1970 is the convention. Whatever you use, it must be consistent across the application; otherwise earlier changes may overwrite newer ones.
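
The timestamp rule above can be sketched in a few lines of Python. This is only an illustration of "highest timestamp wins", not Cassandra's actual implementation: the helper names `now_micros` and `resolve`, the sample values, and the choice to keep the existing column on a tie are all assumptions made for the example.

```python
import time


def now_micros():
    """Microseconds since the Unix epoch, the conventional timestamp unit."""
    return int(time.time() * 1_000_000)


def resolve(existing, incoming):
    """Return the column that wins: the one with the higher timestamp.

    Assumption for this sketch: on a tie, the existing column is kept.
    """
    if existing is None or incoming["timestamp"] > existing["timestamp"]:
        return incoming
    return existing


# Hypothetical columns. The first write used microseconds.
old = {"name": "emailAddress", "value": "old@example.com", "timestamp": 2_000_000}
# A later write from a client that mistakenly used a smaller unit.
new = {"name": "emailAddress", "value": "new@example.com", "timestamp": 2_000}

# The later write loses because its timestamp is numerically smaller.
winner = resolve(old, new)
```

Here `winner` is the older column, which is the failure mode the paragraph warns about when timestamp units are not consistent across the application.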
  
- = Rows =
  
- In Cassandra, each column family is stored in a separate file, and the file 
is sorted in row (i.e. key) major order. Related columns, those that you'll 
access together, should be kept within the same column family.
  
- The row key is what determines what machine data is stored on.  Thus, for 
each key you can have data from multiple column families associated with it.  
However, these are logically distinct, which is why the Thrift interface is 
oriented around accessing one !ColumnFamily per key at a time.  (TODO given 
this, is the following JSON more confusing than helpful?)
+ = Column Families =
  
- A JSON representation of the key -> column families -> column structure is
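+ A column family is a container for columns, analogous to a table in the relational model. Column families are defined in storage-conf.xml, and you cannot modify them or add new ones without restarting Cassandra (note: this may change in 0.6 and later). A column family holds a list of columns kept in column-name order.
+ 
+ 
+ Column families have a configurable ordering applied to the columns within each row, which affects the behavior of the get_slice call in the Thrift API.
+ Out of the box, the ordering implementations you can choose from are ASCII, UTF-8, Long, lexical UUID, and TimeUUID (note: you can also implement your own).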
+ 
+ 
+ 
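+ = Rows =
+ 
+ In Cassandra, each column family is stored in a separate file, and the file is sorted in row (i.e. key) major order. Related columns, those you will access together, should be kept in the same column family.
+ 
+ 
+ The row key determines which machine the data is stored on. Thus, for each key you can have data from multiple column families associated with it.
+ However, these are logically distinct, which is why the Thrift interface is oriented around accessing one !ColumnFamily per key at a time.
+ 
+ 
+ A JSON representation of the key -> column families -> column structure is: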
  {{{
  {
     "mccv":{
@@ -71, +89 @@

  }
  }}}
  
- Note that the key "mccv" identifies data in two different column families, 
"Users" and "Stats". This does not imply that data from these column families 
is related.  The semantics of having data for the same key in two different 
column families is entirely up to the application.  Also note that within the 
"Users" column family, "mccv" and "user2" have different column names defined.  
This is perfectly valid in Cassandra.  In fact there may be a virtually 
unlimited set of column names defined, which leads to fairly common use of the 
column name as a piece of runtime populated data.  This is unusual in storage 
systems, particularly if you're coming from the RDBMS world.
  
- = Keyspaces =
  
- A keyspace is the first dimension of the Cassandra hash, and is the container 
for column families. Keyspaces are of roughly the same granularity as a schema 
or database (i.e. a logical collection of tables) in the RDBMS world.  They are 
the configuration and management point for column families, and is also the 
structure on which batch inserts are applied.
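+ Note that in the example above the key "mccv" identifies data in two different column families, "Users" and "Stats". This does not imply that data from these column families is related. The semantics of having data for the same key in two different column families are entirely up to the application. Also note that within the "Users" column family, "mccv" and "user2" have different column names defined. This is perfectly valid in Cassandra. In fact, a virtually unlimited set of column names can be defined, so it is fairly common to use the column name itself as a piece of data populated at runtime. This is unusual in storage systems, particularly if you are coming from the RDBMS world.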
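
As a rough illustration of the key -> column family -> column structure, the sketch below models it with nested Python dicts. It is not a Cassandra client: the helpers `get_columns` and `add_column` and every column name and value are hypothetical, chosen only to show that rows in one column family can carry different column names and that new names can appear at runtime.

```python
# row key -> column family -> {column name: value}; all values are made up.
rows = {
    "mccv": {
        "Users": {"emailAddress": "mccv@example.com", "webSite": "http://example.com"},
        "Stats": {"visits": "243"},
    },
    "user2": {
        "Users": {"emailAddress": "user2@example.com", "twitter": "user2"},
    },
}


def get_columns(rows, key, column_family):
    """Return the columns stored under one key in one column family."""
    return rows.get(key, {}).get(column_family, {})


def add_column(rows, key, column_family, name, value):
    """Add a column whose name is chosen at runtime; no schema change is needed."""
    rows.setdefault(key, {}).setdefault(column_family, {})[name] = value


# "user2" gains a column name that no other row has.
add_column(rows, "user2", "Stats", "lastLogin", "2010-04-13")

mccv_user_columns = sorted(get_columns(rows, "mccv", "Users"))
user2_user_columns = sorted(get_columns(rows, "user2", "Users"))
```

Reading one column family per key at a time, as `get_columns` does, mirrors the orientation of the Thrift interface described above.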
  
- = Super Columns =
  
- So far we've covered "normal" columns and rows.  Cassandra also supports 
super columns: columns whose values are super columns; that is, a super column 
is a (sorted) associative array of columns.
  
- One can thus think of columns and super columns in terms of maps: A row in a 
regular column family is basically a sorted map of column names to column 
values; a row in a super column family is a sorted map of super column names to 
maps of column names to column values.
+ = Keyspaces =
  
- A JSON description of this layout:
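+ A keyspace is the container for column families and is the first dimension of the Cassandra hash. Keyspaces are of roughly the same granularity as a schema or database (i.e. a logical collection of tables) in the RDBMS world. They are the configuration and management point for column families, and are also the structure on which batch inserts are applied.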
+ 
+ 
+ 
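+ = Super Columns =
+ 
+ So far we have covered "normal" columns and rows. Cassandra also supports super columns: a super column is a (sorted) associative array of columns.
+ 
+ 
+ One can thus think of columns and super columns in terms of maps: a row in a regular column family is a sorted map of column names to column values; a row in a super column family is a sorted map of super column names to maps of column names to column values.
+ 
+ 
+ A JSON description of this layout: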
  {{{
  {
    "mccv": {
@@ -99, +125 @@

    }
  }
  }}}
- Here my column family is "Tags".  I have two super columns defined here, 
"cassandra" and "thrift".  Within these I have specific named bookmarks, each 
of which is a column.
  
- Just like normal columns, super columns are sparse: each row may contain as 
many or as few as it likes; Cassandra imposes no restrictions.
  
- = Range queries =
+ Here the column family is "Tags", and it has two super columns, "cassandra" and "thrift". Within these are named bookmarks, each of which is a column.
  
- Cassandra supports pluggable partitioning schemes with a relatively small 
amount of code.  Out of the box, Cassandra provides the hash-based 
RandomPartitioner and an OrderPreservingPartitioner.  RandomPartitioner gives 
you pretty good load balancing with no further work required.  
OrderPreservingPartitioner on the other hand lets you perform range queries on 
the keys you have stored, but requires choosing node tokens carefully or active 
load balancing.  Systems that only support hash-based partitioning cannot 
perform range queries efficiently.
+ Just like normal columns, super columns are sparse: each row may contain as many or as few as it likes. Cassandra imposes no restrictions.
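
The "sorted map of maps" view of a super column family row can be sketched with plain Python dicts. This is a toy model under stated assumptions: `sorted_map` is a hypothetical helper, the bookmark names and URLs are invented, and plain string ordering stands in for whatever column ordering the column family is configured with.

```python
def sorted_map(mapping):
    """Return a dict whose keys are in sorted order (dicts keep insertion order)."""
    return {key: mapping[key] for key in sorted(mapping)}


# One row of a "Tags" super column family:
# super column name -> {column name -> column value}
tags_row = sorted_map({
    "thrift": sorted_map({"1234": "http://example.com/thrift"}),
    "cassandra": sorted_map({
        "incubator": "http://example.com/incubator",
        "howto": "http://example.com/howto",
    }),
})

# Sparse: another row may hold entirely different super columns, or very few.
other_row = sorted_map({"python": sorted_map({"docs": "http://example.com/python"})})

super_column_names = list(tags_row)
cassandra_bookmarks = list(tags_row["cassandra"])
```

Both levels come back in sorted order, and nothing forces `other_row` to share super column names with `tags_row`.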
  
- = Modeling your application =
  
- Unlike with relational systems, where you model entities and relationships 
and then just add indexes to support whatever queries become necessary, with 
Cassandra you need to think about what queries you want to support efficiently 
ahead of time, and model appropriately.  Since there are no 
automatically-provided indexes, you will be much closer to one !ColumnFamily 
per query than you would have been with tables:queries relationally.  Don't be 
afraid to denormalize accordingly; Cassandra is much, much faster at writes 
than relational systems.
+ = Range queries =
  
- Arin Sarkissian of Digg has an excellent post detailing 
[[http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model|Cassandra's 
data model]] with highly illustrative examples.
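+ Cassandra supports pluggable partitioning schemes with a relatively small amount of code. Out of the box, Cassandra provides the hash-based RandomPartitioner and an OrderPreservingPartitioner. RandomPartitioner gives good load balancing with no further work required. OrderPreservingPartitioner, on the other hand, lets you perform range queries on the keys you have stored, but requires choosing node tokens carefully or active load balancing. Systems that support only hash-based partitioning cannot perform range queries efficiently.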
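
Why key order matters for range queries can be shown with a small Python sketch. It does not use Cassandra's partitioner classes: `random_token`, `order_preserving_token`, and `range_query` are hypothetical helpers, MD5 is only a stand-in hash, and the keys are invented.

```python
import bisect
import hashlib


def random_token(key):
    """Hash-based token: spreads keys evenly but destroys key order."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def order_preserving_token(key):
    """Order-preserving token: the key itself, so token order equals key order."""
    return key


def range_query(sorted_keys, start, end):
    """Keys in [start, end] from a key-sorted list, found with two binary searches."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


keys = ["user1", "user2", "user3", "user4", "user5"]

# With order-preserving tokens, a key range is one contiguous slice.
by_key = sorted(keys, key=order_preserving_token)
contiguous = range_query(by_key, "user2", "user4")

# With hashed tokens, neighbouring keys land at unrelated positions,
# so a key range can only be answered by examining every key.
by_hash = sorted(keys, key=random_token)
scanned = sorted(k for k in by_hash if "user2" <= k <= "user4")
```

Both approaches return the same keys, but only the order-preserving layout lets the query touch a single contiguous slice instead of the whole key set.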
  
- See the CassandraLimitations page for other things to keep in mind when 
designing a model.
  
- == Example: SuperColumns for Search Apps ==
  
- You can think of each super column name as a term and the columns within as 
the docids with rank info and other attributes being a part of it. If you have 
keys as the userids then you can have a per-user index stored in this form. 
This is how the per user index for term search is laid out for Inbox search at 
Facebook. Furthermore since one has the option of storing data on disk sorted 
by "Time" it is very easy for the system to answer queries of the form "Give me 
the 10 most recent messages". For a pictorial explanation please refer to the 
Cassandra powerpoint slides presented at SIGMOD 2008.
+ = Modeling your application =
  
- == Example: multiuser blog ==
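+ Unlike relational systems, where you model entities and relationships and then add indexes to support whatever queries become necessary, with Cassandra you need to think ahead of time about which queries you want to support efficiently, and model accordingly. Since there are no automatically provided indexes, you will end up much closer to one !ColumnFamily per query than you would with tables and queries in a relational design. Don't be afraid to denormalize accordingly; Cassandra is much, much faster at writes than relational systems.
+ 
+ 
+ Arin Sarkissian of Digg has written [[http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model|an excellent post]] on Cassandra's data model; please refer to it as well.
+ 
+ 
+ See [[CassandraLimitations|Cassandra's limitations]] for other things to keep in mind when designing a model.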
+ 
+ 
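+ == Example: SuperColumns for Search Apps ==
+ 
+ In a search application, you can think of each super column name as a term, and the columns within it as docids carrying rank information and other attributes.
+ If you use userids as keys, you can store a per-user index in this form. This is how the per-user index for term search is laid out for Inbox search at Facebook. Furthermore, since data can be stored on disk sorted by time, the system can easily answer queries of the form "Give me the 10 most recent messages". For a pictorial explanation, refer to the Cassandra slides presented at SIGMOD 2008.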
+ 
+ 
+ 
+ == Example: blog application ==
  
  TODO
  
  = Thrift API =
  
- Moved to [[API]].
+ Moved to [[API]].
  
  = Attribution =
+ 
+ 
  Thanks to phatduckk and asenchi for coming up with examples, text, and 
reviewing concepts.
