Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "LargeDataSetConsiderations_JP" page has been changed by MakiWatanabe. http://wiki.apache.org/cassandra/LargeDataSetConsiderations_JP?action=diff&rev1=20&rev2=21 -------------------------------------------------- * リペア操作にはある程度のディスク容量が必要です。(0.6では特に顕著です。0.7ではそれほどでもありません。TODO: 具体的な最大値、依存するパラメータを明示すること。) * データ量が多くなるにつれ、ディスクIO操作を避けるためにキャッシュへの依存が強まります。キャバシティに関するプランニングとテストの際には以下のことを考慮すべきです。 - * Cassandra の行キャッシュはJVMのヒープ上に存在し、compactionやrepairの影響を受けません。これは利点ですが、一方でメモリの有効利用という点では行キャッシュはOSのページキャッシュほど効率的でありません。 + * Cassandra の行キャッシュはJVMのヒープ上に存在し、compactionやrepairの影響を受けません。 + * For 0.6.8 and below, the key cache is affected by compaction because it is per-sstable, and compaction moves data to new sstables. + 0.6.8以前はキーキャッシュはcompactionによって影響を受けます。それらのバージョンではキーキャッシュはSSTABLE単位で管理されているため、compactionによってデータが新しいSSTABLEにコピーされると古いキャッシュが無効になります。 + * Was fixed/improved as of + この動作は0.6.9 及び0.7.0で改善されています。[[https://issues.apache.org/jira/browse/CASSANDRA-1878|CASSANDRA-1878]] - * For 0.6.8 and below, the key cache is affected by compaction because it is per-sstable, and compaction moves data to new sstables. - * Was fixed/improved as of [[https://issues.apache.org/jira/browse/CASSANDRA-1878|CASSANDRA-1878]], for 0.6.9 and 0.7.0. - * The operating system's page cache is affected by compaction and repair operations. If you are relying on the page cache to keep the active set in memory, you may see significant degradation on performance as a result of compaction and repair operations. + + * OSのページキャッシュはcompaction及びrepair操作の影響を受けます。アクティブなデータをメモリ上に保つ手段としてページキャッシュに依存している場合、compaction及びrepair操作に連動して顕著な性能劣化が起きるでしょう。 + + - * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]], [[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]]. + * 将来的な改善方法については以下のリンクで議論されています:[[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]], [[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]] - * If you have column families with more than 143 million row keys in them, bloom filter false positive rates are likely to go up because of implementation concerns that limit the maximum size of a bloom filter. See [[ArchitectureInternals]] for information on how bloom filters are used. The negative effects of hitting this limit is that reads will start taking additional seeks to disk as the row count increases. Note that the effect you are seeing at any given moment will depend on when compaction was last run, because the bloom filter limit is per-sstable. It is an issue for column families because after a major compaction, the entire column family will be in a single sstable. + + * bloom filterの最大サイズの実装上の制限により、14300万以上の行キーを格納しているカラムファミリでは、bloom filterの偽陽性率が増加することが予想されます。bloom filterがどのように使用されているかについては[[ArchitectureInternals]]を参照してください。この制限に抵触した場合、行数の増加に従ってreadごとに追加のseekが発生するようになります。bloom filterの制限はsstable単位であるため、上記の影響は最後にcompactionが実行された時間に依存することに注意してください。major compactionの後ではカラムファミリの全データが単一のsstableに格納されるため、これはカラムファミリ単位の問題です。 - * This will likely be addressed in the future: See [[https://issues.apache.org/jira/browse/CASSANDRA-1608|CASSANDRA-1608]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]] + * この問題については以下のリンクで議論されています。[[https://issues.apache.org/jira/browse/CASSANDRA-1608|CASSANDRA-1608]], [[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]] - * Compaction is currently not concurrent, so only a single compaction runs at a time. 
 * Compaction is currently not concurrent, so only a single compaction runs at a time. This means that sstable counts may spike during larger compactions, because several smaller sstables are written while a large compaction is in progress. This can cause additional seeks on reads.
  * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1876|CASSANDRA-1876]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1881|CASSANDRA-1881]].
 * Consider the choice of file system. Removal of large files is notoriously slow and seek-bound on e.g. ext2/ext3; consider xfs or ext4. This affects the background unlink():ing of sstables that happens every now and then, and it also affects start-up time (if there are sstables pending removal when a node starts up, they are removed as part of the start-up process; it may thus be detrimental if removing a terabyte of sstables takes an hour (numbers are ballpark, not accurately measured, and depend on circumstances)).
 * Adding nodes is a slow process if each node is responsible for a large amount of data. Plan for this; do not try to throw additional hardware at a cluster at the last minute.
 * Cassandra will read through sstable index files on start-up, doing what is known as "index sampling". This is used to keep a subset (currently, and by default, 1 out of 100) of the keys and their on-disk locations in the index in memory. See [[ArchitectureInternals]]. This means that the larger the index files are, the longer this sampling takes. Thus, for very large indexes (typically when you have a very large number of keys), the index sampling on start-up may be a significant issue. (A small sketch of the sampling idea follows this list.)
 * A negative side-effect of a large row cache is start-up time. The periodic saving of row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. On a large data set this is probably going to be seek-bound, and the time it takes to warm up the row cache will be linear in the row cache size (assuming amounts of data large enough that the seek-bound I/O is not optimized away by the disks). (A back-of-the-envelope warm-up estimate follows this list.)
  * Potential future improvement: [[https://issues.apache.org/jira/browse/CASSANDRA-1625|CASSANDRA-1625]].
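
The per-sstable key cache behaviour described above (0.6.8 and below) can be illustrated with a toy model. This is not Cassandra's actual cache code; the names and numbers are made up, and only the idea that entries are keyed per sstable, and therefore orphaned by compaction, comes from the page.

{{{#!python
# Toy model only -- not Cassandra's real key cache -- showing why a cache
# keyed per sstable goes cold after compaction: its entries point at
# sstables that no longer exist.
key_cache = {
    ("sstable-1", "user:42"): 1024,   # (sstable, row key) -> cached index offset
    ("sstable-2", "user:99"): 4096,
}

def compact(old_sstables, new_sstable):
    # Compaction rewrites the rows into new_sstable; nothing has read from
    # it yet, so no cache entries exist for it and the old ones are dropped.
    for sstable, key in list(key_cache):
        if sstable in old_sstables:
            del key_cache[(sstable, key)]

compact({"sstable-1", "sstable-2"}, "sstable-3")
print(key_cache.get(("sstable-3", "user:42")))  # None: reads miss until the cache re-warms
}}}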
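For the 143-million-key bloom filter bullet, the standard false-positive formula shows how quickly the rate climbs once the filter can no longer grow with the key count. The 2^31-bit cap and the hash count used below are assumptions chosen to roughly match the 143 million figure, not values taken from the page.

{{{#!python
import math

def bloom_fp_rate(num_keys, num_bits, num_hashes):
    # Standard bloom filter false-positive probability:
    # p ~= (1 - e^(-k*n/m))^k for n keys, m bits, k hash functions.
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes

# Assumption: the filter is capped at 2**31 bits; at ~15 bits per key that
# cap is reached at roughly 2**31 / 15 ~= 143 million keys.
MAX_BITS = 2 ** 31
NUM_HASHES = 10  # near-optimal for ~15 bits per key

for keys in (100e6, 143e6, 300e6, 600e6):
    print("%.0fM keys -> false positive rate %.5f"
          % (keys / 1e6, bloom_fp_rate(keys, MAX_BITS, NUM_HASHES)))
}}}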
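The index sampling bullet can be sketched as follows. The code illustrates the idea (keep 1 key in 100 in memory, binary-search the sample, then scan a bounded stretch of the on-disk index); it is not the actual implementation, and the key/offset layout is invented for the example.

{{{#!python
import bisect

INDEX_INTERVAL = 100  # the 1-out-of-100 default mentioned above

def build_index_sample(index_entries):
    # index_entries: the full on-disk index as (row_key, offset) pairs in
    # sorted key order.  Only every INDEX_INTERVAL-th entry is kept in
    # memory, which is why start-up time grows with the index file size.
    return index_entries[::INDEX_INTERVAL]

def locate(sample, row_key):
    # Binary-search the in-memory sample; the returned offset bounds a scan
    # of at most INDEX_INTERVAL on-disk index entries for the exact key.
    keys = [k for k, _ in sample]
    i = max(bisect.bisect_right(keys, row_key) - 1, 0)
    return sample[i][1]

index = [("key%08d" % i, i * 64) for i in range(1000000)]
sample = build_index_sample(index)    # only ~10,000 entries stay in memory
print(locate(sample, "key00123456"))  # offset of the nearest sampled key at or before it
}}}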
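Finally, the point that row cache warm-up time is linear in the cache size can be made concrete with a back-of-the-envelope estimate. Only the linearity comes from the page; the assumption of one random seek per cached row at about 8 ms is illustrative.

{{{#!python
SEEK_SECONDS = 0.008  # assumed cost of one random read on a spinning disk

def warmup_hours(cached_rows, seek_seconds=SEEK_SECONDS):
    # One seek-bound pre-fetch per cached row => warm-up time grows
    # linearly with the configured row cache size.
    return cached_rows * seek_seconds / 3600.0

for rows in (1000000, 5000000, 20000000):
    print("%8d cached rows -> ~%.1f hours to pre-fetch" % (rows, warmup_hours(rows)))
}}}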