Controlling the MAX SIZE of sstables after compaction

2015-01-25 Thread Parth Setya
Hi


*Setup*

3 Node Cluster
API - *Hector*
CL - *QUORUM*
RF - *3*
Compaction Strategy - *Size Tiered Compaction*

*Use Case*
I have about *320 million rows* (~12 to 15 columns each) worth of data
stored in Cassandra. In order to generate a report containing ALL of that
data, I do the following (a rough sketch follows the list):
1. Run compaction
2. Take a snapshot of the DB
3. Run sstable2json on all the *Data.db files
4. Read those JSON files and write them out to a CSV.
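
Here is a rough sketch of this pipeline driven from Java via ProcessBuilder,
just to make the steps concrete. The keyspace, column family, snapshot tag,
and data path are hypothetical placeholders, and step 4 (JSON to CSV) is
omitted:

import java.io.File;

// Sequential pipeline: compact, snapshot, then sstable2json one file at a time.
// All names and paths below are hypothetical placeholders.
public class SequentialReportPipeline {

    static void run(String... cmd) throws Exception {
        new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // 1. Major compaction and 2. snapshot of the column family
        run("nodetool", "compact", "my_keyspace", "my_cf");
        run("nodetool", "snapshot", "-t", "report", "my_keyspace");

        // 3. sstable2json on every *Data.db file in the snapshot, sequentially
        File dir = new File(
            "/var/lib/cassandra/data/my_keyspace/my_cf/snapshots/report");
        for (File data : dir.listFiles((d, name) -> name.endsWith("Data.db"))) {
            new ProcessBuilder("sstable2json", data.getPath())
                .redirectOutput(new File(data.getName() + ".json"))
                .start()
                .waitFor();
        }
        // 4. A separate step reads the JSON files and writes the CSV.
    }
}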

*Problem*:
The *sstable2json* utility takes about 350-400 hours (~85% of the total
time), which dominates the process. I am running sstable2json sequentially on
all the *Data.db files, but their sizes are very uneven, so simply running it
concurrently doesn't help much either (e.g. one file is 25 GB while another
is 500 MB, so the largest file still gates the total time).

*My Thought Process:*
Is there a way to put a cap on the maximum size of the sstables that are
generated after compaction, so that I end up with multiple sstables of
roughly uniform size? Then I could run the sstable2json utility on them
concurrently (see the sketch below).
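
For reference, here is a minimal sketch of running sstable2json concurrently
over the snapshot's Data.db files with a fixed-size thread pool (the path and
pool size are assumptions). With sizes as uneven as today's, the 25 GB file
would still dominate the wall-clock time, which is why capping the sstable
size would make this parallelism pay off:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Run sstable2json on every Data.db file in parallel.
// Snapshot path and pool size are hypothetical placeholders.
public class ConcurrentSstable2Json {
    public static void main(String[] args) throws Exception {
        File dir = new File(
            "/var/lib/cassandra/data/my_keyspace/my_cf/snapshots/report");
        ExecutorService pool = Executors.newFixedThreadPool(4); // e.g. one per core

        for (File data : dir.listFiles((d, name) -> name.endsWith("Data.db"))) {
            pool.submit(() -> {
                try {
                    new ProcessBuilder("sstable2json", data.getPath())
                        .redirectOutput(new File(data.getName() + ".json"))
                        .start()
                        .waitFor();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS); // wait for all conversions
    }
}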

*Questions:*
1. Is there a way to configure the size of sstables created after
compaction?
2. Is there a better approach to generate the report?
3. What are the flaws with this approach?

Best
Parth


SSTables can't compact automatically

2015-01-25 Thread 曹志富
Hi everybody:

I have 18 nodes running Cassandra 2.1.2. Every node has 4 cores, 32 GB RAM,
and a 2 TB hard disk; the OS is CentOS release 6.2 (Final).

I have followed the  to configure my system, e.g. disabling swap and
setting unlimited memlock...

My heap size is:

MAX_HEAP_SIZE="8G"
MIN_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"

I use STCS and leave the other configuration at the defaults. I am using the
DataStax Java Driver 2.1.2, committing a BatchStatement of 100 keys at a time
(sketched below).
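
For context, a minimal sketch of that write path with the 2.1 driver,
assuming a hypothetical keyspace, table, and contact point (this is not my
exact code, just the shape of it):

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Sketch of the described write path: DataStax Java Driver 2.1,
// committing a BatchStatement every 100 keys. Keyspace, table, and
// contact point are hypothetical placeholders.
public class BatchWriter {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");
        PreparedStatement insert =
            session.prepare("INSERT INTO my_table (key, value) VALUES (?, ?)");

        BatchStatement batch = new BatchStatement();
        for (int i = 0; i < 100000; i++) {
            batch.add(insert.bind("key-" + i, "value-" + i));
            if (batch.size() == 100) {           // commit every 100 keys
                session.execute(batch);
                batch = new BatchStatement();    // start a fresh batch
            }
        }
        if (batch.size() > 0) {
            session.execute(batch);              // flush the remainder
        }
        cluster.close();
    }
}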

When I run my cluster and insert data from Kafka (1 keys/s), after 2 days
every node can't compact any more, so there are too many sstables.

I tried a major compaction to compact the sstables, but it took a very long
time, and the new sstables still aren't compacted automatically.


Tracing the logs, I see CMS GC running too often, roughly once every 30
minutes.

Could someone help me solve this problem?


--
曹志富
Mobile: 18611121927
Email: caozf.zh...@gmail.com
Weibo: http://weibo.com/boliza/


Re: Which Topology fits best ?

2015-01-25 Thread mck
NetworkTopologyStrategy gives you a better horizon and more flexibility
as you scale out, at least once you've gone past small-cluster problems
like wanting RF=3 in a 4-node, two-DC cluster.

IMO I'd go with "DC1:1,DC2:1".
~mck
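
E.g., creating such a keyspace from the Java driver; the keyspace name and
contact point are placeholders, and the DC names must match whatever your
snitch reports (check with `nodetool status`):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Create a keyspace with NetworkTopologyStrategy and one replica per DC.
// "my_keyspace", "DC1", and "DC2" are placeholders.
public class CreateKeyspace {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS my_keyspace WITH replication = "
              + "{'class': 'NetworkTopologyStrategy', 'DC1': 1, 'DC2': 1}");
        }
    }
}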



Re: Which Topology fits best ?

2015-01-25 Thread Eric Stevens
As far as I know they're effectively the same.  NetworkTopologyStrategy is
useful when you want to set up separate RF per DC, such as if you want to
have an analytics DC with lower RF to save money.

On Sun, Jan 25, 2015 at 8:01 AM, SEGALIS Morgan  wrote:

> Hi everyone,
> I need your precious advice one more time.
>
> I would like to create a 2-node cluster, with each node in a different
> datacenter but with the same provider; ping between the two servers is fast
> (~0.5 ms) and the bandwidth is great (~1 GB/s).
>
> Is org.apache.cassandra.locator.SimpleStrategy with the replication factor
> set to 2 good practice?
>
> Or should I use org.apache.cassandra.locator.NetworkTopologyStrategy with
> DC1:1 and DC2:1?
> (if this is the correct way to use NetworkTopologyStrategy; I'm not 100%
> sure)
>
>
> Thank you for your time.
>


Which Topology fits best ?

2015-01-25 Thread SEGALIS Morgan
Hi everyone,
I need your precious advice one more time.

I would like to create a 2-node cluster, with each node in a different
datacenter but with the same provider; ping between the two servers is fast
(~0.5 ms) and the bandwidth is great (~1 GB/s).

Is org.apache.cassandra.locator.SimpleStrategy with the replication factor
set to 2 good practice?

Or should I use org.apache.cassandra.locator.NetworkTopologyStrategy with
DC1:1 and DC2:1?
(if this is the correct way to use NetworkTopologyStrategy; I'm not 100% sure)


Thank you for your time.