Re: Prevent queries from OOM nodes

2012-10-01 Thread Віталій Тимчишин
It's not about columns, it's about rows; see the example statement.
In QueryProcessor#processStatement it reads all the rows into a list and then
calls list.size().
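
Roughly, the pattern being described is (a hypothetical simplification, not the actual
Cassandra source):

import java.util.ArrayList;
import java.util.List;

class CountSketch {
    // Stand-in for what a SELECT COUNT(*) ends up doing: every matching row is
    // materialized on the coordinator's heap just so size() can be called at the end.
    static long count(Iterable<Object> matchingRows) {
        List<Object> rows = new ArrayList<Object>();
        for (Object row : matchingRows)
            rows.add(row);      // all rows are kept in memory...
        return rows.size();     // ...only to take their count
    }
}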

2012/10/1 aaron morton 

> CQL will read everything into a List just to take the count afterwards.
>
>
> From 1.0 onwards count paginated reading the columns. What version are you
> on ?
>
> https://issues.apache.org/jira/browse/CASSANDRA-2894
>
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 26/09/2012, at 8:26 PM, Віталій Тимчишин  wrote:
>
> Actually, an easy way to bring Cassandra down is
> select count(*) from A limit 1000
> CQL will read everything into a List just to take the count afterwards.
>
> 2012/9/26 aaron morton 
>
>> Can you provide some information on the queries and the size of the data
>> they traversed ?
>>
>> The default maximum size for a single thrift message is 16MB, was it
>> larger than that ?
>> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L375
>>
>> Cheers
>>
>>
>> On 25/09/2012, at 8:33 AM, Bryce Godfrey 
>> wrote:
>>
>> Is there anything I can do on the configuration side to prevent nodes
>> from going OOM due to queries that will read large amounts of data and
>> exceed the heap available? 
>>
>> For the past few days we had some nodes consistently freezing/crashing
>> with OOM.  We got a heap dump into MAT and figured out the nodes were dying
>> due to some queries for a few extremely large data sets.  Tracked it back
>> to an app that just didn't prevent users from doing these large queries,
>> but it seems like Cassandra could be smart enough to guard against this
>> type of thing?
>>
>> Basically some kind of setting like "if the data to satisfy query >
>> available heap then throw an error to the caller and abort the query".  I would
>> much rather return errors to clients than crash a node, as the error is
>> easier to track down that way and resolve.
>>
>> Thanks.
>>
>>
>>
>
>
> --
> Best regards,
>  Vitalii Tymchyshyn
>
>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Re: downgrade from 1.1.4 to 1.0.X

2012-09-27 Thread Віталій Тимчишин
I suppose the way is to convert all SSTables to JSON, then install the previous
version, convert back, and load.

2012/9/24 Arend-Jan Wijtzes 

> On Thu, Sep 20, 2012 at 10:13:49AM +1200, aaron morton wrote:
> > No.
> > They use different minor file versions which are not backwards
> compatible.
>
> Thanks Aaron.
>
> Is upgradesstables capable of downgrading the files to 1.0.8?
> Looking for a way to make this work.
>
> Regards,
> Arend-Jan
>
>
> > On 18/09/2012, at 11:18 PM, Arend-Jan Wijtzes 
> wrote:
> >
> > > Hi,
> > >
> > > We are running Cassandra 1.1.4 and like to experiment with
> > > Datastax Enterprise which uses 1.0.8. Can we safely downgrade
> > > a production cluster or is it incompatible? Any special steps
> > > involved?
>
> --
> Arend-Jan Wijtzes -- Wiseguys -- www.wise-guys.nl
>



-- 
Best regards,
 Vitalii Tymchyshyn


Re: Prevent queries from OOM nodes

2012-09-26 Thread Віталій Тимчишин
Actually, an easy way to bring Cassandra down is
select count(*) from A limit 1000
CQL will read everything into a List just to take the count afterwards.

2012/9/26 aaron morton 

> Can you provide some information on the queries and the size of the data
> they traversed ?
>
> The default maximum size for a single thrift message is 16MB, was it
> larger than that ?
> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L375
>
> Cheers
>
>
> On 25/09/2012, at 8:33 AM, Bryce Godfrey 
> wrote:
>
> Is there anything I can do on the configuration side to prevent nodes from
> going OOM due to queries that will read large amounts of data and exceed
> the heap available? 
>
> For the past few days we had some nodes consistently freezing/crashing
> with OOM.  We got a heap dump into MAT and figured out the nodes were dying
> due to some queries for a few extremely large data sets.  Tracked it back
> to an app that just didn’t prevent users from doing these large queries,
> but it seems like Cassandra could be smart enough to guard against this
> type of thing?
>
> Basically some kind of setting like “if the data to satisfy query >
> available heap then throw an error to the caller and abort the query”.  I would
> much rather return errors to clients than crash a node, as the error is
> easier to track down that way and resolve.
>
> Thanks.
>
>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Re: any ways to have compaction use less disk space?

2012-09-25 Thread Віталій Тимчишин
See my comments inline

2012/9/25 Aaron Turner 

> On Mon, Sep 24, 2012 at 10:02 AM, Віталій Тимчишин 
> wrote:
> > Why so?
> > What are pluses and minuses?
> > As for me, I am looking at the number of files in a directory.
> > 700GB/512MB*5 (files per SSTable) = ~7000 files, which is OK in my view.
> > 700GB/5MB*5 = ~700,000 files, which is too much for a single directory, too
> much
> > memory used for SSTable data, and too huge a compaction queue (which leads to
> strange
> > pauses, I suppose because of the compactor thinking about what to compact next), ...
>
>
> Not sure why a lot of files is a problem... modern filesystems deal
> with that pretty well.
>

Maybe. Maybe it's not the filesystem, but Cassandra. I've seen compaction
slow down when the compaction queue is too large, and it can get too large
if you have a lot of SSTables. Note that each SSTable costs both FS metadata
(and the FS metadata cache can be limited) and Cassandra in-memory data.
Anyway, a performance test would be great in this area; otherwise it's all
speculation.



> Really large sstables mean that compactions now are taking a lot more
> disk IO and time to complete.


As for me, this point is valid only when your flushes are small. Otherwise
you still need to compact the whole key range a flush covers, no matter whether
it is one large file or multiple small ones. One large file can even be
cheaper to compact.


> Remember, Leveled Compaction is more
> disk IO intensive, so using large sstables makes that even worse.
> This is a big reason why the default is 5MB. Also, each level is 10x
> the size as the previous level.  Also, for level compaction, you need
> 10x the sstable size worth of free space to do compactions.  So now
> you need 5GB of free disk, vs 50MB of free disk.
>

I really don't think 5GB of free space is too much :)


>
> Also, if you're doing deletes in those CF's, that old, deleted data is
> going to stick around a LOT longer with 512MB files, because it can't
> get deleted until you have 10x512MB files to compact to level 2.
> Heaven forbid it doesn't get deleted then because each level is 10x
> bigger so you end up waiting a LOT longer to actually delete that data
> from disk.
>

But if I have small SSTables, all my data goes to the high levels (the 4th for me
when I had the 128MB setting), and it also takes time for updates to reach
that level. I am not sure which way is faster.
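
As a rough back-of-the-envelope under the 10x-per-level rule quoted above (level L
holds about sstable_size * 10^L of data), for ~700GB per node:

  sstable_size =   5MB:    5MB * 10^5 = 500GB  -> data spills into levels 5-6
  sstable_size = 128MB:  128MB * 10^4 = 1.3TB  -> data tops out around level 4
  sstable_size = 512MB:  512MB * 10^3 = 512GB  -> data tops out around level 3-4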


>
> Now, if you're using SSD's then larger sstables is probably doable,
> but even then I'd guesstimate 50MB is far more reasonable than 512MB.
>

I don't think SSDs are great for writes/compaction. Cassandra does these in
a streaming fashion, and regular HDDs are faster than SSDs for linear
read/write. SSDs are good for random access, which for Cassandra means reads.

P.S. I still think my way is better, yet it would be great to perform some
real tests.


> -Aaron
>
>
> > 2012/9/23 Aaron Turner 
> >>
> >> On Sun, Sep 23, 2012 at 8:18 PM, Віталій Тимчишин 
> >> wrote:
> >> > If you think about space, use Leveled compaction! This won't only
> allow
> >> > you
> >> > to fill more space, but also will shrink your data much faster in case
> of
> >> > updates. Size compaction can give you 3x-4x more space used than there
> >> > are
> >> > live data. Consider the following (our simplified) scenario:
> >> > 1) The data is updated weekly
> >> > 2) Each week a large SSTable is written (say, 300GB) after full update
> >> > processing.
> >> > 3) In 3 weeks you will have 1.2TB of data in 3 large SSTables.
> >> > 4) Only after 4th week they all will be compacted into one 300GB
> >> > SSTable.
> >> >
> >> > Leveled compaction has tamed space for us. Note that you should set
> >> > sstable_size_in_mb to reasonably high value (it is 512 for us with
> >> > ~700GB
> >> > per node) to prevent creating a lot of small files.
> >>
> >> 512MB per sstable?  Wow, that's freaking huge.  From my conversations
> >> with various developers 5-10MB seems far more reasonable.   I guess it
> >> really depends on your usage patterns, but that seems excessive to me-
> >> especially as sstables are promoted.
> >>
> >
> > --
> > Best regards,
> >  Vitalii Tymchyshyn
>
>
>
> --
> Aaron Turner
> http://synfin.net/ Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix &
> Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
> -- Benjamin Franklin
> "carpe diem quam minimum credula postero"
>



-- 
Best regards,
 Vitalii Tymchyshyn


Re: any ways to have compaction use less disk space?

2012-09-24 Thread Віталій Тимчишин
Why so?
What are pluses and minuses?
As for me, I am looking at the number of files in a directory.
700GB/512MB*5 (files per SSTable) = ~7000 files, which is OK in my view.
700GB/5MB*5 = ~700,000 files, which is too much for a single directory, too much
memory used for SSTable data, and too huge a compaction queue (which leads to strange
pauses, I suppose because of the compactor thinking about what to compact next), ...

2012/9/23 Aaron Turner 

> On Sun, Sep 23, 2012 at 8:18 PM, Віталій Тимчишин 
> wrote:
> > If you think about space, use Leveled compaction! This won't only allow
> you
> > to fill more space, but also will shrink your data much faster in case of
> > updates. Size compaction can give you 3x-4x more space used than there
> are
> > live data. Consider the following (our simplified) scenario:
> > 1) The data is updated weekly
> > 2) Each week a large SSTable is written (say, 300GB) after full update
> > processing.
> > 3) In 3 weeks you will have 1.2TB of data in 3 large SSTables.
> > 4) Only after 4th week they all will be compacted into one 300GB SSTable.
> >
> > Leveled compaction has tamed space for us. Note that you should set
> > sstable_size_in_mb to reasonably high value (it is 512 for us with ~700GB
> > per node) to prevent creating a lot of small files.
>
> 512MB per sstable?  Wow, that's freaking huge.  From my conversations
> with various developers 5-10MB seems far more reasonable.   I guess it
> really depends on your usage patterns, but that seems excessive to me-
> especially as sstables are promoted.
>
>
-- 
Best regards,
 Vitalii Tymchyshyn


Re: any ways to have compaction use less disk space?

2012-09-23 Thread Віталій Тимчишин
If you think about space, use Leveled compaction! This won't only allow you
to fill more space, but will also shrink your data much faster in case of
updates. Size compaction can give you 3x-4x more space used than there is
live data. Consider the following (our simplified) scenario:
1) The data is updated weekly
2) Each week a large SSTable is written (say, 300GB) after full update
processing.
3) In 3 weeks you will have 1.2TB of data in 3 large SSTables.
4) Only after 4th week they all will be compacted into one 300GB SSTable.

Leveled compaction has tamed space for us. Note that you should set
sstable_size_in_mb
to a reasonably high value (it is 512 for us with ~700GB per node) to prevent
creating a lot of small files.

Best regards, Vitalii Tymchyshyn.

2012/9/20 Hiller, Dean 

> While diskspace is cheap, nodes are not that cheap, and usually systems
> have a 1T limit on each node which means we would love to really not add
> more nodes until we hit 70% disk space instead of the normal 50% that we
> have read about due to compaction.
>
> Is there any way to use less disk space during compactions?
> Is there any work being done so that compactions take less space in the
> future meaning we can buy less nodes?
>
> Thanks,
> Dean
>



-- 
Best regards,
 Vitalii Tymchyshyn


Re: persistent compaction issue (1.1.4 and 1.1.5)

2012-09-19 Thread Віталій Тимчишин
I did see problems with schema agreement on 1.1.4, but they went away
after a rolling restart (BTW: it would still be good to check 'describe schema'
on the unreachable node). The same rolling restart helped to force compactions after
moving to Leveled compaction. If your compactions still don't progress, you can
try removing the *.json manifest files from the data directory of the stopped node to
force moving all SSTables back to level 0.

Best regards, Vitalii Tymchyshyn

2012/9/19 Michael Kjellman 

> Potentially the pending compactions are a symptom and not the root
> cause/problem.
>
> When updating a 3rd column family with a larger sstable_size_in_mb it
> looks like the schema may not be in a good state
>
> [default@] UPDATE COLUMN FAMILY screenshots WITH
> compaction_strategy=LeveledCompactionStrategy AND
> compaction_strategy_options={sstable_size_in_mb: 200};
> 290cf619-57b0-3ad1-9ae3-e313290de9c9
> Waiting for schema agreement...
> Warning: unreachable nodes 10.8.30.102
> The schema has not settled in 10
> seconds; further migrations are ill-advised until it does.
> Versions are UNREACHABLE:[10.8.30.102],
> 290cf619-57b0-3ad1-9ae3-e313290de9c9:[10.8.30.15, 10.8.30.14, 10.8.30.13,
> 10.8.30.103, 10.8.30.104, 10.8.30.105, 10.8.30.106],
> f1de54f5-8830-31a6-9cdd-aaa6220cccd1:[10.8.30.101]
>
>
> However, tpstats looks good. And the schema changes eventually do get
> applied on *all* the nodes (even the ones that seem to have different
> schema versions). There are no communications issues between the nodes and
> they are all in the same rack
>
> root@:~# nodetool tpstats
> Pool NameActive   Pending  Completed   Blocked
> All time blocked
> ReadStage 0 01254592 0
> 0
> RequestResponseStage  0 09480827 0
> 0
> MutationStage 0 08662263 0
> 0
> ReadRepairStage   0 0 339158 0
> 0
> ReplicateOnWriteStage 0 0  0 0
> 0
> GossipStage   0 01469197 0
> 0
> AntiEntropyStage  0 0  0 0
> 0
> MigrationStage0 0   1808 0
> 0
> MemtablePostFlusher   0 0248 0
> 0
> StreamStage   0 0  0 0
> 0
> FlushWriter   0 0248 0
> 4
> MiscStage 0 0  0 0
> 0
> commitlog_archiver0 0  0 0
> 0
> InternalResponseStage 0 0   5286 0
> 0
> HintedHandoff 0 0 21 0
> 0
>
> Message type   Dropped
> RANGE_SLICE  0
> READ_REPAIR  0
> BINARY   0
> READ 0
> MUTATION 0
> REQUEST_RESPONSE 0
>
> So I'm guessing maybe the different schema versions may be potentially
> stopping compactions? Will compactions still happen if there are different
> versions of the schema?
>
>
>
>
>
> On 9/18/12 11:38 AM, "Michael Kjellman"  wrote:
>
> >Thanks, I just modified the schema on the worse offending column family
> >(as determined by the .json) from 10MB to 200MB.
> >
> >Should I kick off a compaction on this cf now/repair?/scrub?
> >
> >Thanks
> >
> >-michael
> >
> >From: Віталій Тимчишин <tiv...@gmail.com>
> >Reply-To: user@cassandra.apache.org
> >To: user@cassandra.apache.org
> >Subject: Re: persistent compaction issue (1.1.4 and 1.1.5)
> >
> >I've started to use LeveledCompaction some time ago and from my
> >experience this indicates some SST on lower levels than they should be.
> >The compaction is going, moving them up level by level, but total count
> >does not change as new data goes in.
> >The numbers are pretty high as for me. Such numbers mean a lot of files
> >(over 100K in single directory) and a lot of thinking for compaction
> >executor to decide what to compact next. I can see numbers like 5K-10K
> >and still think this is a high number.

Re: Disk configuration in new cluster node

2012-09-18 Thread Віталій Тимчишин
Network also matters. It would take a lot of time to send 6TB over a 1Gb
link, even fully saturating it. IMHO you can try with 10Gb, but you will
need to raise your streaming/compaction limits a lot.
Also, you will need to ensure that your compaction can keep up. It is often
done in one thread and I am not sure that will be enough for you. As for
parallel compaction, I don't know its exact limitations or whether it would
work in your case.

2012/9/18 Casey Deccio 

> On Tue, Sep 18, 2012 at 1:54 AM, aaron morton wrote:
>
>> each with several disks having large capacity, totaling 10 - 12 TB.  Is
>> this (another) bad idea?
>>
>> Yes. Very bad.
>> If you had 6TB on average system with spinning disks you would measure
>> duration of repairs and compactions in days.
>>
>> If you want to store 12 TB of data you will need more machines.
>>
>>
>
> Would it help if I partitioned the computing resources of my physical
> machines into VMs?  For example, I put four VMs on each of three virtual
> machines, each with a dedicated 2TB drive.  I can now have four tokens in
> the ring and a RF of 3.  And of course, I can arrange them into a way that
> makes the most sense.  Is this getting any better, or am I missing the
> point?
>
> Casey
>



-- 
Best regards,
 Vitalii Tymchyshyn


Re: persistent compaction issue (1.1.4 and 1.1.5)

2012-09-18 Thread Віталій Тимчишин
I started to use LeveledCompaction some time ago, and from my experience
this indicates some SSTables sitting on lower levels than they should be. The compaction
is going, moving them up level by level, but the total count does not change as
new data comes in.
The numbers look pretty high to me. Such numbers mean a lot of files
(over 100K in a single directory) and a lot of thinking for the compaction
executor to decide what to compact next. I can see numbers like 5K-10K and
still think that is a high number. If I were you, I'd increase sstable_size_in_mb
to 10-20 times what it is now.

2012/9/17 Michael Kjellman 

> Hi All,
>
> I have an issue where each one of my nodes (currently all running at
> 1.1.5) is reporting around 30,000 pending compactions. I understand that a
> pending compaction doesn't necessarily mean it is a scheduled task however
> I'm confused why this behavior is occurring. It is the same on all nodes,
> occasionally goes down 5k pending compaction tasks, and then returns to
> 25,000-35,000 compaction tasks pending.
>
> I have tried a repair operation/scrub operation on two of the nodes and
> while compactions initially happen the number of pending compactions does
> not decrease.
>
> Any ideas? Thanks for your time.
>
> Best,
> michael
>
>
> 'Like' us on Facebook for exclusive content and other resources on all
> Barracuda Networks solutions.
>
> Visit http://barracudanetworks.com/facebook
>
>
>
>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Re: Practical node size limits

2012-09-05 Thread Віталій Тимчишин
You can try increasing the streaming throttle.

2012/9/4 Dustin Wenz 

> I'm following up on this issue, which I've been monitoring for the last
> several weeks. I thought people might find my observations interesting.
>
> Ever since increasing the heap size to 64GB, we've had no OOM conditions
> that resulted in a JVM termination. Our nodes have around 2.5TB of data
> each, and the replication factor is four. IO on the cluster seems to be
> fine, though I haven't been paying particular attention to any GC hangs.
>
> The bottleneck now seems to be the repair time. If any node becomes too
> inconsistent, or needs to be replaced, the rebuild time is over a week.
> That issue alone makes this cluster configuration unsuitable for production
> use.
>
> - .Dustin
>
> On Jul 30, 2012, at 2:04 PM, Dustin Wenz  wrote:
>
> > Thanks for the pointer! It sounds likely that's what I'm seeing. CFStats
> reports that the bloom filter size is currently several gigabytes. Is there
> any way to estimate how much heap space a repair would require? Is it a
> function of simply adding up the filter file sizes, plus some fraction of
> neighboring nodes?
> >
> > I'm still curious about the largest heap sizes that people are running
> with on their deployments. I'm considering increasing ours to 64GB (with
> 96GB physical memory) to see where that gets us. Would it be necessary to
> keep the young-gen size small to avoid long GC pauses? I also suspect that
> I may need to keep my memtable sizes small to avoid long flushes; maybe in
> the 1-2GB range.
> >
> >   - .Dustin
> >
> > On Jul 29, 2012, at 10:45 PM, Edward Capriolo 
> wrote:
> >
> >> Yikes. You should read:
> >>
> >> http://wiki.apache.org/cassandra/LargeDataSetConsiderations
> >>
> >> Essentially what it sounds like you are now running into is this:
> >>
> >> The BloomFilters for each SSTable must exist in main memory. Repair
> >> tends to create some extra data which normally gets compacted away
> >> later.
> >>
> >> Your best bet is to temporarily raise the Xmx heap and adjust the
> >> index sampling size. If you need to save the data (if it is just test
> >> data you may want to give up and start fresh)
> >>
> >> Generally the issue with the large disk configurations it is hard to
> >> keep a good ram/disk ratio. Then most reads turn into disk seeks and
> >> the throughput is low. I get the vibe people believe large stripes are
> >> going to help Cassandra. The issue is that stripes generally only
> >> increase sequential throughput, but Cassandra is a random read system.
> >>
> >> How much ram/disk you need is case dependent but 1/5 ratio of RAM to
> >> disk is where I think most people want to be, unless their system is
> >> carrying SSD disks.
> >>
> >> Again you have to keep your bloom filters in java heap memory so any
> >> design that tries to create a quadrillion small rows is going to have
> >> memory issues as well.
> >>
> >> On Sun, Jul 29, 2012 at 10:40 PM, Dustin Wenz 
> wrote:
> >>> I'm trying to determine if there are any practical limits on the
> amount of data that a single node can handle efficiently, and if so,
> whether I've hit that limit or not.
> >>>
> >>> We've just set up a new 7-node cluster with Cassandra 1.1.2 running
> under OpenJDK6. Each node is 12-core Xeon with 24GB of RAM and is connected
> to a stripe of 10 3TB disk mirrors (a total of 20 spindles each) and
> connected via dual SATA-3 interconnects. I can read and write around
> 900MB/s sequentially on the arrays. I started out with Cassandra tuned with
> all-default values, with the exception of the compaction throughput which
> was increased from 16MB/s to 100MB/s. These defaults will set the heap size
> to 6GB.
> >>>
> >>> Our schema is pretty simple; only 4 column families and each has one
> secondary index. The replication factor was set to four, and compression
> disabled. Our access patterns are intended to be about equal numbers of
> inserts and selects, with no updates, and the occasional delete.
> >>>
> >>> The first thing we did was begin to load data into the cluster. We
> could perform about 3000 inserts per second, which stayed mostly flat.
> Things started to go wrong around the time the nodes exceeded 800GB.
> Cassandra began to generate a lot of "mutations messages dropped" warnings,
> and was complaining that the heap was over 75% capacity.
> >>>
> >>> At that point, we stopped all activity on the cluster and attempted a
> repair. We did this so we could be sure that the data was fully consistent
> before continuing. Our mistake was probably trying to repair all of the
> nodes simultaneously - within an hour, Java terminated on one of the nodes
> with a heap out-of-memory message. I then increased all of the heap sizes
> to 8GB, and reduced the heap_newsize to 800MB. All of the nodes were
> restarted, and there was no no outside activity on the cluster. I then
> began a repair on a single node. Within a few hours, it OOMed again and
> exited. I then increased

Re: Failing operations & repair

2012-06-09 Thread Віталій Тимчишин
Thanks a lot. I was not sure if the coordinator somehow tries to "roll back"
transactions that failed to reach their consistency level.
(Though I could not imagine a method to do this without 2-phase commit :) )

2012/6/8 aaron morton 

> I am making some cassandra presentations in Kyiv and would like to check
> that I am telling people the truth :)
>
> Thanks for spreading the word :)
>
> 1) A failed (from the client-side view) operation may still be applied to the cluster
>
> Yes.
> If you fail with UnavailableException it's because from the coordinators
> view of the cluster there is less than CL nodes available. So retry.
> Somewhat similar story with TimedOutException.
>
> 2) The coordinator does not try anything to "roll back" an operation that failed
> because it was processed by fewer than the consistency-level number of nodes.
>
> Correct.
>
> 3) Hinted handoff works only for successful operations.
>
> HH will be stored if the coordinator proceeds with the request.
> In 1.X HH is stored on the coordinator if a replica is down when the
> request starts and if the node does not reply in rpc_timeout.
>
> 4) Counters are not reliable because of (1)
>
> If you get a TimedOutException when writing a counter you should not
> re-send the request.
>
> 5) Read-repair may help to propagate an operation that failed its
> consistency level, but was persisted to some nodes.
>
> Yes. It works in the background, by default is only enabled on 10% of
> requests.
> Note that RR is not the same as the Consistency Level for the read. If you work
> at a CL > ONE, the results from CL nodes are always compared and differences
> resolved. RR is concerned with the replicas not involved in the CL read.
>
> 6) Manual repair is still needed because of (2) and (3)
>
> Manual repair is *the* way to achieve consistency of data on disk. HH and
> RR are optimisations designed to reduce the chance of a Digest Mismatch
> during a read with CL > ONE.
> It is also essential for distributing Tombstones before they are purged by
> compaction.
>
> P.S. If some points apply only to some cassandra versions, I will be happy
> to know this too.
>
> Assume everyone for version 1.X
>
> Thanks
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 8/06/2012, at 1:20 AM, Віталій Тимчишин wrote:
>
> Hello.
>
> I am making some cassandra presentations in Kyiv and would like to check
> that I am telling people the truth :)
> Could the community tell me if the following points are true:
> 1) A failed (from the client-side view) operation may still be applied to the cluster
> 2) The coordinator does not try anything to "roll back" an operation that failed
> because it was processed by fewer than the consistency-level number of nodes.
> 3) Hinted handoff works only for successful operations.
> 4) Counters are not reliable because of (1)
> 5) Read-repair may help to propagate an operation that failed its
> consistency level, but was persisted to some nodes.
> 6) Manual repair is still needed because of (2) and (3)
>
> P.S. If some points apply only to some cassandra versions, I will be happy
> to know this too.
> --
> Best regards,
>  Vitalii Tymchyshyn
>
>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Failing operations & repair

2012-06-07 Thread Віталій Тимчишин
Hello.

I am making some cassandra presentations in Kyiv and would like to check
that I am telling people the truth :)
Could the community tell me if the following points are true:
1) A failed (from the client-side view) operation may still be applied to the cluster
(see the retry sketch below this list)
2) The coordinator does not try anything to "roll back" an operation that failed
because it was processed by fewer than the consistency-level number of nodes.
3) Hinted handoff works only for successful operations.
4) Counters are not reliable because of (1)
5) Read-repair may help to propagate an operation that failed its
consistency level, but was persisted to some nodes.
6) Manual repair is still needed because of (2) and (3)
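
To illustrate points 1-4 from the client side, here is a minimal retry sketch over the
Thrift API (the exception types are real; the surrounding plumbing is an assumption for
illustration only):

import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;

public class WriteRetry {
    // A failure does not mean the write was rolled back (points 1 and 2), so retrying is
    // only safe for idempotent writes: a plain column insert with the same name, value and
    // timestamp can be re-sent, a counter add cannot (point 4).
    static void insertWithRetry(Cassandra.Client client, ByteBuffer key, ColumnParent parent,
                                Column column, int attempts) throws Exception {
        for (int i = 0; i < attempts; i++) {
            try {
                client.insert(key, parent, column, ConsistencyLevel.QUORUM);
                return;
            } catch (UnavailableException e) {
                // fewer than CL replicas looked alive to the coordinator, nothing was written: retry
            } catch (TimedOutException e) {
                // the write may or may not have been applied; re-sending the same insert is harmless
            }
        }
        throw new Exception("write still failing after " + attempts + " attempts");
    }
}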

P.S. If some points apply only to some cassandra versions, I will be happy
to know this too.
-- 
Best regards,
 Vitalii Tymchyshyn


Re: Query on how to count the total number of rowkeys and columns in them

2012-05-24 Thread Віталій Тимчишин
You should read multiple "batches", specifying the last key received from
the previous batch as the first key of the next one.
For large databases I'd recommend a statistical approach (if it's
feasible). With the random partitioner it works well.
Don't read the whole DB. Knowing the whole key space, you can read a part of it, get
the number of records per unit of key space (<1), then multiply by the key space size
to get your total.
You can even implement an algorithm that runs until the required precision
is obtained (simply compare your previous and current estimate after each batch).
For me it's enough to read ~1% of the DB to get a good result.
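
A minimal sketch of the paging part over the raw Thrift API (host, keyspace and column
family names are placeholders; extending it to read only a fraction of the key space and
extrapolate is straightforward):

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class KeyCounter {
    public static void main(String[] args) throws Exception {
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("MyKeyspace");                         // placeholder keyspace

        ColumnParent parent = new ColumnParent("MyCF");             // placeholder column family
        SlicePredicate noColumns = new SlicePredicate();            // keys only, no column data
        noColumns.setSlice_range(new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 0));

        final int page = 1000;
        long total = 0;
        ByteBuffer start = ByteBuffer.allocate(0);                  // empty key = start of the ring
        boolean first = true;
        while (true) {
            KeyRange range = new KeyRange(page);
            range.setStart_key(start);
            range.setEnd_key(ByteBuffer.allocate(0));
            List<KeySlice> batch = client.get_range_slices(parent, noColumns, range, ConsistencyLevel.ONE);
            total += batch.size();
            if (!first)
                total--;                 // the start key was already counted in the previous batch
            if (batch.size() < page)
                break;                   // a short page means we reached the end of the ring
            start = batch.get(batch.size() - 1).key;   // last key becomes the next start key
            first = false;
        }
        System.out.println("row keys: " + total);
        transport.close();
    }
}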

Best regards, Vitalii Tymchyshyn

2012/5/24 Prakrati Agrawal 

>  Hi
>
>
>
> I am trying to learn Cassandra and I have one doubt. I am using the Thrift
> API, to count the number of row keys I am using KeyRange to specify the row
> keys. To count all of them, I specify the start and end as “new byte[0]”.
> But the count is set to 100 by default. How do I use this method to count
> the keys if I don’t know the actual number of keys in my Cassandra
> database? Please help me
>
>
>
-- 
Best regards,
 Vitalii Tymchyshyn


Re: Cassandra dying when gets many deletes

2012-05-06 Thread Віталій Тимчишин
Thanks a lot. It seems that a fix is committed now and will appear in
the next release, so I won't need my own patched Cassandra :)

Best regards, Vitalii Tymchyshyn.

2012/5/3 Andrey Kolyadenko 

> Hi Vitalii,
>
> I sent patch.
>
>
> 2012/4/24 Віталій Тимчишин 
>
>> Glad you've got it working properly. I've tried to make as "local"
>> changes as possible, so changed only single value calculation. But it's
>> possible your way is better and will be accepted by cassandra maintainer.
>> Could you attach your patch to the ticket? I'd like for any fix to be
>> applied to the trunk since currently I have to make my own patched build
>> each time I upgrade because of the bug.
>>
>> Best regards, Vitalii Tymchyshyn
>>
>> On 25 April 2012 at 09:08, crypto five wrote:
>>
>> I agree with your observations.
>>> From another hand I found that ColumnFamily.size() doesn't calculate
>>> object size correctly. It doesn't count two fields Objects sizes and
>>> returns 0 if there is no object in columns container.
>>> I increased initial size variable value to 24 which is size of two
>>> objects(I didn't now what's correct value), and cassandra started
>>> calculating live ratio correctly, increasing trhouhput value and flushing
>>> memtables.
>>>
>>> On Tue, Apr 24, 2012 at 2:00 AM, Vitalii Tymchyshyn wrote:
>>>
>>>>
>>>> Hello.
>>>>
>>>> For me " there are no dirty column families" in your message tells it's
>>>> possibly the same problem.
>>>> The issue is that column families that gets full row deletes only do
>>>> not get ANY SINGLE dirty byte accounted and so can't be picked by flusher.
>>>> Any ratio can't help simply because it is multiplied by 0. Check your
>>>> cfstats.
>>>>
>>>> On 24.04.12 09:54, crypto five wrote:
>>>>
>>>> Thank you Vitalii.
>>>>
>>>>  Looking at the Jonathan's answer to your patch I think it's probably
>>>> not my case. I see that LiveRatio is calculated in my case, but
>>>> calculations look strange:
>>>>
>>>>  WARN [MemoryMeter:1] 2012-04-23 23:29:48,430 Memtable.java (line 181)
>>>> setting live ratio to maximum of 64 instead of Infinity
>>>>  INFO [MemoryMeter:1] 2012-04-23 23:29:48,432 Memtable.java (line 186)
>>>> CFS(Keyspace='lexems', ColumnFamily='countersCF') liveRatio is 64.0
>>>> (just-counted was 64.0).  calculation took 63355ms for 0 columns
>>>>
>>>>  Looking at the comments in the code: "If it gets higher than 64
>>>> something is probably broken.", looks like it's probably the problem.
>>>> Not sure how to investigate it.
>>>>
>>>> 2012/4/23 Віталій Тимчишин 
>>>>
>>>>> See https://issues.apache.org/jira/browse/CASSANDRA-3741
>>>>> I did post a fix there that helped me.
>>>>>
>>>>>
>>>>> 2012/4/24 crypto five 
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>  I have 50 millions of rows in column family on 4G RAM box. I
>>>>>> allocatedf 2GB to cassandra.
>>>>>> I have program which is traversing this CF and cleaning some data
>>>>>> there, it generates about 20k delete statements per second.
>>>>>> After about of 3 millions deletions cassandra stops responding to
>>>>>> queries: it doesn't react to CLI, nodetool etc.
>>>>>> I see in the logs that it tries to free some memory but can't even if
>>>>>> I wait whole day.
>>>>>> Also I see following in  the logs:
>>>>>>
>>>>>>  INFO [ScheduledTasks:1] 2012-04-23 18:38:13,333 StorageService.java
>>>>>> (line 2647) Unable to reduce heap usage since there are no dirty column
>>>>>> families
>>>>>>
>>>>>>  When I am looking at memory dump I see that memory goes to
>>>>>> ConcurrentSkipListMap(10%), HeapByteBuffer(13%), DecoratedKey(6%),
>>>>>> int[](6%), BigInteger(8.2%), ConcurrentSkipListMap$HeadIndex(7.2%),
>>>>>> ColumnFamily(6.5%), ThreadSafeSortedColumns(13.7%), long[](5.9%).
>>>>>>
>>>>>>  What can I do to make cassandra stop dying?
>>>>>> Why it can't free the memory?
>>>>>> Any ideas?
>>>>>>
>>>>>>  Thank you.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>   --
>>>>> Best regards,
>>>>>  Vitalii Tymchyshyn
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Best regards,
>>  Vitalii Tymchyshyn
>>
>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Re: Cassandra dying when gets many deletes

2012-04-24 Thread Віталій Тимчишин
Glad you've got it working properly. I've tried to make as "local" changes
as possible, so I changed only the single value calculation. But it's possible
your way is better and will be accepted by the Cassandra maintainers. Could you
attach your patch to the ticket? I'd like any fix to be applied to
trunk, since currently I have to make my own patched build each time I
upgrade because of the bug.

Best regards, Vitalii Tymchyshyn

On 25 April 2012 at 09:08, crypto five wrote:

> I agree with your observations.
> From another hand I found that ColumnFamily.size() doesn't calculate
> object size correctly. It doesn't count two fields Objects sizes and
> returns 0 if there is no object in columns container.
> I increased initial size variable value to 24 which is size of two
> objects(I didn't now what's correct value), and cassandra started
> calculating live ratio correctly, increasing trhouhput value and flushing
> memtables.
>
> On Tue, Apr 24, 2012 at 2:00 AM, Vitalii Tymchyshyn wrote:
>
>>
>> Hello.
>>
>> For me " there are no dirty column families" in your message tells it's
>> possibly the same problem.
>> The issue is that column families that gets full row deletes only do not
>> get ANY SINGLE dirty byte accounted and so can't be picked by flusher. Any
>> ratio can't help simply because it is multiplied by 0. Check your cfstats.
>>
>> On 24.04.12 09:54, crypto five wrote:
>>
>> Thank you Vitalii.
>>
>>  Looking at the Jonathan's answer to your patch I think it's probably
>> not my case. I see that LiveRatio is calculated in my case, but
>> calculations look strange:
>>
>>  WARN [MemoryMeter:1] 2012-04-23 23:29:48,430 Memtable.java (line 181)
>> setting live ratio to maximum of 64 instead of Infinity
>>  INFO [MemoryMeter:1] 2012-04-23 23:29:48,432 Memtable.java (line 186)
>> CFS(Keyspace='lexems', ColumnFamily='countersCF') liveRatio is 64.0
>> (just-counted was 64.0).  calculation took 63355ms for 0 columns
>>
>>  Looking at the comments in the code: "If it gets higher than 64
>> something is probably broken.", looks like it's probably the problem.
>> Not sure how to investigate it.
>>
>> 2012/4/23 Віталій Тимчишин 
>>
>>> See https://issues.apache.org/jira/browse/CASSANDRA-3741
>>> I did post a fix there that helped me.
>>>
>>>
>>> 2012/4/24 crypto five 
>>>
>>>> Hi,
>>>>
>>>>  I have 50 millions of rows in column family on 4G RAM box. I
>>>> allocatedf 2GB to cassandra.
>>>> I have program which is traversing this CF and cleaning some data
>>>> there, it generates about 20k delete statements per second.
>>>> After about of 3 millions deletions cassandra stops responding to
>>>> queries: it doesn't react to CLI, nodetool etc.
>>>> I see in the logs that it tries to free some memory but can't even if I
>>>> wait whole day.
>>>> Also I see following in  the logs:
>>>>
>>>>  INFO [ScheduledTasks:1] 2012-04-23 18:38:13,333 StorageService.java
>>>> (line 2647) Unable to reduce heap usage since there are no dirty column
>>>> families
>>>>
>>>>  When I am looking at memory dump I see that memory goes to
>>>> ConcurrentSkipListMap(10%), HeapByteBuffer(13%), DecoratedKey(6%),
>>>> int[](6%), BigInteger(8.2%), ConcurrentSkipListMap$HeadIndex(7.2%),
>>>> ColumnFamily(6.5%), ThreadSafeSortedColumns(13.7%), long[](5.9%).
>>>>
>>>>  What can I do to make cassandra stop dying?
>>>> Why it can't free the memory?
>>>> Any ideas?
>>>>
>>>>  Thank you.
>>>>
>>>
>>>
>>>
>>>   --
>>> Best regards,
>>>  Vitalii Tymchyshyn
>>>
>>
>>
>>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Re: Cassandra dying when gets many deletes

2012-04-23 Thread Віталій Тимчишин
See https://issues.apache.org/jira/browse/CASSANDRA-3741
I did post a fix there that helped me.

2012/4/24 crypto five 

> Hi,
>
> I have 50 millions of rows in column family on 4G RAM box. I allocatedf
> 2GB to cassandra.
> I have program which is traversing this CF and cleaning some data there,
> it generates about 20k delete statements per second.
> After about of 3 millions deletions cassandra stops responding to queries:
> it doesn't react to CLI, nodetool etc.
> I see in the logs that it tries to free some memory but can't even if I
> wait whole day.
> Also I see following in  the logs:
>
> INFO [ScheduledTasks:1] 2012-04-23 18:38:13,333 StorageService.java (line
> 2647) Unable to reduce heap usage since there are no dirty column families
>
> When I am looking at memory dump I see that memory goes to
> ConcurrentSkipListMap(10%), HeapByteBuffer(13%), DecoratedKey(6%),
> int[](6%), BigInteger(8.2%), ConcurrentSkipListMap$HeadIndex(7.2%),
> ColumnFamily(6.5%), ThreadSafeSortedColumns(13.7%), long[](5.9%).
>
> What can I do to make cassandra stop dying?
> Why it can't free the memory?
> Any ideas?
>
> Thank you.
>



-- 
Best regards,
 Vitalii Tymchyshyn


Re: swap grows

2012-04-15 Thread Віталій Тимчишин
BTW: Are you sure the system is doing something wrong? The system may save some pages
to swap without removing them from RAM, simply to have the possibility of dropping
them quickly later if needed.

2012/4/14 ruslan usifov 

> Hello
>
> We have 6 node cluster (cassandra 0.8.10). On one node i increase java
> heap size to 6GB, and now at this node begin grows swap, but system have
> about 3GB of free memory:
>
>
> root@6wd003:~# free
>  total   used   free sharedbuffers cached
> Mem:  24733664   217028123030852  0   6792   13794724
> -/+ buffers/cache:7901296   16832368
> Swap:  1998840   23521996488
>
>
> And swap space slowly grows, but i misunderstand why?
>
>
> PS: We have JNA mlock, and set  vm.swappiness = 0
> PS: OS ubuntu 10.0.4(2.6.32-40-generic)
>
>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Re: [RELEASE CANDIDATE] Apache Cassandra 1.1.0-rc1 released

2012-04-15 Thread Віталій Тимчишин
Is the on-disk format already settled? I've thought about trying the betas, but
the impossibility of upgrading to the 1.1 release stopped me.

2012/4/13 Sylvain Lebresne 

> The Cassandra team is pleased to announce the release of the first release
> candidate for the future Apache Cassandra 1.1.
>
>
-- 
Best regards,
 Vitalii Tymchyshyn


Re: Write performance compared to Postgresql

2012-04-03 Thread Віталій Тимчишин
Hello.

We are using the Java async Thrift client.
As for Ruby, it seems you need to use something like
http://www.mikeperham.com/2010/02/09/cassandra-and-eventmachine/
(Not sure, as I know nothing about Ruby).

Best regards, Vitalii Tymchyshyn


2012/4/3 Jeff Williams 

> Vitalii,
>
> Yep, that sounds like a good idea. Do you have any more information about
> how you're doing that? Which client?
>
> Because even with 3 concurrent client nodes, my single postgresql server
> is still out performing my 2 node cassandra cluster, although the gap is
> narrowing.
>
> Jeff
>
> On Apr 3, 2012, at 4:08 PM, Vitalii Tymchyshyn wrote:
>
> > Note that having tons of TCP connections is not good. We are using async
> client to issue multiple calls over single connection at same time. You can
> do the same.
> >
> > Best regards, Vitalii Tymchyshyn.
> >
> > On 03.04.12 16:18, Jeff Williams wrote:
> >> Ok, so you think the write speed is limited by the client and protocol,
> rather than the cassandra backend? This sounds reasonable, and fits with
> our use case, as we will have several servers writing. However, a bit
> harder to test!
> >>
> >> Jeff
> >>
> >> On Apr 3, 2012, at 1:27 PM, Jake Luciani wrote:
> >>
> >>> Hi Jeff,
> >>>
> >>> Writing serially over one connection will be slower. If you run many
> threads hitting the server at once you will see throughput improve.
> >>>
> >>> Jake
> >>>
> >>>
> >>>
> >>> On Apr 3, 2012, at 7:08 AM, Jeff Williams
>  wrote:
> >>>
>  Hi,
> 
>  I am looking at cassandra for a logging application. We currently log
> to a Postgresql database.
> 
>  I set up 2 cassandra servers for testing. I did a benchmark where I
> had 100 hashes representing logs entries, read from a json file. I then
> looped over these to do 10,000 log inserts. I repeated the same writing to
> a postgresql instance on one of the cassandra servers. The script is
> attached. The cassandra writes appear to perform a lot worse. Is this
> expected?
> 
>  jeff@transcoder01:~$ ruby cassandra-bm.rb
>  cassandra
>  3.17   0.48   3.65 ( 12.032212)
>  jeff@transcoder01:~$ ruby cassandra-bm.rb
>  postgres
>  2.14   0.33   2.47 (  7.002601)
> 
>  Regards,
>  Jeff
> 
>  
> >
>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Re: Compression on client side vs server side

2012-04-03 Thread Віталій Тимчишин
We are using client-side compression because of the following points. Can you
confirm they are valid?
1) Server-side compression uses replication-factor times more CPU (3 times more
with a replication factor of 3).
2) The network is used more by the compression factor (as you are sending
uncompressed data over the wire).
3) Any server utility operations, like repair or move (not sure about the
latter), will decompress/compress.
So, client-side compression looks way cheaper and can be very efficient
for long columns.
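
For long column values the client-side part is just a couple of Snappy calls; a minimal
sketch with snappy-java (value and encoding are illustrative):

import org.xerial.snappy.Snappy;

public class ClientSideCompression {
    // Compress a large value before writing it into a column.
    static byte[] pack(String value) throws java.io.IOException {
        return Snappy.compress(value.getBytes("UTF-8"));
    }

    // Decompress after reading the column back.
    static String unpack(byte[] stored) throws java.io.IOException {
        return new String(Snappy.uncompress(stored), "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        byte[] onWire = pack("some long column value ...");
        System.out.println("round-trip ok: " + unpack(onWire));
    }
}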

Best regards, Vitalii Tymchyshyn

2012/4/2 Jeremiah Jordan 

>  The server side compression can compress across columns/rows so it will
> most likely be more efficient.
> Whether you are CPU bound or IO bound depends on your application and node
> setup.  Unless your working set fits in memory you will be IO bound, and in
> that case server side compression helps because there is less to read from
> disk.  In many cases it is actually faster to read a compressed file from
> disk and decompress it, then to read an uncompressed file from disk.
>
> See Ed's post:
> "Cassandra compression is like more servers for free!"
>
> http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/cassandra_compression_is_like_getting
>
>  --
> *From:* benjamin.j.mcc...@gmail.com [benjamin.j.mcc...@gmail.com] on
> behalf of Ben McCann [b...@benmccann.com]
> *Sent:* Monday, April 02, 2012 10:42 AM
> *To:* user@cassandra.apache.org
> *Subject:* Compression on client side vs server side
>
>  Hi,
>
>  I was curious if I compress my data on the client side with Snappy
> whether there's any difference between doing that and doing it on the
> server side?  The wiki said that compression works best where each row has
> the same columns.  Does this mean the compression will be more efficient on
> the server side since it can look at multiple rows at once instead of only
> the row being inserted?  The reason I was thinking about possibly doing it
> client side was that it would save CPU on the datastore machine.  However,
> does this matter?  Is CPU typically the bottleneck on a machine or is it
> some other resource? (of course this will vary for each person, but
> wondering if there's a rule of thumb.  I'm making a web app, which
> hopefully will store about 5TB of data and have 10s of millions of page
> views per month)
>
>  Thanks,
> Ben
>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Re: About initial token, autobootstraping and load balance

2012-01-16 Thread Віталій Тимчишин
Yep, I think I can. Here you are: https://github.com/tivv/cassandra-balancer

2012/1/15 Carlos Pérez Miguel 

> If you can share it, it would be great
>
> Carlos Pérez Miguel
>
>
>
> 2012/1/15 Віталій Тимчишин :
> > Yep. Have written groovy script this friday to perform autobalancing :)
> I am
> > going to add it to my jenkins soon.
> >
> >
> > 2012/1/15 Maxim Potekhin 
> >>
> >> I see. Sure, that's a bit more complicated and you'd have to move tokens
> >> after adding a machine.
> >>
> >> Maxim
> >>
> >>
> >>
> >> On 1/15/2012 4:40 AM, Віталій Тимчишин wrote:
> >>
> >> It's nothing wrong for 3 nodes. It's a problem for cluster of 20+ nodes,
> >> growing.
> >>
> >> 2012/1/14 Maxim Potekhin 
> >>>
> >>> I'm just wondering -- what's wrong with manual specification of tokens?
> >>> I'm so glad I did it and have not had problems with balancing and all.
> >>>
> >>> Before I was indeed stuck with 25/25/50 setup in a 3 machine cluster,
> >>> when had to move tokens to make it 33/33/33 and I screwed up a little
> in
> >>> that the first one did not start with 0, which is not a good idea.
> >>>
> >>> Maxim
> >>>
> >>>
> >>
> >> --
> >> Best regards,
> >>  Vitalii Tymchyshyn
> >>
> >>
> >
> >
> >
> > --
> > Best regards,
> >  Vitalii Tymchyshyn
>



-- 
Best regards,
 Vitalii Tymchyshyn


Re: About initial token, autobootstraping and load balance

2012-01-15 Thread Віталій Тимчишин
Yep. I have written a Groovy script this Friday to perform autobalancing :) I
am going to add it to my Jenkins soon.

2012/1/15 Maxim Potekhin 

>  I see. Sure, that's a bit more complicated and you'd have to move tokens
> after adding a machine.
>
> Maxim
>
>
>
> On 1/15/2012 4:40 AM, Віталій Тимчишин wrote:
>
> It's nothing wrong for 3 nodes. It's a problem for cluster of 20+ nodes,
> growing.
>
> 2012/1/14 Maxim Potekhin 
>
>>  I'm just wondering -- what's wrong with manual specification of tokens?
>> I'm so glad I did it and have not had problems with balancing and all.
>>
>> Before I was indeed stuck with 25/25/50 setup in a 3 machine cluster,
>> when had to move tokens to make it 33/33/33 and I screwed up a little in
>> that the first one did not start with 0, which is not a good idea.
>>
>> Maxim
>>
>>
>>
>  --
> Best regards,
>  Vitalii Tymchyshyn
>
>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Re: About initial token, autobootstraping and load balance

2012-01-15 Thread Віталій Тимчишин
There's nothing wrong with it for 3 nodes. It's a problem for a growing cluster
of 20+ nodes.

2012/1/14 Maxim Potekhin 

>  I'm just wondering -- what's wrong with manual specification of tokens?
> I'm so glad I did it and have not had problems with balancing and all.
>
> Before I was indeed stuck with 25/25/50 setup in a 3 machine cluster, when
> had to move tokens to make it 33/33/33 and I screwed up a little in that
> the first one did not start with 0, which is not a good idea.
>
> Maxim
>
>
>
-- 
Best regards,
 Vitalii Tymchyshyn


Re: About initial token, autobootstraping and load balance

2012-01-14 Thread Віталій Тимчишин
Actually, it seems to me that "largest" means the node with the most data, not
the largest token range, which, with replication involved, makes the feature useless.
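
Manually assigning evenly spaced tokens avoids that heuristic entirely; for
RandomPartitioner the i-th of N tokens is simply i * 2^127 / N. A minimal sketch (the
node count is illustrative):

import java.math.BigInteger;

public class InitialTokens {
    public static void main(String[] args) {
        int nodes = 20;                                        // illustrative cluster size
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);  // RandomPartitioner token space
        for (int i = 0; i < nodes; i++) {
            BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodes));
            System.out.println("node " + i + ": initial_token = " + token);
        }
    }
}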

2012/1/13 David McNelis 

> The documentation for that section needs to be updated...
>
> What happens is that if you just autobootstrap without setting a token it
> will by default bisect the range of the largest node.
>
> So if you go through several iterations of adding nodes, then this is what
> you would see:
>
> Gen 1:
> Node A:  100% of tokens, token range 1-10 (for example)
>
> Gen 2:
> Node A: 50% of tokens  (1-5)
> Node B: 50% of tokens (6-10)
>
> Gen 3:
> Node A: 25% of tokens (1-2.5)
> Node B: 50% of tokens (6-10)
> Node C: 25% of tokens (2.6-5)
>
> In reality, what you'd want in gen 3 is every node to be 33%, but it would
> not be the case without setting the tokens to begin with.
>
> You'll notice that there are a couple of scripts available to generate a
> list of  initial tokens for your particular cluster size, then ever time
> you add a node you'll need to update all the nodes with new tokens in order
> to properly load balance.
>
> Does this make sense?
>
> Other folks, am I explaining this correctly?
>
> David
>
>
> 2012/1/13 Carlos Pérez Miguel 
>
>> Hello,
>>
>> I have a doubt about how initial token is determined. In Cassandra's
>> documentation it is said that it is better to manually configure the
>> initial token to each node in the system but also is said that if
>> initial token is not defined and autobootstrap is true, new nodes
>> choose initial token in order to better the load balance of the
>> cluster. But what happens if no initial token is chosen and
>> autobootstrap is not activated? How each node selects its initial
>> token to balance the ring?
>>
>> I ask this because I am making tests with a 20 nodes cassandra cluster
>> with cassandra 0.7.9. Any node has initial token, nor
>> autobootstraping. I restart the cluster with each test I want to make
>> and in the end the cluster is always well balanced.
>>
>> Thanks
>>
>> Carlos Pérez Miguel
>>
>
>


-- 
Best regards,
 Vitalii Tymchyshyn


Re: Cassandra OOM

2012-01-13 Thread Віталій Тимчишин
2012/1/4 Vitalii Tymchyshyn 

> On 04.01.12 14:25, Radim Kolar wrote:
>
>  > So, what are cassandra memory requirement? Is it 1% or 2% of disk data?
>> It depends on number of rows you have. if you have lot of rows then
>> primary memory eaters are index sampling data and bloom filters. I use
>> index sampling 512 and bloom filters set to 4% to cut down memory needed.
>>
> I've raised index sampling, and the bloom filter setting seems not to be on
> trunk yet. For me, memtables are what's eating the heap :(
>
>
Hello, all.

I've found and fixed the problem today (after one of my nodes kept OOMing
while replaying the commit log on start-up). Full-key deletes are not accounted for,
and so column families that receive only deletes are never flushed. Here is
the Jira: https://issues.apache.org/jira/browse/CASSANDRA-3741 and my pull
request to fix it: https://github.com/apache/cassandra/pull/5
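
For illustration, a hypothetical simplification of the accounting flaw (not the actual
Cassandra code or the committed patch):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class DirtyBytesSketch {
    // If the per-memtable "dirty bytes" accounting only sums column sizes, a full-row
    // delete (which carries no columns) contributes 0, so a memtable receiving nothing
    // but deletes never looks dirty and is never picked for flushing.
    static long dirtyBytesFor(List<Integer> columnSizes) {
        long bytes = 0;           // the fix amounts to starting from a per-row overhead instead of 0
        for (int size : columnSizes)
            bytes += size;
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(dirtyBytesFor(Arrays.asList(64, 128)));            // normal write    -> 192
        System.out.println(dirtyBytesFor(Collections.<Integer>emptyList()));  // full-row delete -> 0
    }
}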

Best regards, Vitalii Tymchyshyn


Re: is it bad to have lots of column families?

2012-01-05 Thread Віталій Тимчишин
2012/1/5 Michael Cetrulo 

> in a traditional database it's not a good idea to have hundreds of
> tables but is it also bad to have hundreds of column families in cassandra?
> thank you.
>

As far as I can see, this may raise memory requirements for you, since you
need to keep an index sample and a bloom filter for each column family in memory.

-- 
Best regards,
 Vitalii Tymchyshyn


Cassandra OOM

2012-01-03 Thread Віталій Тимчишин
Hello.

We have been using Cassandra for some time in our project. Currently we are on
1.1 trunk (it was an accidental migration, but since it's hard to migrate back
and it's performing nicely enough, we are staying on 1.1).
During the New Year holidays one of the servers produced a number of OOM
messages in the log.
According to the heap dump taken, most of the memory is taken by the MutationStage
queue (over 2 million items).
So, I am curious whether Cassandra has any flow control for messages? We
are using Quorum for writes and it seems to me that one slow server may
start getting more messages than it can consume. The writes will still
succeed, performed by the other servers in the replica set.
If there is no flow control, it will eventually OOM. Is that the case?
Are there any plans to handle this?
BTW: A lot of memory (~half) is taken by Inet4Address objects, so making a
cache of such objects would make this problem less likely.
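
A minimal sketch of the kind of cache I mean (a hypothetical illustration, not code from
Cassandra):

import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class AddressCache {
    private static final ConcurrentMap<InetAddress, InetAddress> CACHE =
            new ConcurrentHashMap<InetAddress, InetAddress>();

    // Return a canonical instance so millions of queued messages share a handful of objects.
    static InetAddress intern(InetAddress addr) {
        InetAddress existing = CACHE.putIfAbsent(addr, addr);
        return existing != null ? existing : addr;
    }

    public static void main(String[] args) throws Exception {
        InetAddress a = intern(InetAddress.getByName("192.168.1.10"));
        InetAddress b = intern(InetAddress.getByName("192.168.1.10"));
        System.out.println(a == b);   // true: both references point to the same cached instance
    }
}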

-- 
Best regards,
 Vitalii Tymchyshyn