Re: Disable FS journaling

2014-05-20 Thread Terje Marthinussen
With the journal enabled, the filesystem is faster on almost all operations.

Recovery here is more about saving you from waiting half an hour for a traditional 
full file system check.

Feel free to wait if you want though! :)

Regards,
Terje

> On 21 May 2014, at 01:11, Paulo Ricardo Motta Gomes 
>  wrote:
> 
> Thanks for the links!
> 
> Forgot to mention, using XFS here, as suggested by the Cassandra wiki. But 
> just double checked and it's apparently not possible to disable journaling on 
> XFS.
> 
> One of our sysadmins just suggested disabling journaling, since it's mostly 
> for recovery purposes, and Cassandra already does that pretty well with 
> commitlog, replication and anti-entropy. It would anyway be nice to know if 
> there could be any performance benefits from it. But I personally don't think 
> it would help much, due to the append-only nature of cassandra writes.
> 
> 
>> On Tue, May 20, 2014 at 12:43 PM, Michael Shuler  
>> wrote:
>>> On 05/20/2014 09:54 AM, Samir Faci wrote:
>>> I'm not sure you'd be gaining much by doing this.  This is probably
>>> dependent on the file system you're referring to when you say
>>> journaling.  There are a few of them around.
>>> 
>>> You could opt to use ext2 instead of ext3/4 in the unix world.  A quick
>>> google search linked me to this:
>> 
>> ext2/3 is not a good choice, due to file size limitations and performance reasons.
>> 
>> I started to search for a couple of links, and a quick check of the links I 
>> posted a couple of years ago shows they still seem to be relevant  ;)
>> 
>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201204.mbox/%3c4f7c5c16.1020...@pbandjelly.org%3E
>> 
>> (repost from above)
>> 
>> Hopefully this is some good reading on the topic:
>> 
>> https://www.google.com/search?q=xfs+site%3Ahttp%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fcassandra-user
>> 
>> one of the more interesting considerations:
>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201004.mbox/%3ch2y96b607d1004131614k5382b3a5ie899989d62921...@mail.gmail.com%3E
>> 
>> http://wiki.apache.org/cassandra/CassandraHardware
>> 
>> http://wiki.apache.org/cassandra/LargeDataSetConsiderations
>> 
>> http://www.datastax.com/dev/blog/questions-from-the-tokyo-cassandra-conference
>> 
>> -- 
>> Kind regards,
>> Michael
> 
> 
> 
> -- 
> Paulo Motta
> 
> Chaordic | Platform
> www.chaordic.com.br
> +55 48 3232.3200


Re: Cassandra as storage for cache data

2013-07-02 Thread Terje Marthinussen
If this is a tombstone problem as suggested by some, and it is ok to turn off 
replication as suggested by others, it may be an idea to do an optimization in 
cassandra where

if replication_factor < 1:
   do not create tombstones


Terje 


On Jul 2, 2013, at 11:11 PM, Dmitry Olshansky  
wrote:

> In our case we have a continuous flow of data to be cached. Every second we're 
> receiving tens of PUT requests. Every request has a 500Kb payload on average 
> and a TTL of about 20 minutes.
> 
> On the other side we have a similar flow of GET requests. Every GET request 
> is transformed into a "get by key" query for cassandra.
> 
> This is a very simple and straightforward solution:
> - one CF
> - one key that directly corresponds to the cache entry key
> - one value of type bytes that corresponds to the cache entry payload
> 
> To be honest, I don't see how we can switch this solution to a multi-CF scheme 
> playing with time-based snapshots.
> 
> Today this solution crashed again with overload symptoms:
> - almost non-stop compactions on every node in the cluster
> - large io-wait in the system
> - clients start failing with timeout exceptions
> 
> At the same time we see that cassandra uses only half of the Java heap. How can 
> we force it to start using all available resources (namely operating 
> memory)?
> 
> Best regards,
> Dmitry Olshansky



Re: Throughput decreases as latency increases with YCSB

2012-10-30 Thread Terje Marthinussen
Check how many concurrent real requests you have vs size of thread pools.

Regards,
Terje

On 30 Oct 2012, at 13:28, Peter Bailis  wrote:

>> I'm using YCSB on EC2 with one m1.large instance to drive client load
> 
> To add, I don't believe this is due to YCSB. I've done a fair bit of 
> client-side profiling and neither client CPU or NIC (or server NIC) are 
> bottlenecks.
> 
> I'll also add that this dataset fits in memory.
> 
> Thanks!
> Peter


Re: quick question about data layout on disk

2012-08-10 Thread Terje Marthinussen
The row key is stored only once in any sstable file.

That is, in the special case where you end up with one column/value per key in an 
sstable file, you are correct, but normally, I guess most of us are storing more per key.

Regards,
Terje

On 11 Aug 2012, at 10:34, Aaron Turner  wrote:

> Curious, but does cassandra store the rowkey along with every
> column/value pair on disk (pre-compaction) like Hbase does?  If so
> (which makes the most sense), I assume that's something that is
> optimized during compaction?
> 
> 
> -- 
> Aaron Turner
> http://synfin.net/ Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & 
> Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
>-- Benjamin Franklin
> "carpe diem quam minimum credula postero"


Re: Use of SSD for commitlog

2012-08-08 Thread Terje Marthinussen
You can probably get a 160GB Intel 320 or a Samsung 830 for the same price as 
the 146GB 15k rpm drive.

Overprovision the SSD 20% and off you go.

It will beat the HDD both sequentially and randomly.

Terje

On Aug 8, 2012, at 11:41 PM, Amit Kumar  wrote:

> 
> There is a really good presentation about SSD and Cassandra on youtube by 
> Rick Branson. I highly recommend watching it.
> 
> http://www.youtube.com/watch?v=zQdDi9pdf3I
> 
> 
> Amit
> On Aug 8, 2012, at 6:23 AM, Hiller, Dean wrote:
> 
>> A 7.5k rpm drive is probably fine and can still beat it, as it is going to be the speed 
>> of writing, not seeking (and I am not sure if they spec hard drives with a 
>> write time when not seeking… not sure).  Remember that drives are rated on 
>> how fast they spin… this disk should not be seeking a lot (in theory)… it is 
>> always writing.
>> 
>> Even on a read, it would do one seek and then do a very long sequential 
>> read (and I think that only happens on startup anyway).
>> 
>> Later,
>> Dean
>> 
>> From: Darvin Denmian <darvin.denm...@gmail.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Wednesday, August 8, 2012 7:16 AM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: Use of SSD for commitlog
>> 
>> Thanks for your reply Dean,
>> 
>> considering your reply, maybe I'll use a 15k RPM SCSI disk; I think it'll 
>> perform better than an SSD disk.
>> 
>> On Wed, Aug 8, 2012 at 10:01 AM, Hiller, Dean <dean.hil...@nrel.gov> wrote:
>> Probably not, since it is sequential writes… (i.e. seek performance is the big 
>> hit, and if it is sequential it should not be seeking and is about just as 
>> fast as an SSD in theory).  In practice, I have not measured the performance 
>> of one vs. the other though… that is always the best way to go (you could 
>> write a micro benchmark test with warmup writes and then stream writes to it 
>> and see how it does without cassandra).
>> 
>> Dean
>> 
>> From: Darvin Denmian <darvin.denm...@gmail.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Tuesday, August 7, 2012 8:34 PM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Use of SSD for commitlog
>> 
>> Hi,
>> 
>> Can somebody tell me if there is some benefit in using SSD disks
>> for the commitlog?
>> 
>> Thanks!
>> 
> 



Re: Much more native memory used by Cassandra then the configured JVM heap size

2012-06-21 Thread Terje Marthinussen
We run some fairly large and busy Cassandra setups.
All of them without mmap.

I have yet to see a benchmark which conclusively can say mmap is better (or 
worse for that matter) than standard ways of doing I/O, and we have done many of 
them over the last 2 years, by different people, with different tools and with different 
HW.

My only conclusion is that not using mmap is easier to monitor and debug (you 
actually see what memory is used by Cassandra) and is more stable overall.

I highly recommend non-mmap setups.

Regards,
Terje

On 22 Jun 2012, at 05:05, "Poziombka, Wade L"  
wrote:

> Just to close this item: with CASSANDRA-4314 applied I see no memory errors 
> (either Java heap or native heap).  Cassandra appears to be a hog with its 
> memory mapped files.  This caused us to wrongly think it was the culprit in a 
> severe native memory leak.  However, our leaky process was a different jsvc 
> process altogether.
>  
> I wanted to make sure I set the record straight and not leave the idea out 
> there that Cassandra may have a memory problem.
>  
> Wade Poziombka
> Intel Americas, Inc.
>  
>  
> From: Poziombka, Wade L [mailto:wade.l.poziom...@intel.com] 
> Sent: Wednesday, June 13, 2012 10:53 AM
> To: user@cassandra.apache.org
> Subject: RE: Much more native memory used by Cassandra then the configured 
> JVM heap size
>  
> Seems like my only recourse is to remove jna.jar and just take the 
> performance/swapping pain?
>  
> Obviously can’t have the entire box lock up.  I can provide a pmap etc. if 
> needed.
>  
> From: Poziombka, Wade L [mailto:wade.l.poziom...@intel.com] 
> Sent: Wednesday, June 13, 2012 10:28 AM
> To: user@cassandra.apache.org
> Subject: RE: Much more native memory used by Cassandra then the configured 
> JVM heap size
>  
> I have experienced the same issue.  The Java heap seems fine but eventually 
> the OS runs out of heap.  In my case it renders the entire box unusable 
> without a hard reboot.  Console shows:
>  
> is there a way to limit the native heap usage?
>  
> xfs invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
>  
> Call Trace:
>  [] out_of_memory+0x8e/0x2f3
>  [] __wake_up+0x38/0x4f
>  [] __alloc_pages+0x27f/0x308
>  [] __do_page_cache_readahead+0x96/0x17b
>  [] filemap_nopage+0x14c/0x360
>  [] __handle_mm_fault+0x1fd/0x103b
>  [] __wake_up+0x38/0x4f
>  [] do_page_fault+0x499/0x842
>  [] audit_filter_syscall+0x87/0xad
>  [] error_exit+0x0/0x84
>  
> Node 0 DMA per-cpu: empty
> Node 0 DMA32 per-cpu: empty
> Node 0 Normal per-cpu:
> cpu 0 hot: high 186, batch 31 used:23
> cpu 0 cold: high 62, batch 15 used:14
> …
> cpu 23 cold: high 62, batch 15 used:8
> Node 1 HighMem per-cpu: empty
> Free pages:  158332kB (0kB HighMem)
> Active:16225503 inactive:1 dirty:0 writeback:0 unstable:0 free:39583 
> slab:21496
> Node 0 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB
> present:0kB
> lowmem_reserve[]: 0 0 32320 32320
> Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB
> present:0
> lowmem_reserve[]: 0 0 32320 32320
> Node 0 Normal free:16136kB min:16272kB low:20340kB high:24408kB active:3255624
>  
>  
> From: aaron morton [mailto:aa...@thelastpickle.com] 
> Sent: Tuesday, June 12, 2012 4:08 AM
> To: user@cassandra.apache.org
> Subject: Re: Much more native memory used by Cassandra then the configured 
> JVM heap size
>  
> see http://wiki.apache.org/cassandra/FAQ#mmap
>  
> which cause the OS low memory.
> If the memory is used for mmapped access the os can get it back later. 
>  
> Is the low free memory causing a problem ?
>  
> Cheers
>  
>  
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>  
> On 12/06/2012, at 5:52 PM, Jason Tang wrote:
>  
> 
> Hi
>  
> I found some information of this issue
> And it seems we can use another strategy for data access to reduce mmap usage, in 
> order to use less memory.
>  
> But I didn't find the documentation describing the parameters for Cassandra 1.x. 
> Is it a good way to use this parameter to reduce shared memory usage, and 
> what's the impact? (btw, our data model is dynamic, which means that 
> although the throughput is high, the life cycle of the data is short, 
> one hour or less).
>  
> "
> # Choices are auto, standard, mmap, and mmap_index_only.
> disk_access_mode: auto
> "
>  
> http://comments.gmane.org/gmane.comp.db.cassandra.user/7390 
> 
> 2012/6/12 Jason Tang 
> See my post, I limit the JVM heap to 6G, but actually Cassandra will use more 
> memory, which is not counted in the JVM heap. 
>  
> I use top to monitor total memory used by Cassandra.
>  
> =
> -Xms6G -Xmx6G -Xmn1600M
>  
> 2012/6/12 Jeffrey Kesselman 
> Btw, I suggest you spin up JConsole as it will give you much more detail on 
> what your VM is actually doing.
> 
> 
>  
> On Mon, Jun 11, 2012 at 9:14 PM, Jason Tang  wrote:
> Hi
>  
> We have some problems with Cassandra memory usage. We configure the JVM heap to 
> 6G, but after running C

Re: two dimensional slicing

2012-01-29 Thread Terje Marthinussen
On Sun, Jan 29, 2012 at 7:26 PM, aaron morton wrote:

> and compare them, but at this point I need to focus on one to get
> things working, so I'm trying to make a best initial guess.
>
> I would go for RP then, BOP may look like less work to start with but it
> *will* bite you later. If you use an increasing version number as a key you
> will get a hot spot. Get it working with RP and Standard CF's, accept the
> extra lookups, and then see if where you are performance / complexity wise.
> Cassandra can be pretty fast.
>

Of course, there is no guarantee that it will bite you.

Whatever data hotspot you may get may very well be minor vs. the advantage
of slicing continuous blocks of data on a single server vs. random bits and
pieces all over the place.

For instance, there are many large data repositories out there
of analytic data which only have a few queries per hour. BOP will most
likely pose no performance problem at all for many of these; indeed, it may be much
faster than the alternatives.

BOP is very useful and powerful for many things and saves a fair chunk of
development time vs. the alternatives when you can use it.

If we really want everybody to stop using it, we should change cassandra so
it by default can provide the same function in some other way without
adding days and maybe weeks of development and extra complexity to your
project.

Terje


Re: What is the future of supercolumns ?

2012-01-06 Thread Terje Marthinussen
Please realize that I do not make any decisions here and I am not part of the 
core Cassandra developer team.

What has been said before is that they will most likely go away and at least 
under the hood be replaced by composite columns.

Jonathan has however stated that he would like the supercolumn API/abstraction 
to remain, at least for backwards compatibility.

Please understand that under the hood, supercolumns are merely groups of 
columns serialized as a single block of data. 


The fact that there is a specialized and hardcoded way to serialize these 
column groups into supercolumns is a problem however and they should probably 
go away to make space for a more generic implementation allowing more flexible 
data structures and less code specific for one special data structure.

Today there is a ton of extra code to deal with the slight differences in 
serialization and features of supercolumns vs. columns, and hopefully most of 
that could go away if things got structured a bit differently.

I also hope that we keep APIs to allow simple access to groups of key/value 
pairs to simplify application logic as working with just columns can add a lot 
of application code which should not be needed.

If you almost always need all or mostly all of the columns in a supercolumn, 
and you normally update all of them at the same time, they will most likely be 
faster than normal columns.

Processing wise, you will actually do a bit more work on 
serialization/deserialization of SCs, but the I/O part will usually be better 
grouped/require fewer operations.

I think we did some benchmarks on some heavy use cases with ~30 small columns 
per SC some time back, and we ended up with SCs being 10-20% faster.


Terje

On Jan 5, 2012, at 2:37 PM, Aklin_81 wrote:

> I have seen supercolumn usage being discouraged most of the time.
> However, sometimes supercolumns seem to fit the scenario most
> appropriately, not only in terms of how the data is stored but also in
> terms of how it is retrieved. Some of the queries supported by SCs are
> uniquely capable of doing the task which no other alternative schema
> could do.(Like recently I asked about getting the equivalent of
> retrieving a list of (full)supercolumns by name, through use of
> composite columns, unfortunately there was no way to do this without
> reading lots of extra columns).
> 
> So I am really confused whether:
> 
> 1. Should I really not use supercolumns in any case at all, however
> appropriate, or do I just need to be careful when deciding whether
> supercolumns fit my use case appropriately, or what!?
> 
> 2. Are there any performance concerns with supercolumns even in the
> cases where they are used most appropriately? Like when you need to
> retrieve the entire supercolumn every time & the max. no. of subcolumns
> varies between 0-10.
> (I don't write all the subcolumns inside supercolumn, at once though!
> Does this also matter?)
> 
> 3. What is their future? Are they going to be deprecated or may be
> enhanced later?



Re: [RELEASE] Apache Cassandra 1.0.6 released

2011-12-16 Thread Terje Marthinussen
Works if you turn off mmap?

We run without mmap and see hardly any difference in performance, but with huge 
benefits in the form of a memory consumption which can actually be monitored 
easily, and it just seems like things are more stable this way in general.  

Just turn off and see how that works!

Regards,
Terje

On 16 Dec 2011, at 18:39, Viktor Jevdokimov  
wrote:

> Created https://issues.apache.org/jira/browse/CASSANDRA-3642
> 
> -Original Message-
> From: Viktor Jevdokimov [mailto:viktor.jevdoki...@adform.com] 
> Sent: Thursday, December 15, 2011 18:26
> To: user@cassandra.apache.org
> Subject: RE: [RELEASE] Apache Cassandra 1.0.6 released
> 
> Cassandra 1.0.6 under Windows Server 2008 R2 64bit with disk access mode 
> mmap_index_only is failing to delete any *-Index.db files after compaction or 
> scrub:
> 
> ERROR 13:43:17,490 Fatal exception in thread Thread[NonPeriodicTasks:1,5,main]
> java.lang.RuntimeException: java.io.IOException: Failed to delete 
> D:\cassandra\data\data\system\LocationInfo-g-29-Index.db
>at 
> org.apache.cassandra.utils.FBUtilities.unchecked(FBUtilities.java:689)
>at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
>at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>at java.util.concurrent.FutureTask.run(Unknown Source)
>at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown
>  Source)
>at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
>  Source)
>at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>at java.lang.Thread.run(Unknown Source) Caused by: 
> java.io.IOException: Failed to delete 
> D:\cassandra\data\data\system\LocationInfo-g-29-Index.db
>at 
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:54)
>at 
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:44)
>at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:141)
>at 
> org.apache.cassandra.io.sstable.SSTableDeletingTask.runMayThrow(SSTableDeletingTask.java:81)
>at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>... 8 more
> 
> ERROR 17:20:09,701 Fatal exception in thread Thread[NonPeriodicTasks:1,5,main]
> java.lang.RuntimeException: java.io.IOException: Failed to delete 
> D:\cassandra\data\data\Keyspace1\ColumnFamily1-hc-840-Index.db
>at 
> org.apache.cassandra.utils.FBUtilities.unchecked(FBUtilities.java:689)
>at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
>at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>at java.util.concurrent.FutureTask.run(Unknown Source)
>at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown
>  Source)
>at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
>  Source)
>at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>at java.lang.Thread.run(Unknown Source) Caused by: 
> java.io.IOException: Failed to delete D:\cassandra\data\data\ 
> Keyspace1\ColumnFamily1-hc-840-Index.db
>at 
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:54)
>at 
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:44)
>at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:141)
>at 
> org.apache.cassandra.io.sstable.SSTableDeletingTask.runMayThrow(SSTableDeletingTask.java:81)
>at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>... 8 more
> 
> 
> 
> 
> Best regards/ Pagarbiai
> 
> Viktor Jevdokimov
> Senior Developer
> 
> Email: viktor.jevdoki...@adform.com
> Phone: +370 5 212 3063
> Fax: +370 5 261 0453
> 
> J. Jasinskio 16C,
> LT-01112 Vilnius,
> Lithuania
> 
> 
> 
> -Original Message-
> 
> From: Sylvain Lebresne [mailto:sylv...@datastax.com]
> Sent: Wednesday, December 14, 2011 20:23
> To: user@cassandra.apache.org
> Subject: [RELEASE] Apache

Re: Hinted handoff bug?

2011-12-01 Thread Terje Marthinussen
Sorry for not checking the source to see if things have changed, but I just 
remembered an issue I have forgotten to file a JIRA for.

In the old days, nodes would periodically try to deliver their hint queues.

However, this was at some stage changed so hints are only delivered when a node 
is being marked up.

However, you can definitely have a scenario where A fails to deliver to B, so 
it sends the hint to C instead.

However, B is not really down, it just could not accept that packet at that 
time, and C always (correctly in this case) thinks B is up, so it never tries to 
deliver the hints to B.

Will this change fix this, or do we need to bring back the thread that 
periodically tried to deliver hints regardless of node status changes?

Regards,
Terje

On 1 Dec 2011, at 19:10, Sylvain Lebresne  wrote:

> You're right, good catch.
> Do you mind opening a ticket on jira
> (https://issues.apache.org/jira/browse/CASSANDRA)?
> 
> --
> Sylvain
> 
> On Thu, Dec 1, 2011 at 10:03 AM, Fredrik L Stigbäck
>  wrote:
>> Hi,
>> We're running cassandra 1.0.3.
>> I've done some testing with 2 nodes (node A, node B), replication factor 2.
>> I take node A down, writing some data to node B and then take node A up.
>> Sometimes hints aren't delivered when node A comes up.
>> 
>> I've done some debugging in org.apache.cassandra.db.HintedHandOffManager and
>> sometimes node B ends up in a strange state in method
>> org.apache.cassandra.db.HintedHandOffManager.deliverHints(final InetAddress
>> to), where org.apache.cassandra.db.HintedHandOffManager.queuedDeliveries
>> already has node A in it's Set and therefore no hints will ever be delivered
>> to node A.
>> The only reason for this that I can see is that in
>> org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(InetAddress
>> endpoint) the hintStore.isEmpty() check returns true and the endpoint (node
>> A)  isn't removed from
>> org.apache.cassandra.db.HintedHandOffManager.queuedDeliveries. Then no hints
>> will ever be delivered again until node B is restarted.
>> During what conditions will hintStore.isEmpty() return true?
>> Shouldn't the hintStore.isEmpty() check be inside the try {} finally{}
>> clause, removing the endpoint from queuedDeliveries in the finally block?
>> 
>> public void deliverHints(final InetAddress to)
>> {
>> logger_.debug("deliverHints to {}", to);
>> if (!queuedDeliveries.add(to))
>> return;
>> ...
>> }
>> 
>> private void deliverHintsToEndpoint(InetAddress endpoint) throws
>> IOException, DigestMismatchException, InvalidRequestException,
>> TimeoutException,
>> {
>> ColumnFamilyStore hintStore =
>> Table.open(Table.SYSTEM_TABLE).getColumnFamilyStore(HINTS_CF);
>> if (hintStore.isEmpty())
>> return; // nothing to do, don't confuse users by logging a no-op
>> handoff
>> try
>> {
>> ..
>> }
>> finally
>> {
>> queuedDeliveries.remove(endpoint);
>> }
>> }
>> 
>> Regards
>> /Fredrik


Re: hw requirements

2011-08-31 Thread Terje Marthinussen
SSDs definitely make life simpler, as you will get a lot less trouble with 
impact from things like compactions.

Just beware that Cassandra expands data a lot due to storage overhead (for 
small columns), replication and needed space for compactions and repairs. 

It is well worth doing some real life testing here before you order a lot of HW 
!

With our tuning and load, the Java VM cannot really use more than 12GB of heap 
before GC falls apart, so probably 24GB would be a nice memory size per server.

And yes, many small is better than a few large in most cases.

Terje

On Aug 31, 2011, at 10:27 AM, Maxim Potekhin wrote:

> Plenty of comments in this thread already, and I agree with those saying
> "it depends". From my experience, a cluster with 18 spindles total
> could not match the performance and throughput of our primary
> Oracle server which had 108 spindles. After we upgraded to SSD,
> things have definitely changed for the better, for Cassandra.
> 
> Another thing is that if you plan to implement "composite indexes" by
> concatenating column values into additional columns, that would constitute
> a "write", hence you'll need CPU. So watch out.
> 
> 
> On 8/29/2011 9:15 AM, Helder Oliveira wrote:
>> Hello guys,
>> 
>> What is the typical profile of a cassandra server?
>> Are SSDs an option?
>> Does cassandra need a better CPU or lots of memory?
>> Are SATA II disks ok?
>> 
>> I am running some tests, and I started evaluating the possible hardware.
>> 
>> If someone already has conclusions about it, please share :D
>> 
>> Thanks a lot.
> 



Re: Using 5-6 bytes for cassandra timestamps vs 8…

2011-08-29 Thread Terje Marthinussen
I have a patch for trunk which I just have to get time to test a bit before I 
submit.

It is for super columns and will use the super column's timestamp as the base 
and only store varint-encoded offsets in the underlying columns. 

If the timestamp equals that of the SC, it will store nothing (just set a bit 
in the serialization flag).

This could be further extended somehow to sstables or rows so there is a base 
time per sstable or row and just varint-encoded offsets from that per column.

Terje

On Aug 29, 2011, at 3:58 PM, Kevin Burton wrote:

> I keep thinking about the usage of cassandra timestamps and feel that for a 
> lot of applications swallowing a 2-4x additional cost in memory might be a 
> nonstarter.
> 
> Has there been any discussion of using alternative date encodings?
> 
> Maybe 1ms resolution is too high ….. perhaps 10ms resolution?  or even 100ms 
> resolution?
> 
> Using 4 bytes and 100ms resolution you can fit in 13 years of timestamps if 
> you use the time you deploy the cassandra DB (aka 'now') as the epoch.
> 
> Even 5 bytes at 1ms resolution is 34 years.  
> 
> That's 37% less memory!  
> 
> In most of our applications, we would NEVER see concurrent writers on the 
> same key because we partition the jobs so that this doesn't happen.
> 
> I'd probably be fine with 100ms resolution.
> 
> Allowing the user to tune this would be interesting as well.
> 
> -- 
> Founder/CEO Spinn3r.com
> 
> Location: San Francisco, CA
> Skype: burtonator
> Skype-in: (415) 871-0687
> 



Re: For multi-tenant, is it good to have a key space for each tenant?

2011-08-25 Thread Terje Marthinussen
Depends of course a lot on how many tenants you have.

Hopefully the new off-heap memtables in 1.0 may help as well, as Java GC on 
large heaps is becoming a much bigger issue than memory cost.

Regards,
Terje

On 25 Aug 2011, at 14:20, Himanshi Sharma  wrote:

> 
> I am working on a similar sort of thing. As per my knowledge, creating a keyspace 
> for each tenant would impose a lot of memory constraints. 
> 
> Using a shared keyspace and shared column families would be a better 
> approach, with each row in the CF referenced by tenant_id as the row key. 
> And again it depends on the type of application. 
> 
> Hey, this is just a suggestion, I'm not completely sure.. :) 
> 
> 
> Himanshi Sharma 
> 
> 
> 
> 
> From: Guofeng Zhang 
> To:   user@cassandra.apache.org
> Date: 08/25/2011 10:38 AM
> Subject:  For multi-tenant, is it good to have a key space for each 
> tenant?
> 
> 
> 
> 
> I wonder if it is a good practice to create a key space for each tenant. Any 
> advice is appreciated. 
> 
> Thanks 
> 
> 


Re: memory_locking_policy parameter in cassandra.yaml for disabling swap - has this variable been renamed?

2011-07-29 Thread Terje Marthinussen
On Fri, Jul 29, 2011 at 6:29 AM, Peter Schuller  wrote:

> > I would love to understand how people got to this conclusion however and
> try to find out why we seem to see differences!
>
> I won't make any claims with Cassandra because I have never bothered
> benchmarking the difference in CPU usage since all my use-cases have
> been more focused on I/O efficiency, but I will say, without having
> benchmarked that either, that *generally*, if you're doing small reads
> of data that is in page cache using mmap() - something would have to
> be seriously wrong for that not to be significantly faster than
> regular I/O.
>
>
Sorry, with small reads, I was thinking small random reads, basically
things that are not very cacheable and probably cause demand paging.
For quite large reads like 10s of MB from disk, the demand paging will not
be good for mmap performance. This is probably not a type of storage use
which is a stronghold of cassandra either.

But you sort of nicely list a lot of things I did not take time to write, and they
just add support for my original question: what is the origin of the "mmap
is substantially faster" claim?

You also need to throw in the fun question of how the JVM will
interact with all of this.

Given the number of people asking questions here related to confusion about
mmap, memory mapping and JNA, and the work of maintaining the mmap code, I am
somewhat curious if this is worth it.

Different usages can generate vastly different loads on systems, so just
because our current usage scenarios do not seem to benefit from mmap,
other cases obviously can, and I am curious what these cases look like.

Terje


Re: memory_locking_policy parameter in cassandra.yaml for disabling swap - has this variable been renamed?

2011-07-28 Thread Terje Marthinussen
Benchmarks were done with up to 96GB of memory, much more caching than most people 
will ever have.

The point anyway is that you are talking I/O in the 10's or, at best, a few hundred 
MB/sec before cassandra will eat all your CPU (with dual 6-core CPUs in our 
case).

The memcopy involved here deep inside the kernel will not be very high on the 
list of expensive operations.

The assumption also seems to be that mmap is "free" cpu wise. 
It clearly isn't. There is definitely work involved for the CPU also when doing 
mmap. It is just that you move it from context switching and small I/O buffer 
copying to memory management.

Terje

On Jul 29, 2011, at 5:16 AM, Jonathan Ellis wrote:

> If you're actually hitting disk for most or even many of your reads then mmap 
> doesn't matter since the extra copy to a Java buffer is negligible compared 
> to the i/o itself (even on ssds). 
> On Jul 28, 2011 9:04 AM, "Terje Marthinussen"  wrote:
> > 
> > On Jul 28, 2011, at 9:52 PM, Jonathan Ellis wrote:
> > 
> >> This is not advisable in general, since non-mmap'd I/O is substantially 
> >> slower.
> > 
> > I see this again and again as a claim here, but it is actually close to 10 
> > years since I saw mmap'd I/O have any substantial performance benefits on 
> > any real life use I have needed.
> > 
> > We have done a lot of testing of this also with cassandra and I don't see 
> > anything conclusive. We have done as many tests where normal I/O has been 
> > faster than mmap and the differences may very well be within statistical 
> > variances given the complexity and number of factors involved in something 
> > like a distributed cassandra working at quorum.
> > 
> > mmap made a difference in 2000 when memory throughput was still measured in 
> > hundreds of megabytes/sec and cpu caches was a few kilobytes, but today, 
> > you got megabytes of CPU caches with 100GB/sec bandwidths and even memory 
> > bandwidths are in 10's of GB/sec.
> > 
> > However, I/O buffers are generally quite small and copying an I/O buffer 
> > from kernel to user space inside a cache with 100GB/sec bandwidth is really 
> > a non-issue given the I/O throughput cassandra generates.
> > 
> > In 2005 or so, CPUs had already reached a limit where I saw that mmap 
> > performed worse than regular I/O on a large number of use cases. 
> > 
> > Hard to say exactly why, but I saw one theory from a FreeBSD core developer 
> > speculating back then that the extra MMU work involved in some I/O loads 
> > may actually be slower than cache internal memcopy of tiny I/O buffers 
> > (they are pretty small after all).
> > 
> > I don't have a personal theory here. I just know that especially on large 
> > amounts of smaller I/O operations regular I/O was typically faster than 
> > mmap, which could back up that theory.
> > 
> > So, I wonder how people came to this conclusion as I am, under no real life 
> > use case with cassandra, able to reproduce anything resembling a 
> > significant difference and we have been benchmarking on nodes with ssd 
> > setups which can churn out 1GB/sec+ read speeds. 
> > 
> > Way more I/O throughput than most people have at hand and still I cannot 
> > get mmap to give me better performance.
> > 
> > I do, although subjectively, feel that things just seem to work better with 
> > regular I/O for us. We currently have very nice and stable heap sizes 
> > regardless of I/O load, and we have an easier system to operate as we 
> > can actually monitor how much memory the darned thing uses.
> > 
> > My recommendation? Stay away from mmap.
> > 
> > I would love to understand how people got to this conclusion however and 
> > try to find out why we seem to see differences!
> > 
> >> The OP is correct that it is best to disable swap entirely, and
> >> second-best to enable JNA for mlockall.
> > 
> > Be a bit careful with removing swap completely. Linux is not always happy 
> > when it gets short on memory.
> > 
> > Terje



Re: memory_locking_policy parameter in cassandra.yaml for disabling swap - has this variable been renamed?

2011-07-28 Thread Terje Marthinussen

On Jul 28, 2011, at 9:52 PM, Jonathan Ellis wrote:

> This is not advisable in general, since non-mmap'd I/O is substantially 
> slower.

I see this again and again as a claim here, but it is actually close to 10 
years since I saw mmap'd I/O have any substantial performance benefits on any 
real life use I have needed.

We have done a lot of testing of this also with cassandra and I don't see 
anything conclusive. We have done as many tests where normal I/O has been faster 
than mmap and the differences may very well be within statistical variances 
given the complexity and number of factors involved in something like a 
distributed cassandra working at quorum.

mmap made a difference in 2000 when memory throughput was still measured in 
hundreds of megabytes/sec and cpu caches was a few kilobytes, but today, you 
got megabytes of CPU caches with 100GB/sec bandwidths and even memory 
bandwidths are in 10's of GB/sec.

However, I/O buffers are generally quite small and copying an I/O buffer from 
kernel to user space inside a cache with 100GB/sec bandwidth is really  a 
non-issue given the I/O throughput cassandra generates.

In 2005 or so, CPUs had already reached a limit where I saw that mmap performed 
worse than regular I/O on a large number of use cases. 

Hard to say exactly why, but I saw one theory from a FreeBSD core developer 
speculating back then that the extra MMU work involved in some I/O loads may 
actually be slower than cache internal memcopy of tiny I/O buffers (they are 
pretty small after all).

I don't have a personal theory here. I just know that especially on large 
amounts of smaller I/O operations regular I/O was typically faster than mmap, 
which could back up that theory.

So, I wonder how people came to this conclusion as I am, under no real life use 
case with cassandra, able to reproduce anything resembling a significant 
difference and we have been benchmarking on nodes with ssd setups which can 
churn out 1GB/sec+ read speeds. 

Way more I/O throughput than most people have at hand and still I cannot get 
mmap to give me better performance.

I do, although subjectively, feel that things just seem to work better with 
regular I/O for us. We currently have very nice and stable heap sizes 
regardless of I/O load, and we have an easier system to operate as we can 
actually monitor how much memory the darned thing uses.

My recommendation? Stay away from mmap.

I would love to understand how people got to this conclusion however and try to 
find out why we seem to see differences!

> The OP is correct that it is best to disable swap entirely, and
> second-best to enable JNA for mlockall.

Be a bit careful with removing swap completely. Linux is not always happy when 
it gets short on memory.

Terje

Re: Repair doesn't work after upgrading to 0.8.1

2011-06-30 Thread Terje Marthinussen
Unless it is a 0.8.1 RC or beta

On Fri, Jul 1, 2011 at 12:57 PM, Jonathan Ellis  wrote:

> This isn't 2818 -- (a) the 0.8.1 protocol is identical to 0.8.0 and
> (b) the whole cluster is on the same version.
>
> On Thu, Jun 30, 2011 at 9:35 PM, aaron morton 
> wrote:
> > This seems to be a known issue related
> > to https://issues.apache.org/jira/browse/CASSANDRA-2818 e.g.
> https://issues.apache.org/jira/browse/CASSANDRA-2768
> > There was some discussion on the IRC list today, driftx said the simple
> fix
> > was a full cluster restart. Or perhaps a rolling restart with the 2818
> patch
> > applied may work.
> > Starting with "Dcassandra.load_ring_state=false" causes the node to
> > rediscover the ring which may help (just a guess really). But if there is
> > bad node start been passed around in gossip it will just get the bad
> state
> > again.
> > Anyone else ?
> >
> > -
> > Aaron Morton
> > Freelance Cassandra Developer
> > @aaronmorton
> > http://www.thelastpickle.com
> > On 1 Jul 2011, at 09:11, Héctor Izquierdo Seliva wrote:
> >
> > Hi all,
> >
> > I have upgraded all my cluster to 0.8.1. Today one of the disks in one
> > of the nodes died. After replacing the disk I tried running repair, but
> > this message appears:
> >
> > INFO [manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf] 2011-06-30
> > 20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.80
> > from repair because it is on version 0.7 or sooner. You should consider
> > updating this node before running repair again.
> > INFO [manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098] 2011-06-30
> > 20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.76
> > from repair because it is on version 0.7 or sooner. You should consider
> > updating this node before running repair again.
> > INFO [manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a] 2011-06-30
> > 20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.80
> > from repair because it is on version 0.7 or sooner. You should consider
> > updating this node before running repair again.
> > INFO [manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098] 2011-06-30
> > 20:36:25,086 AntiEntropyService.java (line 179) Excluding /10.20.13.77
> > from repair because it is on version 0.7 or sooner. You should consider
> > updating this node before running repair again.
> > INFO [manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf] 2011-06-30
> > 20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.76
> > from repair because it is on version 0.7 or sooner. You should consider
> > updating this node before running repair again.
> > INFO [manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098] 2011-06-30
> > 20:36:25,086 AntiEntropyService.java (line 782) No neighbors to repair
> > with for sbs on
> >
> (170141183460469231731687303715884105727,28356863910078205288614550619314017621]:
> > manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098 completed.
> > INFO [manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a] 2011-06-30
> > 20:36:25,086 AntiEntropyService.java (line 179) Excluding /10.20.13.79
> > from repair because it is on version 0.7 or sooner. You should consider
> > updating this node before running repair again.
> > INFO [manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf] 2011-06-30
> > 20:36:25,086 AntiEntropyService.java (line 782) No neighbors to repair
> > with for sbs on
> >
> (141784319550391026443072753096570088105,170141183460469231731687303715884105727]:
> > manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf completed.
> > INFO [manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a] 2011-06-30
> > 20:36:25,087 AntiEntropyService.java (line 782) No neighbors to repair
> > with for sbs on
> >
> (113427455640312821154458202477256070484,141784319550391026443072753096570088105]:
> > manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a completed.
> >
> > What can I do?
> >
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Re: Alternative Row Cache Implementation

2011-06-30 Thread Terje Marthinussen
We had a visitor from Intel a month ago.

One question from him was "What could you do if we gave you a server 2 years
from now that had 16TB of memory"

I went Eh... using Java?

2 years is maybe unrealistic, but you can already get some quite acceptable
prices even on servers in the 100GB memory range now if you buy in larger
quantities (30-50 servers and more in one go).

I don't think it is unrealistic that we will start seeing high end consumer
(x64) servers with TB's of memory in a few years and I really wonder where
that puts Java-based software.

Terje

On Fri, Jul 1, 2011 at 2:25 AM, Edward Capriolo wrote:

>
>
> On Thu, Jun 30, 2011 at 12:44 PM, Daniel Doubleday <
> daniel.double...@gmx.net> wrote:
>
>> Hi all - or rather devs
>>
>> we have been working on an alternative implementation to the existing row
>> cache(s)
>>
>> We have 2 main goals:
>>
>> - Decrease memory -> get more rows in the cache without suffering a huge
>> performance penalty
>> - Reduce gc pressure
>>
>> This sounds a lot like we should be using the new serializing cache in
>> 0.8.
>> Unfortunately our workload consists of loads of updates which would
>> invalidate the cache all the time.
>>
>> The second unfortunate thing is that the idea we came up with doesn't fit
>> the new cache provider api...
>>
>> It looks like this:
>>
>> Like the serializing cache we basically only cache the serialized byte
>> buffer. we don't serialize the bloom filter and try to do some other minor
>> compression tricks (var ints etc not done yet). The main difference is that
>> we don't deserialize but use the normal sstable iterators and filters as in
>> the regular uncached case.
>>
>> So the read path looks like this:
>>
>> return filter.collectCollatedColumns(memtable iter, cached row iter)
>>
>> The write path is not affected. It does not update the cache
>>
>> During flush we merge all memtable updates with the cached rows.
>>
>> These are early test results:
>>
>> - Depending on row width and value size the serialized cache takes between
>> 30% - 50% of memory compared with cached CF. This might be optimized further
>> - Read times increase by 5 - 10%
>>
>> We haven't tested the effects on gc but hope that we will see improvements
>> there because we only cache a fraction of objects (in terms of numbers) in
>> old gen heap which should make gc cheaper. Of course there's also the option
>> to use native mem like serializing cache does.
>>
>> We believe that this approach is quite promising but as I said it is not
>> compatible with the current cache api.
>>
>> So my question is: does that sound interesting enough to open a jira or
>> has that idea already been considered and rejected for some reason?
>>
>> Cheers,
>> Daniel
>>
>
>
>
> The problem I see with the row cache implementation is more of a JVM
> problem. This problem is not Cassandra localized (IMHO) as I hear Hbase
> people with similar large cache/ Xmx issues. Personally, I feel this is a
> sign of Java showing age. "Let us worry about the pointers" was a great
> solution when systems had 32MB memory, because the cost of walking the
> object graph was small and possible and small time windows. But JVM's
> already can not handle 13+ GB of RAM and it is quite common to see systems
> with 32-64GB physical memory. I am very curious to see how java is going to
> evolve on systems with 128GB or even higher memory.
>
> The G1 collector will help somewhat, however I do not see that really
> pushing Xmx higher than it is now. HBase has even gone the route of using an
> off heap cache, https://issues.apache.org/jira/browse/HBASE-4018 , and
> some Jira mentions Cassandra exploring this alternative as well.
>
> Doing whatever is possible to shrink the current size of an item in the cache is
> awesome. Anything that delivers more bang for the buck is +1. However I feel
> that VFS cache is the only way to effectively cache large datasets. I was
> quite disappointed when I upped a machine from 16GB to 48 GB physical
> memory. I said to myself "Awesome! now I can shave off a couple of GB for
> larger row caches" I changed Xmx from 9GB to 13GB, upped the caches, and
> restarted. I found the system spending a lot of time managing heap, and also
> found that my compaction processes that did 200GB in 4 hours now were taking
> 6 or 8 hours.
>
> I had heard that JVMs "top out around 20GB" but I found they "top out" much
> lower. VFS cache +1
>
>
>


Re: Re : get_range_slices result

2011-06-30 Thread Terje Marthinussen
It should of course be noted that how hard it is to load balance depends a
lot on your dataset.

Some datasets load balance reasonably well even when ordered, and use of the
OPP is not a big problem at all (on the contrary); in quite a few use
cases with current HW, read performance really isn't your problem in any
case.

You may for instance find it more useful to simplify adding nodes for
growing data capacity to the "end" of the token range using OPP than getting
extra performance you don't really need.

Terje

On Fri, Jun 24, 2011 at 7:16 PM, Sylvain Lebresne wrote:

> On Fri, Jun 24, 2011 at 10:21 AM, karim abbouh  wrote:
> > I want the get_range_slices() function to return records sorted (ordered) by the
> > key (rowId) used during the insertion.
> > Is that possible?
>
> You will have to use the OrderPreservingPartitioner. This is no
> without inconvenience however.
> See for instance
> http://wiki.apache.org/cassandra/StorageConfiguration#line-100 or
>
> http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
> that give more details on the pros and cons (the short version being
> that the main advantage of
> OrderPreservingPartitioner is what you're asking for, but it's main
> drawback is that load-balancing
> the cluster will likely be very very hard).
>
> In general the advice is to stick with RandomPartitioner and design a
> data model that avoids needing
> range slices (or at least needing that the result is sorted). This is
> very often not too hard and more
> efficient, and much more simpler than to deal with the load balancing
> problems of OrderPreservingPartitioner.
>
> --
> Sylvain
>
> >
> > 
> > De : aaron morton 
> > À : user@cassandra.apache.org
> > Envoyé le : Jeudi 23 Juin 2011 20h30
> > Objet : Re: get_range_slices result
> >
> > Not sure what your question is.
> > Does this help ? http://wiki.apache.org/cassandra/FAQ#range_rp
> > Cheers
> > -
> > Aaron Morton
> > Freelance Cassandra Developer
> > @aaronmorton
> > http://www.thelastpickle.com
> > On 23 Jun 2011, at 21:59, karim abbouh wrote:
> >
> > how can the get_range_slices() function return keys in sorted order?
> > BR
> >
> >
> >
> >
>


Re: RAID or no RAID

2011-06-27 Thread Terje Marthinussen
If you have a quality HW raid controller with proper performance (and far from 
all have good performance) you can definitely benefit from a battery-backed 
write cache on it, although the benefits will not be huge on raid 0.

Unless you get a really good price on that high performance HW raid with 
battery backup, it is probably not worth it for raid 0.

That said, raid 5 is pretty speedy as well with a good controller with 
battery cache, so don't rule that out if you have the controller anyway; it may 
save you from some manual recovery operations.

Regards,
Terje

On 28 Jun 2011, at 08:46, mcasandra  wrote:

> Which one is preferred, RAID0 or spreading data files across various disks on
> the same node? I like RAID0, but what would be the most convincing argument
> to put an additional RAID controller card in the machine?
> 
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/RAID-or-no-RAID-tp6522904p6522904.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.


Re: Cassandra ACID

2011-06-26 Thread Terje Marthinussen
>
> That being said, we do not provide isolation, which means in particular
> that
> reads *can* return a state where only parts of a batch update seems applied
> (and it would clearly be cool to have isolation and I'm not even
> saying this will
> never happen).


Out of curiosity, do you see any architectural issues that make this one
hard to do (given the limitations already in place for atomicity), or is it
more a case of "it's just that nobody has put it high enough on their
priority list to do it yet"?

Terje


snitch & thrift

2011-06-16 Thread Terje Marthinussen
Hi all!

Assuming a node ends up in GC land for a while, there is a good chance that
even though it performs terribly, the dynamic snitching will help you
avoid it on the gossip side; but that will not really help you much if thrift
still accepts requests and the thrift interface has choppy performance.

This makes me wonder whether thrift-only client-mode nodes might be a good
idea.

I don't think I have seen that this exists today (or is it possible that I
have missed a way to configure that?), but it does not seem like a very hard
thing to make and could maybe be good in some usage patterns for the
datanode as well as the thrift side.

Any thoughts?

Regards,
Terje


Re: Forcing Cassandra to free up some space

2011-06-15 Thread Terje Marthinussen
Watching this on a node here right now and it sort of shows how bad this can
get.
This node still has 109GB free disk by the way...

INFO [CompactionExecutor:5] 2011-06-16 09:11:59,164 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:5] 2011-06-16 09:12:23,929 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:5] 2011-06-16 09:12:46,489 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:17:53,299 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:18:17,782 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:18:42,078 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:19:06,984 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:19:32,079 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:19:57,265 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:20:22,706 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:20:47,331 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:21:13,062 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:21:38,288 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:22:03,500 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:22:29,407 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:22:55,577 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:23:20,951 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:23:46,448 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:3] 2011-06-16 09:24:12,030 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [ScheduledTasks:1] 2011-06-16 09:29:29,494 GCInspector.java (line 128)
GC for ParNew: 392 ms, 398997776 reclaimed leaving 2334786808 used; max is
10844635136
 INFO [ScheduledTasks:1] 2011-06-16 09:29:32,831 GCInspector.java (line 128)
GC for ParNew: 737 ms, 332336832 reclaimed leaving 2473311448 used; max is
10844635136
 INFO [CompactionExecutor:6] 2011-06-16 09:48:00,633 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:6] 2011-06-16 09:48:26,119 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:6] 2011-06-16 09:48:49,002 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:6] 2011-06-16 10:10:20,196 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:6] 2011-06-16 10:10:45,322 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:6] 2011-06-16 10:11:07,619 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:7] 2011-06-16 11:01:45,562 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:7] 2011-06-16 11:02:10,236 StorageService.java
(line 2071) requesting GC to free disk space
 INFO [CompactionExecutor:7] 2011-06-16 11:05:31,297 StorageService.java
(line 2071) requesting GC to free disk space

If I look at the data dir, I see 46 *Compacted files which make up an
additional 137GB of space.
The oldest of these Compacted files dates back to Jun 16th 01:26.

If these got deleted, there should actually be enough disk for the node to
run a full compaction run if needed.

Either the GC cleanup tactic is seriously flawed, or we have a potential bug
keeping references far longer than needed?

Terje



On Wed, Jun 15, 2011 at 11:50 PM, Shotaro Kamio  wrote:

> We've encountered the situation that compacted sstable files aren't
> deleted after node repair. Even when gc is triggered via jmx, it
> sometimes leaves compacted files. In one case, a lot of files were left.
> Some files have stayed around for more than 10 hours already. There is no
> guarantee that gc will clean up all compacted sstable files.
>
> We have a great interest on the following ticket.
> https://issues.apache.org/jira/browse/CASSANDRA-2521
>
>
> Regards,
> Shotaro
>
>
> On Fri, May 27, 2011 at 11:27 AM, Jeffrey Kesselman 
> wrote:
> > Im also not sure that will guarantee all space is cleaned up.  It
> > really depends on what you are doing inside Cassandra.  If you have
> > your on garbage collect that is just in some way tied to the gc run,
> > then it will run wh

Re: downgrading from cassandra 0.8 to 0.7.3

2011-06-15 Thread Terje Marthinussen
Can't help you with that.
You may have to go the json2sstable route and re-import into 0.7.3

But... why would you want to go back to 0.7.3?

Terje

On Thu, Jun 16, 2011 at 10:30 AM, Anurag Gujral wrote:

> Hi All,
>   I moved to cassandra 0.8.0 from cassandra-0.7.3 when I  try to
> move back I get the following error:
> java.lang.RuntimeException: Can't open sstables from the future! Current
> version f, found file: /data/cassandra/data/system/Schema-g-9.
>
> Please suggest.
>
> Thanks
> Anurag
>
>
>
>
>


Re: What triggers hint delivery?

2011-06-15 Thread Terje Marthinussen
I suspect a few possibilities:
1. I have not checked, but what happens (in terms of hint delivery) if a
node tries to write something but the write times out even if the node is
marked as up?
2. I would assume there can be ever so slight variations in how different
nodes in the cluster think the rest of the cluster is up. These events will
of course typically be short lived (unless some sort of long term split
brain situation occurs), but if you are writing data while, for instance, a
node is restarting, I would not be surprised if there are race conditions
where A sees B as down and sends a hint to C, but C already thinks B is up.
3. I have observed situations where a node comes into the up state
but for some reason takes a while to get really operational. Hint delivery
fails, the hint sender gives up and nothing more happens.

It may be an idea to let a node check whether it has pending hints on heartbeats
(potentially not on every heartbeat, but at a regular interval)?
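
Something along these lines is what I am thinking of. Purely a sketch with
invented names, not the actual HintedHandOffManager:

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  // Sketch: periodically re-check for undelivered hints instead of relying
  // only on the down->up state transition to trigger delivery.
  public class HintRetryTask {
      // Hypothetical interfaces, only to make the sketch self-contained.
      public interface HintStore {
          Iterable<String> endpointsWithPendingHints();
          void deliverHintsTo(String endpoint);
      }
      public interface LivenessCheck {
          boolean isAlive(String endpoint);
      }

      private final ScheduledExecutorService timer =
          Executors.newSingleThreadScheduledExecutor();

      public void start(final HintStore hints, final LivenessCheck cluster) {
          timer.scheduleWithFixedDelay(new Runnable() {
              public void run() {
                  // Re-attempt delivery for any endpoint that is up and still has hints.
                  for (String endpoint : hints.endpointsWithPendingHints()) {
                      if (cluster.isAlive(endpoint))
                          hints.deliverHintsTo(endpoint);
                  }
              }
          }, 10, 10, TimeUnit.MINUTES);
      }
  }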

Terje

On Thu, Jun 16, 2011 at 2:08 AM, Jonathan Ellis  wrote:

> On Wed, Jun 15, 2011 at 10:53 AM, Terje Marthinussen
>  wrote:
> > I was looking quickly at source code tonight.
> > As far as I could see from a quick code scan, hint delivery is only
> > triggered as a state change from a node is down to when it enters up
> state?
>
> Right.
>
> > If this is indeed the case, it would potentially explain why we sometimes
> > have hints on machines which does not seem to get played back
>
> Why is that?  Hints don't get created in the first place unless a node
> is in the down state.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


What triggers hint delivery?

2011-06-15 Thread Terje Marthinussen
Hi,

I was looking quickly at the source code tonight.
As far as I could see from a quick code scan, hint delivery is only
triggered by a state change, when a node goes from down to up?

If this is indeed the case, it would potentially explain why we sometimes
have hints on machines which do not seem to get played back, but I got a
feeling I must have been missing something when I scanned the code :)

Terje


Re: Forcing Cassandra to free up some space

2011-06-15 Thread Terje Marthinussen
On Thu, Jun 16, 2011 at 12:48 AM, Terje Marthinussen <
tmarthinus...@gmail.com> wrote:

> Even if the gc call cleaned all files, it is not really acceptable on a
> decent sized cluster due to the impact full gc has on performance.
> Especially non-needed ones.
>
>
Not acceptable as running GC on every node in the cluster will further
increase the time period when you have degraded performance.

Terje


Re: Forcing Cassandra to free up some space

2011-06-15 Thread Terje Marthinussen
Even if the gc call cleaned all files, it is not really acceptable on a
decent sized cluster due to the impact full gc has on performance.
Especially non-needed ones.

The delay in file deletion can also at times make it hard to see how much
spare disk you actually have.

We easily see a 100% increase in disk use which persists for long periods of
time before anything gets cleaned up. This can be quite misleading, and I
believe on a couple of occasions we have seen short term full disk scenarios
during testing as a result of cleanup not happening entirely when it
should...

Terje

On Wed, Jun 15, 2011 at 11:50 PM, Shotaro Kamio  wrote:

> We've encountered the situation that compacted sstable files aren't
> deleted after node repair. Even when gc is triggered via jmx, it
> sometimes leaves compacted files. In a case, a lot of files are left.
> Some files stay more than 10 hours already. There is no guarantee that
> gc will cleanup all compacted sstable files.
>
> We have a great interest on the following ticket.
> https://issues.apache.org/jira/browse/CASSANDRA-2521
>
>
> Regards,
> Shotaro
>
>
> On Fri, May 27, 2011 at 11:27 AM, Jeffrey Kesselman 
> wrote:
> > Im also not sure that will guarantee all space is cleaned up.  It
> > really depends on what you are doing inside Cassandra.  If you have
> > your on garbage collect that is just in some way tied to the gc run,
> > then it will run when  it runs.
> >
> > If otoh you are associating records in your storage with specific
> > objects in memory and using one of the post-mortem hooks (finalize or
> > PhantomReference) to tell you to clean up that particular record then
> > its quite possible they wont all get cleaned up.  In general hotspot
> > does not find and clean every candidate object on every GC run.  It
> > starts with the easiest/fastest to find and then sees what more it
> > thinks it needs to do to create enough memory for anticipated near
> > future needs.
> >
> > On Thu, May 26, 2011 at 10:16 PM, Jonathan Ellis 
> wrote:
> >> In summary, system.gc works fine unless you've deliberately done
> >> something like setting the -XX:-DisableExplicitGC flag.
> >>
> >> On Thu, May 26, 2011 at 5:58 PM, Konstantin  Naryshkin
> >>  wrote:
> >>> So, in summary, there is no way to predictably and efficiently tell
> Cassandra to get rid of all of the extra space it is using on disk?
> >>>
> >>> - Original Message -
> >>> From: "Jeffrey Kesselman" 
> >>> To: user@cassandra.apache.org
> >>> Sent: Thursday, May 26, 2011 8:57:49 PM
> >>> Subject: Re: Forcing Cassandra to free up some space
> >>>
> >>> Which JVM?  Which collector?  There have been and continue to be many.
> >>>
> >>> Hotspot itself supports a number of different collectors with
> >>> different behaviors.   Many of them do not collect every candidate on
> >>> every gc, but merely the easiest ones to find.  This is why depending
> >>> on finalizers is a *bad* idea in java code.  They may well never get
> >>> run.  (Finalizer is one of a few features the Sun Java team always
> >>> regretted putting in Java to start with.  It has caused quite a few
> >>> application problems over the years)
> >>>
> >>> The really important thing is that NONE of these behaviors of the
> >>> colelctors are guaranteed by specification not to change from version
> >>> to version.  Basing your code on non-specified behaviors is a good way
> >>> to hit mysterious failures on updates.
> >>>
> >>> For instance, in the mid 90s, IBM had a mode of their Vm called
> >>> "infinite heap."  it *never* garbage collected, even if you called
> >>> System.gc.  Instead it just threw away address space and counted on
> >>> the total memory needs for the life of the program being less then the
> >>> total addressable space of the processor.
> >>>
> >>> It was *very* fast for certain kinds of applications.
> >>>
> >>> Far from being pedantic, not depending on undocumented behavior is
> >>> simply good engineering.
> >>>
> >>>
> >>> On Thu, May 26, 2011 at 4:51 PM, Jonathan Ellis 
> wrote:
>  I've read the relevant source. While you're pedantically correct re
>  the spec, you're wrong as to what the JVM actually does.
> 
>  On Thu, May 26, 2011 at 3:14 PM, Jeffrey Kesselman 
> wrote:
> > Some references...
> >
> > "An object enters an unreachable state when no more strong references
> > to it exist. When an object is unreachable, it is a candidate for
> > collection. Note the wording: Just because an object is a candidate
> > for collection doesn't mean it will be immediately collected. The JVM
> > is free to delay collection until there is an immediate need for the
> > memory being consumed by the object."
> >
> >
> http://java.sun.com/docs/books/performance/1st_edition/html/JPAppGC.fm.html#998394
> >
> > and "Calling the gc method suggests that the Java Virtual Machine
> > expend effort toward recycling unused objects"
> >
> >
> http://download.oracle.com/javase/6

Re: repair and amount of transfers

2011-06-14 Thread Terje Marthinussen
Ah..

I just found Cassandra-2698 (I thought I had seen something about this)...

I guess that means I have to see if I can find time to investigate whether I
have a reproducible case?

Terje

On Tue, Jun 14, 2011 at 4:21 PM, Terje Marthinussen  wrote:

> Hi,
>
> I have been testing repairs a bit in different ways on 0.8.0 and I am
> curious on what to really expect in terms of data transferred.
>
> I would expect my data to be fairly consistent in this case from the start.
> More than a billion supercolumns, but there was no errors in feed and we
> have seen minimal amounts of read repair going on while doing a complete
> scan of the data for consistency checking. As such, I would also expect
> repair to finish reasonably fast.
>
> On some nodes, it finishes in a couple of hours, but other nodes it is
> taking more than 12 hours and I see some 65GB of data streamed to the node
> which surprises me as I am pretty sure that it is not that out of sync.
>
> Not sure how much the merkle trees are actually reducing what needs to be
> streamed though.
>
> What should we expect to see if this works?
>
> Regards,
> Terje
>


repair and amount of transfers

2011-06-14 Thread Terje Marthinussen
Hi,

I have been testing repairs a bit in different ways on 0.8.0 and I am
curious on what to really expect in terms of data transferred.

I would expect my data to be fairly consistent in this case from the start.
More than a billion supercolumns, but there were no errors in the feed and we
have seen minimal amounts of read repair going on while doing a complete
scan of the data for consistency checking. As such, I would also expect
repair to finish reasonably fast.

On some nodes, it finishes in a couple of hours, but other nodes it is
taking more than 12 hours and I see some 65GB of data streamed to the node
which surprises me as I am pretty sure that it is not that out of sync.

Not sure how much the merkle trees are actually reducing what needs to be
streamed though.

What should we expect to see if this works?

Regards,
Terje


Re: insufficient space to compact even the two smallest files, aborting

2011-06-13 Thread Terje Marthinussen
That most likely happened just because after scrub you had new files and got
over the "4" file minimum limit.

https://issues.apache.org/jira/browse/CASSANDRA-2697

Is the bug report.

2011/6/13 Héctor Izquierdo Seliva 

> Hi All.  I found a way to be able to compact. I have to call scrub on
> the column family. Then scrub gets stuck forever. I restart the node,
> and voila! I can compact again without any message about not having
> enough space. This looks like a bug to me. What info would be needed to
> fill a report? This is on 0.8 updating from 0.7.5
>
>
>


Re: insufficient space to compact even the two smallest files, aborting

2011-06-10 Thread Terje Marthinussen
Yes, which is perfectly fine for a short time if all you want is to compact
to one file for some reason.

I run min_compaction_threshold = 2 on one system here with SSDs. No problems
with the more aggressive disk utilization on the SSDs from the extra
compactions; reducing disk space is much more important.

Note that this is a threshold per bucket of similar sized sstables, not the
total number of sstables, so a threshold of 2 will not give you one big file.

Terje

On Fri, Jun 10, 2011 at 8:56 PM, Maki Watanabe wrote:

> But decreasing min_compaction_threashold will affect on minor
> compaction frequency, won't it?
>
> maki
>
>
> 2011/6/10 Terje Marthinussen :
> > bug in the 0.8.0 release version.
> > Cassandra splits the sstables depending on size and tries to find (by
> > default) at least 4 files of similar size.
> > If it cannot find 4 files of similar size, it logs that message in 0.8.0.
> > You can try to reduce the minimum required  files for compaction and it
> will
> > work.
> > Terje
> > 2011/6/10 Héctor Izquierdo Seliva 
> >>
> >> Hi, I'm running a test node with 0.8, and everytime I try to do a major
> >> compaction on one of the column families this message pops up. I have
> >> plenty of space on disk for it and the sum of all the sstables is
> >> smaller than the free capacity. Is there any way to force the
> >> compaction?
> >>
> >
> >
>
>
>
> --
> w3m
>


Re: insufficient space to compact even the two smallest files, aborting

2011-06-10 Thread Terje Marthinussen
12 sounds perfectly fine in this case.
4 buckets, 3 in each bucket; the default minimum threshold per bucket is 4.

Terje

2011/6/10 Héctor Izquierdo Seliva 

>
>
> El vie, 10-06-2011 a las 20:21 +0900, Terje Marthinussen escribió:
> > bug in the 0.8.0 release version.
> >
> >
> > Cassandra splits the sstables depending on size and tries to find (by
> > default) at least 4 files of similar size.
> >
> >
> > If it cannot find 4 files of similar size, it logs that message in
> > 0.8.0.
> >
> >
> > You can try to reduce the minimum required  files for compaction and
> > it will work.
> >
> >
> > Terje
>
>
> Hi Terje,
>
> There are 12 SSTables, so I don't think that's the problem. I will try
> anyway and see what happens.
>
>
>


Re: insufficient space to compact even the two smallest files, aborting

2011-06-10 Thread Terje Marthinussen
bug in the 0.8.0 release version.

Cassandra splits the sstables depending on size and tries to find (by
default) at least 4 files of similar size.

If it cannot find 4 files of similar size, it logs that message in 0.8.0.

You can try to reduce the minimum required number of files for compaction and it will
work.
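
To illustrate what I mean by "files of similar size", a simplified sketch of
the grouping idea. This is not the actual Cassandra code (the real bucketing
uses an average-size window rather than this simple 2x rule), just the shape
of it:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Iterator;
  import java.util.List;

  // Sketch: group sstable sizes into buckets of similar size and only keep
  // buckets that have at least minThreshold files; everything else is skipped,
  // which is when the "insufficient space"/nothing-to-do message shows up.
  public class SizeTieredBuckets {
      public static List<List<Long>> compactionBuckets(List<Long> sizes, int minThreshold) {
          List<Long> sorted = new ArrayList<Long>(sizes);
          Collections.sort(sorted);
          List<List<Long>> buckets = new ArrayList<List<Long>>();
          for (Long size : sorted) {
              if (!buckets.isEmpty()
                      && size <= 2 * buckets.get(buckets.size() - 1).get(0)) {
                  buckets.get(buckets.size() - 1).add(size);   // "similar" size, same bucket
              } else {
                  List<Long> bucket = new ArrayList<Long>();
                  bucket.add(size);
                  buckets.add(bucket);
              }
          }
          for (Iterator<List<Long>> it = buckets.iterator(); it.hasNext(); ) {
              if (it.next().size() < minThreshold)
                  it.remove();   // too few similar-sized files, nothing gets compacted
          }
          return buckets;
      }
  }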

Terje

2011/6/10 Héctor Izquierdo Seliva 

> Hi, I'm running a test node with 0.8, and everytime I try to do a major
> compaction on one of the column families this message pops up. I have
> plenty of space on disk for it and the sum of all the sstables is
> smaller than the free capacity. Is there any way to force the
> compaction?
>
>


Re: Troubleshooting IO performance ?

2011-06-07 Thread Terje Marthinussen
If you run iostat with output every few seconds, is the I/O stable or do
you see very uneven I/O?

Regards,
Terje

On Tue, Jun 7, 2011 at 11:12 AM, aaron morton wrote:

> There is a big IO queue and reads are spending a lot of time in the queue.
>
> Some more questions:
> - what version are you on ?
> -  what is the concurrent_reads config setting ?
> - what is nodetool tpstats showing during the slow down ?
> - exactly how much data are you asking for ? how many rows and what sort of
> slice
> - has their been a lot of deletes or TTL columns used ?
>
> Hope that helps.
> Aaron
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 7 Jun 2011, at 10:09, Philippe wrote:
>
> Ok, here it goes again... No swapping at all...
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff   cache   si   so     bi    bo    in    cs us sy id wa
>  1 63  32044  88736  37996 7116524    0    0 227156     0 18314  5607 30  5 11 53
>  1 63  32044  90844  37996 7103904    0    0 233524   202 17418  4977 29  4  9 58
>  0 42  32044  91304  37996 7123884    0    0 249736     0 16197  5433 19  6  3 72
>  3 25  32044  89864  37996 7135980    0    0 223140    16 18135  7567 32  5 11 52
>  1  1  32044  88664  37996 7150728    0    0 229416   128 19168  7554 36  4 10 51
>  4  0  32044  89464  37996 7149428    0    0 213852    18 21041  8819 45  5 12 38
>  4  0  32044  90372  37996 7149432    0    0 233086   142 19909  7041 43  5 10 41
>  7  1  32044  89752  37996 7149520    0    0 206906     0 19350  6875 50  4 11 35
>
> Lots and lots of disk activity
> iostat -dmx 2
> Device:  rrqm/s  wrqm/s      r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sda        52.50    0.00  7813.00  0.00  108.01   0.00     28.31    117.15  14.89    14.89     0.00   0.11  83.00
> sdb        56.00    0.00  7755.50  0.00  108.51   0.00     28.66    118.67  15.18    15.18     0.00   0.11  82.80
> md1         0.00    0.00     0.00  0.00    0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
> md5         0.00    0.00 15796.50  0.00  219.21   0.00     28.42      0.00   0.00     0.00     0.00   0.00   0.00
> dm-0        0.00    0.00 15796.50  0.00  219.21   0.00     28.42    273.42  17.03    17.03     0.00   0.05  83.40
> dm-1        0.00    0.00     0.00  0.00    0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
>
> More info :
> - all the data directory containing the data I'm querying into is  9.7GB
> and this is a server with 16GB
> - I'm hitting the server with 6 concurrent multigetsuperslicequeries on
> multiple keys, some of them can bring back quite a number of data
> - I'm reading all the keys for one column, pretty much sequentially
>
> This is a query in a rollup table that was originally in MySQL and it
> doesn't look like the performance to query by key is better. So I'm betting
> I'm doing something wrong here... but what ?
>
> Any ideas ?
> Thanks
>
> 2011/6/6 Philippe 
>
>> hum..no, it wasn't swapping. cassandra was the only thing running on that
>> server
>> and i was querying the same keys over and over
>>
>> i restarted Cassandra and doing the same thing, io is now down to zero
>> while cpu is up which dosen't surprise me as much.
>>
>> I'll report if it happens again.
>> Le 5 juin 2011 16:55, "Jonathan Ellis"  a écrit :
>>
>> > You may be swapping.
>> >
>> > http://spyced.blogspot.com/2010/01/linux-performance-basics.html
>> > explains how to check this as well as how to see what threads are busy
>> > in the Java process.
>> >
>> > On Sat, Jun 4, 2011 at 5:34 PM, Philippe  wrote:
>> >> Hello,
>> >> I am evaluating using cassandra and I'm running into some strange IO
>> >> behavior that I can't explain, I'd like some help/ideas to troubleshoot
>> it.
>> >> I am running a 1 node cluster with a keyspace consisting of two columns
>> >> families, one of which has dozens of supercolumns itself containing
>> dozens
>> >> of columns.
>> >> All in all, this is a couple gigabytes of data, 12GB on the hard drive.
>> >> The hardware is pretty good : 16GB memory + RAID-0 SSD drives with LVM
>> and
>> >> an i5 processor (4 cores).
>> >> Keyspace: xxx
>> >> Read Count: 460754852
>> >> Read Latency: 1.108205793092766 ms.
>> >> Write Count: 30620665
>> >> Write Latency: 0.01411020877567486 ms.
>> >> Pending Tasks: 0
>> >> Column Family: xx
>> >> SSTable count: 5
>> >> Space used (live): 548700725
>> >> Space used (total): 548700725
>> >> Memtable Columns Count: 0
>> >> Memtable Data Size: 0
>> >> Memtable Switch Count: 11
>> >> Read Count: 2891192
>> >> Read Lat

Re: [RELEASE] 0.8.0

2011-06-06 Thread Terje Marthinussen
Yes, I am aware of it but it was not an alternative for this project which
will face production soon.

The patch I have is fairly non-intrusive (especially vs. 674) so I think it
can be interesting depending on how quickly 674 will be integrated into
cassandra releases.

I plan to take a closer look at 674 soon to see if I can add something
there.

Terje

On Tue, Jun 7, 2011 at 1:59 AM, Ryan King  wrote:

>
> You might want to watch
> https://issues.apache.org/jira/browse/CASSANDRA-674 which should be
> ready for testing soon.
>
> -ryan
>


Re: [RELEASE] 0.8.0

2011-06-06 Thread Terje Marthinussen
How did that typo happen...
"across a committed hints file"
should be
"across a corrupted hints file"

Seems like the last supercolumn in the hints file has 0 subcolumns.
This actually seems to be correctly serialized, but my code has a bug and
fails to read it.

That said, I wonder why the hint has 0 subcolumns in the first
place?
Is that expected behaviour?

Regards,
Terje


On Mon, Jun 6, 2011 at 10:09 PM, Terje Marthinussen  wrote:

> Of course I talked too soon.
> I saw a corrupted commitlog some days back after killing cassandra and I
> just came across a committed hints file after a cluster restart for some
> config changes :(
> Will look into that.
>
> Otherwise, not defaults, but close.
> The dataset is fed from scratch so yes, memtable_total_space is there.
>
> Some option tuning here and there and a few extra GC options and a
> relatively large patch which makes more compact serialization (this may help
> a bit...)
>
> Most of the tuning dates back to cassandra 0.6/0.7. It could be an
> interesting experiment to see if things got worse without them on 0.8.
>
> Hopefully I can submit the serialization patch soon.
>
> Regards,
> Terje
>
> On Mon, Jun 6, 2011 at 9:12 PM, Jonathan Ellis  wrote:
>
>> Has this been running w/ default settings (i.e. relying on the new
>> memtable_total_space_in_mb) or was this an upgrade from 0.7 (or
>> otherwise had the per-CF memtable settings applied?)
>>
>> On Mon, Jun 6, 2011 at 12:00 AM, Terje Marthinussen
>>  wrote:
>> > 0.8 under load may turn out to be more stable and well behaving than any
>> > release so far
>> > Been doing a few test runs stuffing more than 1 billion records into a
>> 12
>> > node cluster and thing looks better than ever.
>> > VM's stable and nice at 11GB. No data corruptions, dead nodes, full GC's
>> or
>> > any of the other trouble that plagued early 0.7 releases.
>> > Still have to test more nasty stuff like rebalancing or recovering
>> failed
>> > nodes, but so far I would recommend anyone to consider  0.8 over 0.7.x
>> if
>> > setting up a new system
>> > Terje
>> >
>> > On Fri, Jun 3, 2011 at 5:25 PM, Stephen Connolly
>> >  wrote:
>> >>
>> >> Great work!
>> >>
>> >> -Stephen
>> >>
>> >> P.S.
>> >>  As the release of artifacts to Maven Central is now part of the
>> >> release process, the artifacts are all available from Maven Central
>> >> already (for people who use Maven/ANT+Ivy/Gradle/Buildr/etc)
>> >>
>> >> On 3 June 2011 00:36, Eric Evans  wrote:
>> >> >
>> >> > I am very pleased to announce the official release of Cassandra
>> 0.8.0.
>> >> >
>> >> > If you haven't been paying attention to this release, this is your
>> last
>> >> > chance, because by this time tomorrow all your friends are going to
>> be
>> >> > raving, and you don't want to look silly.
>> >> >
>> >> > So why am I resorting to hyperbole?  Well, for one because this is
>> the
>> >> > release that debuts the Cassandra Query Language (CQL).  In one fell
>> >> > swoop Cassandra has become more than NoSQL, it's MoSQL.
>> >> >
>> >> > Cassandra also has distributed counters now.  With counters, you can
>> >> > count stuff, and counting stuff rocks.
>> >> >
>> >> > A kickass use-case for Cassandra is spanning data-centers for
>> >> > fault-tolerance and locality, but doing so has always meant sending
>> data
>> >> > in the clear, or tunneling over a VPN.   New for 0.8.0, encryption of
>> >> > intranode traffic.
>> >> >
>> >> > If you're not motivated to go upgrade your clusters right now, you're
>> >> > either not easily impressed, or you're very lazy.  If it's the
>> latter,
>> >> > would it help knowing that rolling upgrades between releases is now
>> >> > supported?  Yeah.  You can upgrade your 0.7 cluster to 0.8 without
>> >> > shutting it down.
>> >> >
>> >> > You see what I mean?  Then go read the release notes[1] to learn
>> about
>> >> > the full range of awesomeness, then grab a copy[2] and become a
>> >> > (fashionably )early adopter.
>> >> >
>> >> > Drivers for CQL are available in Python[3], Java[3], and Node.js[4].
>> >> >
>> >> > As usual, a Debian package is available from the project's APT
>> >> > repository[5].
>> >> >
>> >> > Enjoy!
>> >> >
>> >> >
>> >> > [1]: http://goo.gl/CrJqJ (NEWS.txt)
>> >> > [2]: http://cassandra.debian.org/download
>> >> > [3]: http://www.apache.org/dist/cassandra/drivers
>> >> > [4]: https://github.com/racker/node-cassandra-client
>> >> > [5]: http://wiki.apache.org/cassandra/DebianPackaging
>> >> >
>> >> > --
>> >> > Eric Evans
>> >> > eev...@rackspace.com
>> >> >
>> >> >
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>
>


Re: [RELEASE] 0.8.0

2011-06-06 Thread Terje Marthinussen
Of course I talked too soon.
I saw a corrupted commitlog some days back after killing cassandra and I
just came across a committed hints file after a cluster restart for some
config changes :(
Will look into that.

Otherwise, not defaults, but close.
The dataset is fed from scratch so yes, memtable_total_space is there.

Some option tuning here and there and a few extra GC options and a
relatively large patch which makes more compact serialization (this may help
a bit...)

Most of the tuning dates back to cassandra 0.6/0.7. It could be an
interesting experiment to see if things got worse without them on 0.8.

Hopefully I can submit the serialization patch soon.

Regards,
Terje

On Mon, Jun 6, 2011 at 9:12 PM, Jonathan Ellis  wrote:

> Has this been running w/ default settings (i.e. relying on the new
> memtable_total_space_in_mb) or was this an upgrade from 0.7 (or
> otherwise had the per-CF memtable settings applied?)
>
> On Mon, Jun 6, 2011 at 12:00 AM, Terje Marthinussen
>  wrote:
> > 0.8 under load may turn out to be more stable and well behaving than any
> > release so far
> > Been doing a few test runs stuffing more than 1 billion records into a 12
> > node cluster and thing looks better than ever.
> > VM's stable and nice at 11GB. No data corruptions, dead nodes, full GC's
> or
> > any of the other trouble that plagued early 0.7 releases.
> > Still have to test more nasty stuff like rebalancing or recovering failed
> > nodes, but so far I would recommend anyone to consider  0.8 over 0.7.x if
> > setting up a new system
> > Terje
> >
> > On Fri, Jun 3, 2011 at 5:25 PM, Stephen Connolly
> >  wrote:
> >>
> >> Great work!
> >>
> >> -Stephen
> >>
> >> P.S.
> >>  As the release of artifacts to Maven Central is now part of the
> >> release process, the artifacts are all available from Maven Central
> >> already (for people who use Maven/ANT+Ivy/Gradle/Buildr/etc)
> >>
> >> On 3 June 2011 00:36, Eric Evans  wrote:
> >> >
> >> > I am very pleased to announce the official release of Cassandra 0.8.0.
> >> >
> >> > If you haven't been paying attention to this release, this is your
> last
> >> > chance, because by this time tomorrow all your friends are going to be
> >> > raving, and you don't want to look silly.
> >> >
> >> > So why am I resorting to hyperbole?  Well, for one because this is the
> >> > release that debuts the Cassandra Query Language (CQL).  In one fell
> >> > swoop Cassandra has become more than NoSQL, it's MoSQL.
> >> >
> >> > Cassandra also has distributed counters now.  With counters, you can
> >> > count stuff, and counting stuff rocks.
> >> >
> >> > A kickass use-case for Cassandra is spanning data-centers for
> >> > fault-tolerance and locality, but doing so has always meant sending
> data
> >> > in the clear, or tunneling over a VPN.   New for 0.8.0, encryption of
> >> > intranode traffic.
> >> >
> >> > If you're not motivated to go upgrade your clusters right now, you're
> >> > either not easily impressed, or you're very lazy.  If it's the latter,
> >> > would it help knowing that rolling upgrades between releases is now
> >> > supported?  Yeah.  You can upgrade your 0.7 cluster to 0.8 without
> >> > shutting it down.
> >> >
> >> > You see what I mean?  Then go read the release notes[1] to learn about
> >> > the full range of awesomeness, then grab a copy[2] and become a
> >> > (fashionably )early adopter.
> >> >
> >> > Drivers for CQL are available in Python[3], Java[3], and Node.js[4].
> >> >
> >> > As usual, a Debian package is available from the project's APT
> >> > repository[5].
> >> >
> >> > Enjoy!
> >> >
> >> >
> >> > [1]: http://goo.gl/CrJqJ (NEWS.txt)
> >> > [2]: http://cassandra.debian.org/download
> >> > [3]: http://www.apache.org/dist/cassandra/drivers
> >> > [4]: https://github.com/racker/node-cassandra-client
> >> > [5]: http://wiki.apache.org/cassandra/DebianPackaging
> >> >
> >> > --
> >> > Eric Evans
> >> > eev...@rackspace.com
> >> >
> >> >
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Re: [RELEASE] 0.8.0

2011-06-05 Thread Terje Marthinussen
0.8 under load may turn out to be more stable and well behaved than any
release so far.

Been doing a few test runs stuffing more than 1 billion records into a 12
node cluster and things look better than ever.
VMs stable and nice at 11GB. No data corruption, dead nodes, full GCs or
any of the other trouble that plagued early 0.7 releases.

Still have to test more nasty stuff like rebalancing or recovering failed
nodes, but so far I would recommend anyone to consider 0.8 over 0.7.x if
setting up a new system.

Terje

On Fri, Jun 3, 2011 at 5:25 PM, Stephen Connolly <
stephen.alan.conno...@gmail.com> wrote:

> Great work!
>
> -Stephen
>
> P.S.
>  As the release of artifacts to Maven Central is now part of the
> release process, the artifacts are all available from Maven Central
> already (for people who use Maven/ANT+Ivy/Gradle/Buildr/etc)
>
> On 3 June 2011 00:36, Eric Evans  wrote:
> >
> > I am very pleased to announce the official release of Cassandra 0.8.0.
> >
> > If you haven't been paying attention to this release, this is your last
> > chance, because by this time tomorrow all your friends are going to be
> > raving, and you don't want to look silly.
> >
> > So why am I resorting to hyperbole?  Well, for one because this is the
> > release that debuts the Cassandra Query Language (CQL).  In one fell
> > swoop Cassandra has become more than NoSQL, it's MoSQL.
> >
> > Cassandra also has distributed counters now.  With counters, you can
> > count stuff, and counting stuff rocks.
> >
> > A kickass use-case for Cassandra is spanning data-centers for
> > fault-tolerance and locality, but doing so has always meant sending data
> > in the clear, or tunneling over a VPN.   New for 0.8.0, encryption of
> > intranode traffic.
> >
> > If you're not motivated to go upgrade your clusters right now, you're
> > either not easily impressed, or you're very lazy.  If it's the latter,
> > would it help knowing that rolling upgrades between releases is now
> > supported?  Yeah.  You can upgrade your 0.7 cluster to 0.8 without
> > shutting it down.
> >
> > You see what I mean?  Then go read the release notes[1] to learn about
> > the full range of awesomeness, then grab a copy[2] and become a
> > (fashionably )early adopter.
> >
> > Drivers for CQL are available in Python[3], Java[3], and Node.js[4].
> >
> > As usual, a Debian package is available from the project's APT
> > repository[5].
> >
> > Enjoy!
> >
> >
> > [1]: http://goo.gl/CrJqJ (NEWS.txt)
> > [2]: http://cassandra.debian.org/download
> > [3]: http://www.apache.org/dist/cassandra/drivers
> > [4]: https://github.com/racker/node-cassandra-client
> > [5]: http://wiki.apache.org/cassandra/DebianPackaging
> >
> > --
> > Eric Evans
> > eev...@rackspace.com
> >
> >
>


Re: Memory Usage During Read

2011-05-14 Thread Terje Marthinussen
Out of curiosity, could you try to disable mmap as well?

I had some problems here some time back and I wanted to see better what was
going on, so I disabled mmap.
I actually don't think I have the same problem again, but I have seen Java VM
sizes up at 30-40GB with a heap of just 16GB.

Haven't paid much attention since fortunately the server had enough memory
anyway.
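
(If anyone wants to try the same: as far as I remember it is the disk_access_mode
setting in cassandra.yaml — "standard" turns mmap off completely, and
"mmap_index_only" keeps only the index files mapped.)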

Terje

On Sat, May 14, 2011 at 7:20 PM, Sris  wrote:

>   Typically nothing is ever logged other than the GC failures
>

> In addition to the heapdumps,
> be useful to see some GC logs
>  (turn on GC logs via  cassandra.in.sh
> Or add
> -Xloggc:/var/log/cassandra/gc.log
> -XX:+PrintGCDetails
>  )
>
> thanks, Sri
>
> On May 7, 2011, at 6:37 PM, Jonathan Ellis  wrote:
>
> The live:serialized size ratio depends on what your data looks like
>> (small columns will be less efficient than large blobs) but using the
>> rule of thumb of 10x, around 1G * (1 + memtable_flush_writers +
>> memtable_flush_queue_size).
>>
>> So first thing I would do is drop writers and queue to 1 and 1.
>>
>> Then I would drop the max heap to 1G, memtable size to 8MB so the heap
>> dump is easier to analyze. Then let it OOM and look at the dump with
>> http://www.eclipse.org/mat/
>>
>> On Sat, May 7, 2011 at 3:54 PM, Serediuk, Adam
>>  wrote:
>>
>>> How much memory should a single hot cf with a 128mb memtable take with
>>> row and key caching disabled during read?
>>>
>>> Because I'm seeing heap go from 3.5gb skyrocketing straight to max
>>> (regardless of the size, 8gb and 24gb both do the same) at which time the
>>> jvm will do nothing but full gc and is unable to reclaim any meaningful
>>> amount of memory. Cassandra then becomes unusable.
>>>
>>> I see the same behavior with smaller memtables, eg 64mb.
>>>
>>> This happens well into the read operation an only on a small number of
>>> nodes in the cluster(1-4 out of a total of 60 nodes.)
>>>
>>> Sent from my iPhone
>>>
>>> On May 6, 2011, at 22:45, "Jonathan Ellis"  wrote:
>>>
>>> You don't GC storm without legitimately having a too-full heap.  It's
 normal to see occasional full GCs from fragmentation, but that will
 actually compact the heap and everything goes back to normal IF you
 had space actually freed up.

 You say you've played w/ memtable size but that would still be my bet.
 Most people severely underestimate how much space this takes (10x in
 memory over serialized size), which will bite you when you have lots
 of CFs defined.

 Otherwise, force a heap dump after a full GC and take a look to see
 what's referencing all the memory.

 On Fri, May 6, 2011 at 12:25 PM, Serediuk, Adam
  wrote:

> We're troubleshooting a memory usage problem during batch reads. We've
> spent the last few days profiling and trying different GC settings. The
> symptoms are that after a certain amount of time during reads one or more
> nodes in the cluster will exhibit extreme memory pressure followed by a gc
> storm. We've tried every possible JVM setting and different GC methods and
> the issue persists. This is pointing towards something instantiating a lot
> of objects and keeping references so that they can't be cleaned up.
>
> Typically nothing is ever logged other than the GC failures however
> just now one of the nodes emitted logs we've never seen before:
>
>  INFO [ScheduledTasks:1] 2011-05-06 15:04:55,085 StorageService.java
> (line 2218) Unable to reduce heap usage since there are no dirty column
> families
>
> We have tried increasing the heap on these nodes to large values, eg
> 24GB and still run into the same issue. We're running 8GB of heap normally
> and only one or two nodes will ever exhibit this issue, randomly. We don't
> use key/row caching and our memtable sizing is 64mb/0.3. Larger or smaller
> memtables make no difference in avoiding the issue. We're on 0.7.5, mmap,
> jna and jdk 1.6.0_24
>
> We've somewhat hit the wall in troubleshooting and any advice is
> greatly appreciated.
>
> --
> Adam
>
>


 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com


>>>
>>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>


Re: Excessive allocation during hinted handoff

2011-05-12 Thread Terje Marthinussen
And if you have 10 nodes, do all of them happen to send hints to the two with
GC?

Terje

On Thu, May 12, 2011 at 6:10 PM, Terje Marthinussen  wrote:

> Just out of curiosity is this on the receiver or sender side?
>
> I have been wondering a bit if the hint playback could need some
> adjustment.
> There is potentially quite big differences on how much is sent per throttle
> delay time depending on what your data looks like.
>
> Early 0.7 releases also built up hints very easily under load due to nodes
> quickly getting marked as down due to gossip sharing the same thread as many
> other operations.
>
> Terje
>
> On Thu, May 12, 2011 at 1:28 PM, Jonathan Ellis  wrote:
>
>> Doesn't really look abnormal to me for a heavy write load situation
>> which is what "receiving hints" is.
>>
>> On Wed, May 11, 2011 at 1:55 PM, Gabriel Tataranu 
>> wrote:
>> > Greetings,
>> >
>> > I'm experiencing some issues with 2 nodes (out of more than 10). Right
>> > after startup (Listening for thrift clients...) the nodes will create
>> > objects at high rate using all available CPU cores:
>> >
>> >  INFO 18:13:15,350 GC for PS Scavenge: 292 ms, 494902976 reclaimed
>> > leaving 2024909864 used; max is 6658457600
>> >  INFO 18:13:20,393 GC for PS Scavenge: 252 ms, 478691280 reclaimed
>> > leaving 2184252600 used; max is 6658457600
>> > 
>> >  INFO 18:15:23,909 GC for PS Scavenge: 283 ms, 452943472 reclaimed
>> > leaving 5523891120 used; max is 6658457600
>> >  INFO 18:15:24,912 GC for PS Scavenge: 273 ms, 466157568 reclaimed
>> > leaving 5594606128 used; max is 6658457600
>> >
>> > This will eventually trigger old-gen GC and then the process repeats
>> > until hinted handoff finishes.
>> >
>> > The build version was updated from 0.7.2 to 0.7.5 but the behavior was
>> > exactly the same.
>> >
>> > Thank you.
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>
>


Re: Excessive allocation during hinted handoff

2011-05-12 Thread Terje Marthinussen
Just out of curiosity is this on the receiver or sender side?

I have been wondering a bit if the hint playback could need some
adjustment.
There are potentially quite big differences in how much is sent per throttle
delay time depending on what your data looks like.
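
To illustrate the point: something byte-based would make the throttle
independent of row size. A rough sketch of the idea with invented names, not
the actual hint playback code:

  // Sketch: throttle hint playback by bytes per second rather than by a fixed
  // sleep per row, so wide and narrow rows are treated the same.
  public class ByteThrottle {
      private final long bytesPerSecond;
      private long sentInWindow = 0;
      private long windowStart = System.currentTimeMillis();

      public ByteThrottle(long bytesPerSecond) {
          this.bytesPerSecond = bytesPerSecond;
      }

      // Call once per row/mutation sent, with the serialized size of that row.
      public void acquire(long bytes) throws InterruptedException {
          long now = System.currentTimeMillis();
          if (now - windowStart >= 1000) {
              // New one-second window.
              windowStart = now;
              sentInWindow = 0;
          }
          sentInWindow += bytes;
          if (sentInWindow > bytesPerSecond) {
              // Budget for this window is spent; wait out the rest of it.
              Thread.sleep(1000 - (now - windowStart));
              windowStart = System.currentTimeMillis();
              sentInWindow = 0;
          }
      }
  }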

Early 0.7 releases also built up hints very easily under load due to nodes
quickly getting marked as down due to gossip sharing the same thread as many
other operations.

Terje

On Thu, May 12, 2011 at 1:28 PM, Jonathan Ellis  wrote:

> Doesn't really look abnormal to me for a heavy write load situation
> which is what "receiving hints" is.
>
> On Wed, May 11, 2011 at 1:55 PM, Gabriel Tataranu 
> wrote:
> > Greetings,
> >
> > I'm experiencing some issues with 2 nodes (out of more than 10). Right
> > after startup (Listening for thrift clients...) the nodes will create
> > objects at high rate using all available CPU cores:
> >
> >  INFO 18:13:15,350 GC for PS Scavenge: 292 ms, 494902976 reclaimed
> > leaving 2024909864 used; max is 6658457600
> >  INFO 18:13:20,393 GC for PS Scavenge: 252 ms, 478691280 reclaimed
> > leaving 2184252600 used; max is 6658457600
> > 
> >  INFO 18:15:23,909 GC for PS Scavenge: 283 ms, 452943472 reclaimed
> > leaving 5523891120 used; max is 6658457600
> >  INFO 18:15:24,912 GC for PS Scavenge: 273 ms, 466157568 reclaimed
> > leaving 5594606128 used; max is 6658457600
> >
> > This will eventually trigger old-gen GC and then the process repeats
> > until hinted handoff finishes.
> >
> > The build version was updated from 0.7.2 to 0.7.5 but the behavior was
> > exactly the same.
> >
> > Thank you.
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Re: column bloat

2011-05-11 Thread Terje Marthinussen
On Wed, May 11, 2011 at 8:06 AM, aaron morton wrote:

> For a reasonable large amount of use cases (for me, 2 out of 3 at the
> moment) supercolumns will be units of data where the columns (attributes)
> will never change by themselves or where the data does not change anyway
> (archived data).
>
>
> Can you use a standard CF and pack the multiple columns into one value in
> your app ? It sounds like the super columns are just acting as opaque
> containers, and cassandra does not need to know these are different values.
> Agree this only works if there is no concurrent access on the sub columns.
> I'm suggesting this with one eye on
> https://issues.apache.org/jira/browse/CASSANDRA-2231
>
>
I have a great interest in sharing data across applications using cassandra.
This means I also have a great interest in removing serialization from the
applications :)
That I can get reasonably far without serialization logic in the application
is one of the main reasons I am working on Cassandra.

Yes, I have had this discussion before so I know the next suggestion would
be to build an API on top doing the serialization, but that will further
complicate things if I want to integrate with hadoop or other similar tools,
so why should I if I don't have to? :)

> It would seem like a good optimization to allow a timestamp on the
> supercolumn instead and remove the one on columns?
>
> I believe this may also work as an optimization on compactions? Just skip
> merging of columns under the supercolumn if the supercolumn has a timestamp
> and just replace the entire supercolumn in that case.
>
> Could be just a variation of the supercolumn object on insert. No
> timestamp, use the one in the columns, include timestamp, ignore timestamps
> in columns.
>
>
> SC's are more containers than columns, when it comes to reconciling their
> contents they act like column families: ask the columns to reconcile
> respecting the containers tombstone. Giving the SC a timestamp and making
> them act like columns would be a major change.
>

Not so sure it would be a major change, but if we can make the assumption
that people (or APIs) will be smart enough to feed data where all columns
have the same timestamp if they want to save some disk, I guess this can be
compressed quite efficiently anyway.
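
As a rough sketch of the serialization variation I mean (an invented format,
not the current on-disk layout): a flag on the supercolumn says whether the
subcolumns carry their own timestamps or inherit one shared value:

  import java.io.DataOutput;
  import java.io.IOException;
  import java.util.List;

  public class SuperColumnWriter {
      // Hypothetical column holder, just for the sketch.
      public static class Col {
          public final byte[] name;
          public final byte[] value;
          public Col(byte[] name, byte[] value) { this.name = name; this.value = value; }
      }

      private static final int SHARED_TIMESTAMP = 0x01;

      // Serialize a supercolumn whose subcolumns all share one timestamp:
      // the timestamp is written once instead of 8 bytes per subcolumn.
      public static void write(DataOutput out, byte[] scName, long timestamp, List<Col> cols)
              throws IOException {
          out.writeShort(scName.length);
          out.write(scName);
          out.writeByte(SHARED_TIMESTAMP);
          out.writeLong(timestamp);          // once, for all subcolumns
          out.writeInt(cols.size());
          for (Col c : cols) {
              out.writeShort(c.name.length);
              out.write(c.name);
              out.writeInt(c.value.length);  // no per-column timestamp/TTL in this variant
              out.write(c.value);
          }
      }
  }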

Terje


Re: compaction strategy

2011-05-11 Thread Terje Marthinussen
>
>
> Not sure I follow you. 4 sstables is the minimum compaction look for
> (by default).
> If there is 30 sstables of ~20MB sitting there because compaction is
> behind, you
> will compact those 30 sstables together (unless there is not enough space
> for
> that and considering you haven't changed the max compaction threshold (32
> by
> default)). And you can increase max threshold.
> Don't get me wrong, I'm not pretending this works better than it does, but
> let's not pretend either that it's worth than it is.
>
>
Sorry, I am not trying to pretend anything or blow it out of proportion.
Just reacting to what I see.

This is what I see after some stress testing of some pretty decent HW.

81  Up  Normal  181.6 GB   8.33%  Token(bytes[30])
82  Up  Normal  501.43 GB  8.33%  Token(bytes[313230])
83  Up  Normal  248.07 GB  8.33%  Token(bytes[313437])
84  Up  Normal  349.64 GB  8.33%  Token(bytes[313836])
85  Up  Normal  511.55 GB  8.33%  Token(bytes[323336])
86  Up  Normal  654.93 GB  8.33%  Token(bytes[333234])
87  Up  Normal  534.77 GB  8.33%  Token(bytes[333939])
88  Up  Normal  525.88 GB  8.33%  Token(bytes[343739])
89  Up  Normal  476.6 GB   8.33%  Token(bytes[353730])
90  Up  Normal  424.89 GB  8.33%  Token(bytes[363635])
91  Up  Normal  338.14 GB  8.33%  Token(bytes[383036])
92  Up  Normal  546.95 GB  8.33%  Token(bytes[6a])

Node 81 has been through a full compaction. It had ~370GB before that and the
resulting sstable is 165GB.
The other nodes have only been doing minor compactions.

I think this is a problem.
You are of course free to disagree.

I do however recommend simulating potential worst case scenarios
where many of the buckets end up with 3 sstables and don't compact for a while.
The disk space requirements get pretty bad even without getting into
theoretical worst cases.

Regards,
Terje


Re: compaction strategy

2011-05-10 Thread Terje Marthinussen
> Everyone may be well aware of that, but I'll still remark that a minor
> compaction
> will try to merge "as many 20MB sstables as it can" up to the max
> compaction
> threshold (which is configurable). So if you do accumulate some newly
> created
> sstable at some point in time, the next minor compaction will take all of
> them
> and thus not create a 40 MB sstable, then 80MB etc... Sure there will be
> more
> step than with a major compaction, but let's keep in mind we don't
> merge sstables
> 2 by 2.
>

Well, you do kind of merge them 2 by 2 as you look for at least 4 at a time
;)
But yes, 20MB should become at least 80MB. Still quite a few hops to reach
100GB.

I'm also not too much in favor of triggering major compactions,
> because it mostly
> have a nasty effect (create one huge sstable). Now maybe we could expose
> the
> difference factor for which we'll consider sstables in the same bucket
>

The nasty side effect I am scared of is disk space and to keep the disk
space under control, I need to get down to 1 file.

As an example:
2 days ago, I looked at a system that had gone idle from compaction with
something like 24 sstables.
Disk use was 370GB.

After manually triggering full compaction,  I was left with a single sstable
which is 164 GB large.

This means I may need more than 3x the full dataset to survive if certain
nasty events such as repairs or anti compactions should occur.
Way more than the recommended 2x.

In the same system, I see nodes reaching up towards 900GB during compaction
and 5-600GB otherwise.
This is with OPP, so distribution is not 100% perfect, but I expect these
5-600GB nodes to compact down to the <200GB area if a full compaction is
triggered.

That is way way beyond the recommendation to have 2x the disk space.

You may disagree, but I think this is a problem.
Either we need to recommend 3-5x the best case disk usage or we need to fix
cassandra.

A simple improvement initially may be to change the bucketing strategy if
you cannot find suitable candidates.
I believe lucene for instance has a strategy where it can mix a set of small
index fragments with one large.
This may be possible to consider as a fallback strategy and just let
cassandra compact down to 1 file whenever it can.

Ultimately, I think segmenting on token space is the only way to fix this.
That segmentation could be done by building histograms of your token
distribution as you compact and the compaction can further adjust the
segments accordingly as full compactions take place.

This would seem simpler to do than a full vnode based infrastructure.
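
As a sketch of the shape of the idea (made-up names): track how much has been
written during compaction and cut the output at token boundaries every so many
bytes, so later compactions can run per segment instead of over the whole
token space:

  import java.util.ArrayList;
  import java.util.List;

  public class SegmentedCompactionWriter {
      private final long targetBytesPerSegment;
      private long bytesInCurrentSegment = 0;
      private final List<String> segmentBoundaryTokens = new ArrayList<String>();

      public SegmentedCompactionWriter(long targetBytesPerSegment) {
          this.targetBytesPerSegment = targetBytesPerSegment;
      }

      // Called for every row written during compaction, in token order.
      public void rowWritten(String token, long rowSizeOnDisk) {
          bytesInCurrentSegment += rowSizeOnDisk;
          if (bytesInCurrentSegment >= targetBytesPerSegment) {
              // Close the current output sstable here and start a new one;
              // remember the token so the segment boundaries are known later.
              segmentBoundaryTokens.add(token);
              bytesInCurrentSegment = 0;
          }
      }

      public List<String> boundaries() {
          return segmentBoundaryTokens;
      }
  }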

Terje


Re: column bloat

2011-05-10 Thread Terje Marthinussen
> Anyway, to sum that up, expiring columns are 1 byte more and
> non-expiring ones are 7 bytes
> less. Not arguing, it's still fairly verbose, especially with tons of
> very small columns.
>

Yes, you are right, sorry.
Trying to do one thing too many at the same time.
My brain filtered out part of the "else if".


>
> > - inherit timestamps from the supercolumn
>
> Columns inside a supercolumn have no reason to share the same timestamp (or
> even close ones for that matter). But maybe you're talking about something
> more
> subtle, in which case yes there is ways to compress the data.
>

For a reasonably large number of use cases (for me, 2 out of 3 at the
moment) supercolumns will be units of data where the columns (attributes)
will never change by themselves or where the data does not change anyway
(archived data).

It would seem like a good optimization to allow a timestamp on the
supercolumn instead and remove the one on columns?

I believe this may also work as an optimization on compactions? Just skip
merging of columns under the supercolumn if the supercolumn has a timestamp
and just replace the entire supercolumn in that case.

Could be just a variation of the supercolumn object on insert. No timestamp,
use the one in the columns, include timestamp, ignore timestamps in columns.

If that sounds like a sensible idea, I may be tempted to try to get time to
implement it.

I am also tempted to do some other things like make some of the "ints" and
"shorts" variable length as well.

Terje


column bloat

2011-05-10 Thread Terje Marthinussen
Hi,

If you make a supercolumn today, what you end up with is:
- short  + "Super Column name"
- int (local deletion time)
- long (delete time)
Byte array of  columns each with:
  - short + "column name"
  - int (TTL)
  - int (local deletion time)
  - long (timestamp)
  - int + "value of column"

That is, metadata and serialization overhead adds up to:
2+4+8 = 14 bytes for the supercolumn
2+4+4+8+4 = 22 bytes for each column the supercolumn has

Yes, disk space is cheap and all that, but trying to handle a few billion
supercolumns which each have some 30-50 subcolumns, I am looking at some
1.2-1.5TB of metadata, which makes the metadata by itself some 3-4 times the
original data. That does seem a bit excessive when you also throw in RF=3 and
the requirement for extra disk space to safely survive compactions.
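
Just to make the arithmetic concrete (numbers picked for illustration): 2
billion supercolumns with 30 subcolumns each gives 2e9 x (14 + 30 x 22) bytes
= 2e9 x 674 bytes, roughly 1.35TB of pure overhead before replication, which
is right in that 1.2-1.5TB ballpark.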

And yes, this is without considering the overhead of column names.

I can see a handful of way to reduce this quite a bit, for instance by:
- not adding TTL/deletion time if not needed (some compact bitmap structure
to turn on/off fields?)
- inherit timestamps from the supercolumn

There may also be some interesting ways to compress this data assuming that
the timestamps are generally in the same time areas (shared "prefixes"
for instance) , but that gets a bit more complex.

Any opinions or plans?
Sorry, I could not find any JIRA's on the topic, but I guess I am not
surprised if it exists.

Yes, I could serialize this myself outside of cassandra, but that would sort
of defeat the purpose of using a more advanced storage system like
cassandra.

Regards,
Terje


Re: compaction strategy

2011-05-09 Thread Terje Marthinussen
Sorry, I was referring to the claim that "one big file" was a problem, not
the non-overlapping part.

If you never compact to a single file, you never get rid of all
generations/duplicates.
With non-overlapping files covering small enough token ranges, compacting
down to one file is not a big issue.

Terje

On Mon, May 9, 2011 at 8:52 PM, David Boxenhorn  wrote:

> If they each have their own copy of the data, then they are *not*
> non-overlapping!
>
> If you have non-overlapping SSTables (and you know the min/max keys), it's
> like having one big SSTable because you know exactly where each row is, and
> it becomes easy to merge a new SSTable in small batches, rather than in one
> huge batch.
>
> The only step that you have to add to the current merge process is, when
> you going to write a new SSTable, if it's too big, to write N
> (non-overlapping!) pieces instead.
>
>
> On Mon, May 9, 2011 at 12:46 PM, Terje Marthinussen <
> tmarthinus...@gmail.com> wrote:
>
>> Yes, agreed.
>>
>> I actually think cassandra has to.
>>
>> And if you do not go down to that single file, how do you avoid getting
>> into a situation where you can very realistically end up with 4-5 big
>> sstables each having its own copy of the same data massively increasing disk
>> requirements?
>>
>> Terje
>>
>> On Mon, May 9, 2011 at 5:58 PM, David Boxenhorn wrote:
>>
>>> "I'm also not too much in favor of triggering major compactions, because
>>> it mostly have a nasty effect (create one huge sstable)."
>>>
>>> If that is the case, why can't major compactions create many,
>>> non-overlapping SSTables?
>>>
>>> In general, it seems to me that non-overlapping SSTables have all the
>>> advantages of big SSTables (i.e. you know exactly where the data is) without
>>> the disadvantages that come with being big. Why doesn't Cassandra take
>>> advantage of that in a major way?
>>>
>>
>>
>


Re: compaction strategy

2011-05-09 Thread Terje Marthinussen
Yes, agreed.

I actually think cassandra has to.

And if you do not go down to that single file, how do you avoid getting into
a situation where you can very realistically end up with 4-5 big sstables
each having its own copy of the same data massively increasing disk
requirements?

Terje

On Mon, May 9, 2011 at 5:58 PM, David Boxenhorn  wrote:

> "I'm also not too much in favor of triggering major compactions, because it
> mostly have a nasty effect (create one huge sstable)."
>
> If that is the case, why can't major compactions create many,
> non-overlapping SSTables?
>
> In general, it seems to me that non-overlapping SSTables have all the
> advantages of big SSTables (i.e. you know exactly where the data is) without
> the disadvantages that come with being big. Why doesn't Cassandra take
> advantage of that in a major way?
>


Re: compaction strategy

2011-05-07 Thread Terje Marthinussen
This is an all-SSD system. I have no problems with read/write performance
due to I/O.
I do have a potential problem with the crazy explosion you can get in terms of disk
use if compaction cannot keep up.

As things fall behind and you get many generations of data, yes, read
performance becomes a problem due to the number of sstables.

As things start falling behind, you have a bunch of minor compactions trying
to merge 20MB sstables (what cassandra generally dumps with the current config when
under pressure) into 40MB, into 80MB, and so on.

Anyone want to do the math on how many times you are rewriting the data
going this route?
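
A rough answer to my own question: with 20MB flushes and the default of 4
similar-sized files per merge, every merge level multiplies the sstable size
by about 4, so getting from 20MB to 100GB takes roughly log base 4 of 5000,
i.e. about 6 levels. In other words every byte gets read and rewritten about
6 times even in the best case, and more once compaction falls behind and the
buckets get merged unevenly.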

There is just no way this can keep up. It will just fall more and more
behind.
The only way to recover, as far as I can see, would be to trigger a full compaction?

It does not really make sense to me to go through all these minor merges
when a full compaction will do a much faster and better job.

Terje

On Sat, May 7, 2011 at 9:54 PM, Jonathan Ellis  wrote:

> On Sat, May 7, 2011 at 2:01 AM, Terje Marthinussen
>  wrote:
> > 1. Would it make sense to make full compactions occur a bit more
> aggressive.
>
> I'd rather reduce the performance impact of being behind, than do more
> full compactions: https://issues.apache.org/jira/browse/CASSANDRA-2498
>
> > 2. I
> > would think the code should be smart enough to either trigger a full
> > compaction and scrap the current queue, or at least merge some of those
> > pending tasks into larger ones
>
> Not crazy but a queue-rewriter would be nontrivial. For now I'm okay
> with saying "add capacity until compaction can mostly keep up." (Most
> people's problem is making compaction LESS aggressive, hence
> https://issues.apache.org/jira/browse/CASSANDRA-2156.)
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


compaction strategy

2011-05-07 Thread Terje Marthinussen
Even with the current concurrent compactions, given a high speed data feed,
compactions will obviously start lagging at some stage, and once they do,
things can turn bad in terms of disk usage and read performance.

I have not read the compaction code well, but if
http://wiki.apache.org/cassandra/MemtableSSTable is up to date, I am
wondering:
1. Would it make sense to make full compactions occur a bit more aggressively?
That is, regardless of sstables of matching sizes, if the total amount of
outstanding sstable data gets above a certain size, would it make sense to
just schedule a full compaction rather than go through all the hoops of
gradually merging them in groups of matching sizes?

2. This is same topic, just another viewpoint. When you get to the stage
that compactionstats shows you something crazy like "pending tasks: 600", I
would think the code should be smart enough to either trigger a full
compaction and scrap the current queue, or at least merge some of those
pending tasks into larger ones instead of reading and writing the same data
again and again gradually merging it into larger and larger sizes?

The target behind both 1 and 2 would be to reduce the number of times data is
re-read and re-written in compactions before you reach the full dataset
size.
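
As a strawman for point 2 (invented names, only to illustrate the heuristic):
once the backlog passes some limit, collapse the queued minor tasks into one
larger task per column family instead of replaying the same data level by
level:

  import java.util.Collections;
  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  public class CompactionQueueCollapser {
      public static class PendingTask {
          final String columnFamily;
          final List<String> sstableFiles;
          public PendingTask(String cf, List<String> files) {
              columnFamily = cf;
              sstableFiles = files;
          }
      }

      // Returns one merged set of sstable files per column family if the queue
      // is too long; an empty map means "leave the queue alone".
      public static Map<String, Set<String>> collapse(List<PendingTask> queue, int pendingLimit) {
          if (queue.size() <= pendingLimit)
              return Collections.emptyMap();
          Map<String, Set<String>> merged = new HashMap<String, Set<String>>();
          for (PendingTask t : queue) {
              Set<String> files = merged.get(t.columnFamily);
              if (files == null) {
                  files = new HashSet<String>();
                  merged.put(t.columnFamily, files);
              }
              files.addAll(t.sstableFiles);   // one big task instead of many small ones
          }
          return merged;
      }
  }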


Re: MemtablePostFlusher with high number of pending calls?

2011-05-04 Thread Terje Marthinussen
Yes, some sort of data structure to coordinate this could reduce the problem
as well.
I made some comments on that at the end of 2558.

I believe a coordinator could be in place both to
- plan the start of compaction
and
- to coordinate compaction thread shutdown and tmp file deletion before we
completely run out of disk space
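
Something like a shared reservation ledger is what I have in mind for the disk
space side. A sketch with made-up names, not actual Cassandra code:

  import java.io.File;

  // Sketch: reserve the estimated output size before starting a compaction so
  // concurrent compaction threads cannot collectively over-commit the disk.
  public class DiskReservations {
      private long reservedBytes = 0;

      public synchronized boolean tryReserve(long estimatedOutputBytes, File dataDir) {
          long free = dataDir.getUsableSpace();
          if (free - reservedBytes < estimatedOutputBytes)
              return false;   // caller should defer, or pick a smaller set of sstables
          reservedBytes += estimatedOutputBytes;
          return true;
      }

      public synchronized void release(long estimatedOutputBytes) {
          reservedBytes -= estimatedOutputBytes;
      }
  }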

Regards,
Terje

On Wed, May 4, 2011 at 10:09 PM, Jonathan Ellis  wrote:

> Or we could "reserve" space when starting a compaction.
>
> On Wed, May 4, 2011 at 2:32 AM, Terje Marthinussen
>  wrote:
> > Partially, I guess this may be a side effect of multithreaded
> compactions?
> > Before running out of space completely, I do see a few of these:
> >  WARN [CompactionExecutor:448] 2011-05-02 01:08:10,480
> > CompactionManager.java (line 516) insufficient space to compact all
> > requested files SSTableReader(path='/data/cassandra/JP_MALL_P
> > H/Order-f-12858-Data.db'),
> > SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12851-Data.db'),
> > SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12864-Data.db')
> >  INFO [CompactionExecutor:448] 2011-05-02 01:08:10,481
> StorageService.java
> > (line 2066) requesting GC to free disk space
> > In this case, there would be 24 threads that asked if there was empty
> disk
> > space.
> > Most of them probably succeeded in that request, but they could have
> > requested 24x available space in theory since I do not think there is any
> > global pool of used disk in place that manages which how much disk space
> > will be needed for already started compactions?
> > Of course, regardless how much checking there is in advance, we could
> still
> > run out of disk, so I guess there is also a need for checking if
> diskspace
> > is about to run out while compaction runs so things may be
> halted/aborted.
> > Unfortunately that would need global coordination so we do not stop all
> > compaction threads
> > After reducing to 6 compaction threads in 0.8 beta2, the data has
> compacted
> > just fine without any disk space issues, so I guess another problem you
> may
> > hit as you get a lot of sstables which have updates (that is, duplicates)
> to
> > the same data, is that of course, the massively concurrent compaction
> taking
> > place with nproc threads could also concurrently duplicate all the
> > duplicates on a large scale.
> > Yes, this is in favour of multithreaded compaction as it should normally
> > help keeping sstables to a sane level and avoid such problems, but it is
> > unfortunately just a kludge to the real problem which is to segment the
> > sstables somehow on keyspace so we can get down the disk requirements and
> > recover from scenarios where disk gets above 50%.
> > Regards,
> > Terje
> >
> >
> > On Wed, May 4, 2011 at 3:33 PM, Terje Marthinussen <
> tmarthinus...@gmail.com>
> > wrote:
> >>
> >> Well, just did not look at these logs very well at all last night
> >> First out of disk message:
> >> ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> >> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >> Thread[CompactionExecutor:387,1,main]
> >> java.io.IOException: No space left on device
> >> Then finally the last one
> >> ERROR [FlushWriter:128] 2011-05-02 01:51:06,112
> >> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >> Thread[FlushWriter:128,5,main]
> >> java.lang.RuntimeException: java.lang.RuntimeException: Insufficient
> disk
> >> space to flush 554962 bytes
> >> at
> >> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> >> at
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >> at
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >> at java.lang.Thread.run(Thread.java:662)
> >> Caused by: java.lang.RuntimeException: Insufficient disk space to flush
> >> 554962 bytes
> >> at
> >>
> org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> >> at
> >>
> org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
> >> at
> >> org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
> >> at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
> >> at
> >> org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable

Re: MemtablePostFlusher with high number of pending calls?

2011-05-04 Thread Terje Marthinussen
Partially, I guess this may be a side effect of multithreaded compactions?

Before running out of space completely, I do see a few of these:
 WARN [CompactionExecutor:448] 2011-05-02 01:08:10,480
CompactionManager.java (line 516) insufficient space to compact all
requested files SSTableReader(path='/data/cassandra/JP_MALL_P
H/Order-f-12858-Data.db'),
SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12851-Data.db'),
SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12864-Data.db')
 INFO [CompactionExecutor:448] 2011-05-02 01:08:10,481 StorageService.java
(line 2066) requesting GC to free disk space

In this case, there would be 24 threads that checked whether there was enough
free disk space.

Most of them probably succeeded in that check, but in theory they could have
requested 24x the available space, since I do not think there is any global
pool of used disk space in place that tracks how much disk space will be
needed for already-started compactions?

Of course, regardless of how much checking there is in advance, we could still
run out of disk, so I guess there is also a need to check whether disk space
is about to run out while a compaction runs, so things can be halted/aborted.
Unfortunately that would need global coordination so we do not stop all
compaction threads.

After reducing to 6 compaction threads in 0.8 beta2, the data has compacted
just fine without any disk space issues. Another problem you may hit as you
get a lot of sstables containing updates (that is, duplicates) to the same
data is that the massively concurrent compaction taking place with nproc
threads can also concurrently duplicate all of those duplicates on a large
scale.

Yes, this is an argument in favour of multithreaded compaction, as it should
normally help keep sstables at a sane level and avoid such problems, but it is
unfortunately just a kludge for the real problem, which is to segment the
sstables somehow per keyspace so we can get the disk requirements down and
recover from scenarios where disk usage gets above 50%.
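
As a rough illustration of the 50% remark, assuming the worst case where the
merged output is as large as its inputs (nothing purged or deduplicated):

# Worst-case view of why disk usage above ~50% is hard to recover from:
# a full compaction has to keep both the input sstables and the new output
# on disk until it finishes, so it needs free space roughly equal to the
# live data it is compacting.
def can_full_compact(disk_bytes, live_bytes):
    free = disk_bytes - live_bytes
    return free >= live_bytes        # room for a full-size copy of the data

print(can_full_compact(1000, 450))   # True  -- 45% full
print(can_full_compact(1000, 600))   # False -- 60% full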

Regards,
Terje



On Wed, May 4, 2011 at 3:33 PM, Terje Marthinussen
wrote:

> Well, just did not look at these logs very well at all last night
> First out of disk message:
> ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> Thread[CompactionExecutor:387,1,main]
> java.io.IOException: No space left on device
>
> Then finally the last one
> ERROR [FlushWriter:128] 2011-05-02 01:51:06,112
> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> Thread[FlushWriter:128,5,main]
> java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
> space to flush 554962 bytes
> at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.RuntimeException: Insufficient disk space to flush
> 554962 bytes
> at
> org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> at
> org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
> at
> org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
> at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
> at
> org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
> at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> ... 3 more
>  INFO [CompactionExecutor:451] 2011-05-02 01:51:06,113 StorageService.java
> (line 2066) requesting GC to free disk space
> [lots of sstables deleted]
>
> After this is starts running again (although not fine it seems).
>
> So the disk seems to have been full for 35 minutes due to un-deleted
> sstables.
>
> Terje
>
> On Wed, May 4, 2011 at 6:34 AM, Terje Marthinussen <
> tmarthinus...@gmail.com> wrote:
>
>> Hm... peculiar.
>>
>> Post flush is not involved in compactions, right?
>>
>> May 2nd
>> 01:06 - Out of disk
>> 01:51 - Starts a mix of major and minor compactions on different column
>> families
>> It then starts a few minor compactions extra over the day, but given that
>> there are more than 1000 sstables, and we are talking 3 minor compactions
>> started, it is not normal I think.
>> May 3rd 1 minor compaction started.
>>
>> When I checked today, there was a bunch of tmp files on the disk with last
>> modify time from 01:something on may 2nd and 200GB empty disk...
>>
>> Definitely no compaction going on.
>&g

Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Well, I just did not look at these logs very well at all last night.
First out of disk message:
ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
AbstractCassandraDaemon.java (line 112) Fatal exception in thread
Thread[CompactionExecutor:387,1,main]
java.io.IOException: No space left on device

Then finally the last one
ERROR [FlushWriter:128] 2011-05-02 01:51:06,112 AbstractCassandraDaemon.java
(line 112) Fatal exception in thread Thread[FlushWriter:128,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
space to flush 554962 bytes
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Insufficient disk space to flush
554962 bytes
at
org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
at
org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
at
org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
... 3 more
 INFO [CompactionExecutor:451] 2011-05-02 01:51:06,113 StorageService.java
(line 2066) requesting GC to free disk space
[lots of sstables deleted]

After this it starts running again (although not fine, it seems).

So the disk seems to have been full for 35 minutes due to un-deleted
sstables.

Terje

On Wed, May 4, 2011 at 6:34 AM, Terje Marthinussen
wrote:

> Hm... peculiar.
>
> Post flush is not involved in compactions, right?
>
> May 2nd
> 01:06 - Out of disk
> 01:51 - Starts a mix of major and minor compactions on different column
> families
> It then starts a few minor compactions extra over the day, but given that
> there are more than 1000 sstables, and we are talking 3 minor compactions
> started, it is not normal I think.
> May 3rd 1 minor compaction started.
>
> When I checked today, there was a bunch of tmp files on the disk with last
> modify time from 01:something on may 2nd and 200GB empty disk...
>
> Definitely no compaction going on.
> Guess I will add some debug logging and see if I get lucky and run out of
> disk again.
>
> Terje
>
> On Wed, May 4, 2011 at 5:06 AM, Jonathan Ellis  wrote:
>
>> Compaction does, but flush didn't until
>> https://issues.apache.org/jira/browse/CASSANDRA-2404
>>
>> On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen
>>  wrote:
>> > Yes, I realize that.
>> > I am bit curious why it ran out of disk, or rather, why I have 200GB
>> empty
>> > disk now, but unfortunately it seems like we may not have had monitoring
>> > enabled on this node to tell me what happened in terms of disk usage.
>> > I also thought that compaction was supposed to resume (try again with
>> less
>> > data) if it fails?
>> > Terje
>> >
>> > On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis 
>> wrote:
>> >>
>> >> post flusher is responsible for updating commitlog header after a
>> >> flush; each task waits for a specific flush to complete, then does its
>> >> thing.
>> >>
>> >> so when you had a flush catastrophically fail, its corresponding
>> >> post-flush task will be stuck.
>> >>
>> >> On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
>> >>  wrote:
>> >> > Just some very tiny amount of writes in the background here (some
>> hints
>> >> > spooled up on another node slowly coming in).
>> >> > No new data.
>> >> >
>> >> > I thought there was no exceptions, but I did not look far enough back
>> in
>> >> > the
>> >> > log at first.
>> >> > Going back a bit further now however, I see that about 50 hours ago:
>> >> > ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
>> >> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
>> >> > Thread[CompactionExecutor:387,1,main]
>> >> > java.io.IOException: No space left on device
>> >> > at java.io.RandomAccessFile.writeBytes(Native Method)
>> >> > at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
>> >> > at
>> >> >
>> >> >
>> org.

Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Hm... peculiar.

Post flush is not involved in compactions, right?

May 2nd
01:06 - Out of disk
01:51 - Starts a mix of major and minor compactions on different column
families
It then starts a few extra minor compactions over the day, but given that
there are more than 1000 sstables and we are talking about 3 minor compactions
started, I don't think that is normal.
May 3rd: 1 minor compaction started.

When I checked today, there was a bunch of tmp files on the disk with last
modified times from 01:something on May 2nd, and 200GB of empty disk...

Definitely no compaction going on.
Guess I will add some debug logging and see if I get lucky and run out of
disk again.

Terje

On Wed, May 4, 2011 at 5:06 AM, Jonathan Ellis  wrote:

> Compaction does, but flush didn't until
> https://issues.apache.org/jira/browse/CASSANDRA-2404
>
> On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen
>  wrote:
> > Yes, I realize that.
> > I am bit curious why it ran out of disk, or rather, why I have 200GB
> empty
> > disk now, but unfortunately it seems like we may not have had monitoring
> > enabled on this node to tell me what happened in terms of disk usage.
> > I also thought that compaction was supposed to resume (try again with
> less
> > data) if it fails?
> > Terje
> >
> > On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis 
> wrote:
> >>
> >> post flusher is responsible for updating commitlog header after a
> >> flush; each task waits for a specific flush to complete, then does its
> >> thing.
> >>
> >> so when you had a flush catastrophically fail, its corresponding
> >> post-flush task will be stuck.
> >>
> >> On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
> >>  wrote:
> >> > Just some very tiny amount of writes in the background here (some
> hints
> >> > spooled up on another node slowly coming in).
> >> > No new data.
> >> >
> >> > I thought there was no exceptions, but I did not look far enough back
> in
> >> > the
> >> > log at first.
> >> > Going back a bit further now however, I see that about 50 hours ago:
> >> > ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> >> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >> > Thread[CompactionExecutor:387,1,main]
> >> > java.io.IOException: No space left on device
> >> > at java.io.RandomAccessFile.writeBytes(Native Method)
> >> > at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
> >> > at
> >> >
> >> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
> >> > at
> >> >
> >> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
> >> > at
> >> >
> >> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
> >> > at
> >> >
> >> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
> >> > at
> >> >
> org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
> >> > at
> >> >
> >> >
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
> >> > at
> >> >
> >> >
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
> >> > at
> >> >
> >> >
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
> >> > at
> >> >
> >> >
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
> >> > at
> >> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >> > at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >> > at
> >> >
> >> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >> > at
> >> >
> >> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >> > at java.lang.Thread.run(Thread.java:662)
> >> > [followed by a few more of those...]
> >> > and then a bunch of these:
> >> > ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
> >> > AbstractCassandraDaemon.java
> >> > (line 112) Fatal exce

Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Yes, I realize that.

I am a bit curious why it ran out of disk, or rather, why I have 200GB of
empty disk now, but unfortunately it seems we may not have had monitoring
enabled on this node to tell me what happened in terms of disk usage.

I also thought that compaction was supposed to resume (try again with less
data) if it fails?

Terje

On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis  wrote:

> post flusher is responsible for updating commitlog header after a
> flush; each task waits for a specific flush to complete, then does its
> thing.
>
> so when you had a flush catastrophically fail, its corresponding
> post-flush task will be stuck.
>
> On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
>  wrote:
> > Just some very tiny amount of writes in the background here (some hints
> > spooled up on another node slowly coming in).
> > No new data.
> >
> > I thought there was no exceptions, but I did not look far enough back in
> the
> > log at first.
> > Going back a bit further now however, I see that about 50 hours ago:
> > ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> > Thread[CompactionExecutor:387,1,main]
> > java.io.IOException: No space left on device
> > at java.io.RandomAccessFile.writeBytes(Native Method)
> > at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
> > at
> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
> > at
> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
> > at
> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
> > at
> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
> > at
> > org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
> > at
> >
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
> > at
> >
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
> > at
> >
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
> > at
> >
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
> > at
> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > at java.lang.Thread.run(Thread.java:662)
> > [followed by a few more of those...]
> > and then a bunch of these:
> > ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
> AbstractCassandraDaemon.java
> > (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
> > java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
> > space to flush 40009184 bytes
> > at
> > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > at java.lang.Thread.run(Thread.java:662)
> > Caused by: java.lang.RuntimeException: Insufficient disk space to flush
> > 40009184 bytes
> > at
> >
> org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> > at
> >
> org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
> > at
> > org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
> > at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
> > at
> org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
> > at
> > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> > ... 3 more
> > Seems like compactions stopped after this (a bunch of tmp tables there
> still
> > from when those errors where generated), and I can only suspect the post
> > flusher may have stopped at the same time.
> > There is 890GB of disk for data, sstables are currently using 604G (139GB
> is
> > old tmp tables from when it ran out of disk) and "ring

Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
So yes, there is currently some 200GB of empty disk.

On Wed, May 4, 2011 at 3:20 AM, Terje Marthinussen
wrote:

> Just some very tiny amount of writes in the background here (some hints
> spooled up on another node slowly coming in).
> No new data.
>
> I thought there was no exceptions, but I did not look far enough back in
> the log at first.
>
> Going back a bit further now however, I see that about 50 hours ago:
> ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> Thread[CompactionExecutor:387,1,main]
> java.io.IOException: No space left on device
> at java.io.RandomAccessFile.writeBytes(Native Method)
> at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
> at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
> at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
> at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
> at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
> at
> org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
> at
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
> at
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
> at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
> at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
> at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> [followed by a few more of those...]
>
> and then a bunch of these:
> ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> Thread[FlushWriter:123,5,main]
> java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
> space to flush 40009184 bytes
> at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.RuntimeException: Insufficient disk space to flush
> 40009184 bytes
> at
> org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> at
> org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
> at
> org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
> at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
> at
> org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
> at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> ... 3 more
>
> Seems like compactions stopped after this (a bunch of tmp tables there
> still from when those errors where generated), and I can only suspect the
> post flusher may have stopped at the same time.
>
> There is 890GB of disk for data, sstables are currently using 604G (139GB
> is old tmp tables from when it ran out of disk) and "ring" tells me the load
> on the node is 313GB.
>
> Terje
>
>
>
> On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis  wrote:
>
>> ... and are there any exceptions in the log?
>>
>> On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis  wrote:
>> > Does it resolve down to 0 eventually if you stop doing writes?
>> >
>> > On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
>> >  wrote:
>> >> Cassandra 0.8 beta trunk from about 1 week ago:
>> >> Pool NameActive   Pending  Completed
>> >> ReadStage 0 0  5
>> >> RequestResponseStage  0 0  87129
>> >> MutationStage 0 0 187298
>> >> ReadRepairStage   0 0  0
>> >> ReplicateOnWriteStage 0 0  0
>> >> GossipStage   0

Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Just some very tiny amount of writes in the background here (some hints
spooled up on another node slowly coming in).
No new data.

I thought there were no exceptions, but I did not look far enough back in the
log at first.

Going back a bit further now however, I see that about 50 hours ago:
ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
AbstractCassandraDaemon.java (line 112) Fatal exception in thread
Thread[CompactionExecutor:387,1,main]
java.io.IOException: No space left on device
at java.io.RandomAccessFile.writeBytes(Native Method)
at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
at
org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
at
org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
at
org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
at
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
at
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[followed by a few more of those...]

and then a bunch of these:
ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java
(line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
space to flush 40009184 bytes
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Insufficient disk space to flush
40009184 bytes
at
org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
at
org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
at
org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
... 3 more

It seems like compactions stopped after this (a bunch of tmp tables are still
there from when those errors were generated), and I can only suspect the post
flusher may have stopped at the same time.

There is 890GB of disk for data; sstables are currently using 604GB (139GB of
which is old tmp tables from when it ran out of disk) and "ring" tells me the
load on the node is 313GB.

Terje



On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis  wrote:

> ... and are there any exceptions in the log?
>
> On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis  wrote:
> > Does it resolve down to 0 eventually if you stop doing writes?
> >
> > On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
> >  wrote:
> >> Cassandra 0.8 beta trunk from about 1 week ago:
> >> Pool NameActive   Pending  Completed
> >> ReadStage 0 0  5
> >> RequestResponseStage  0 0  87129
> >> MutationStage 0 0 187298
> >> ReadRepairStage   0 0  0
> >> ReplicateOnWriteStage 0 0  0
> >> GossipStage   0 01353524
> >> AntiEntropyStage  0 0  0
> >> MigrationStage0 0 10
> >> MemtablePostFlusher   1   190108
> >> StreamStage   0 0  0
> >> FlushWriter   0 0302
> >> FILEUTILS-DELETE-POOL 0 0 26
> >> MiscStage 0 0  0
> >> Flush

MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Cassandra 0.8 beta trunk from about 1 week ago:

Pool Name                    Active   Pending      Completed
ReadStage                         0         0              5
RequestResponseStage              0         0          87129
MutationStage                     0         0         187298
ReadRepairStage                   0         0              0
ReplicateOnWriteStage             0         0              0
GossipStage                       0         0        1353524
AntiEntropyStage                  0         0              0
MigrationStage                    0         0             10
MemtablePostFlusher               1       190            108
StreamStage                       0         0              0
FlushWriter                       0         0            302
FILEUTILS-DELETE-POOL             0         0             26
MiscStage                         0         0              0
FlushSorter                       0         0              0
InternalResponseStage             0         0              0
HintedHandoff                     1         4              7


Anyone with nice theories about the pending value on the memtable post
flusher?

Regards,
Terje


Re: memtablePostFlusher blocking writes?

2011-04-27 Thread Terje Marthinussen
It is a good question what the problem here is.
I don't think it is the pending mutations and flushes; the real problem is
what causes them, and it is not me!

There was maybe a misleading comment in my original mail.
It is not the hinted handoffs sent from this node that are the problem, but
the 1.6 million hints for this node that are being sent to it from another
(neighbour) node.

A more compact view of what happens when the other node starts playing back
the hints. There are 2 seconds between the lines:
MutationStage   100    615   2008362
MutationStage   100   1202   2054437
MutationStage   100   1394   2104971
MutationStage   100   9318   2142964
MutationStage   100  46392   2142964
MutationStage   100  83188   2142964
MutationStage   100 113156   2142964
MutationStage   100 149063   2142964
MutationStage   100 187514   2142964
MutationStage   100 226238   2142964
MutationStage   100 146267   2264194
MutationStage   100 141232   2314345
MutationStage   100 129730   2366987
MutationStage   100 128580   2412014
MutationStage   100 124101   2460093
MutationStage   100 119032   2509960
MutationStage   100 126888   2538537
MutationStage   100 163049   2538537
MutationStage   100 197243   2538537
MutationStage   100 231564   2538537
MutationStage   100  95212   2675457
MutationStage   100  43066   2727606
MutationStage    26    127   2779756
MutationStage   100   1115   2822694
MutationStage22222873449

I have two issues here:
- The massive amount of mutations caused by the hints playback
- The long periods where there is no increase in completed mutations.

Yes, there are other things going on here when this happens; the system is
actually fairly busy, but has no problem handling the other traffic. The
RoundRobin scheduler is also activated with 100 as the throttle limit on a
12-node cluster, so there is no way at all that you should see 30k mutations
building up on the pending side from the feeding side. This seems completely
triggered by the hints coming in.

I tried reducing PAGE_SIZE to 5000 and setting the throttle limit all the way
up to 5000ms, but it does not really seem to help much. It just takes longer,
and it seems to be memtable flushing that blocks things.

Terje



On Thu, Apr 28, 2011 at 3:09 AM, Jonathan Ellis  wrote:

> MPF is indeed pretty lightweight, but since its job is to mark the
> commitlog replay position after a flush -- which has to be done in
> flush order to preserve correctness in failure scenarios -- you'll see
> the pending op count go up when you have multiple flushes happening.
> This is expected.
>
> Your real problem is the 17000 pending mutations, the 22 active +
> pending flushes, and probably compaction activity as well.
>
> (Also, if each of those pending mutations is 10,000 columns, you may
> be causing yourself memory pressure as well.)
>
> On Wed, Apr 27, 2011 at 11:01 AM, Terje Marthinussen
>  wrote:
> > 0.8 trunk:
> >
> > When playing back a fairly large chunk of hints, things basically locks
> up
> > under load.
> > The hints are never processed successfully. Lots of Mutations dropped.
> >
> > One thing is that maybe the default 10k columns per send with 50ms delays
> is
> > a bit on the aggressive side (10k*20 =200.000 columns in a second?), the
> > other thing is that it seems like the whole memtable flushing locks up.
> >
> > I tried to increase number of memtable flushers and queue a bit (8
> > concurrent flushers) to make things work, but no luck.
> >
> > Pool NameActive   Pending  Completed
> > ReadStage 0 0  1
> > RequestResponseStage  0 02236304
> > MutationStage   100 175644011533
> > ReadRepairStage   0 0  0
> > ReplicateOnWriteStage 0 0  0
> > GossipStage   0 0   2281
> > AntiEntropyStage  0 0  0
> > MigrationStage0 0  0
> > MemtablePostFlusher   113 50
> > StreamStage   0 0 

memtablePostFlusher blocking writes?

2011-04-27 Thread Terje Marthinussen
0.8 trunk:

When playing back a fairly large chunk of hints, things basically lock up
under load.
The hints are never processed successfully, and lots of mutations are dropped.

One thing is that maybe the default of 10k columns per send with 50ms delays
is a bit on the aggressive side (10k * 20 = 200,000 columns per second?); the
other is that it seems like the whole memtable flushing locks up.
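
For reference, a quick sanity check of that rate (the 20 pages per second
follows from the 50ms delay between sends):

# Sanity check of the hint playback rate mentioned above: a page of 10,000
# columns sent every 50 ms works out to about 20 pages per second.
page_size_columns = 10_000
delay_between_pages = 0.050                         # seconds

pages_per_second = 1 / delay_between_pages          # 20.0
columns_per_second = page_size_columns * pages_per_second
print(columns_per_second)                           # 200000.0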

I tried increasing the number of memtable flushers and the queue a bit (8
concurrent flushers) to make things work, but no luck.

Pool Name                    Active   Pending      Completed
ReadStage                         0         0              1
RequestResponseStage              0         0        2236304
MutationStage                   100     17564        4011533
ReadRepairStage                   0         0              0
ReplicateOnWriteStage             0         0              0
GossipStage                       0         0           2281
AntiEntropyStage                  0         0              0
MigrationStage                    0         0              0
MemtablePostFlusher               1        13             50
StreamStage                       0         0              0
FlushWriter                       8        14             73
MiscStage                         0         0              0
FlushSorter                       0         0              0
InternalResponseStage             0         0              0
HintedHandoff                     1         8              3

A quick source code scan makes me believe that the MemtablePostFlusher
should not normally use a lot of time, but it seems like it does here.
What may cause this?

Terje


Re: multithreaded compaction

2011-04-26 Thread Terje Marthinussen
To be honest, this started after feeding data to cassandra for a while with
compaction disabled (sort of a test case).

When I enabled it... boom... a spectacular process with 2000% CPU usage
(please note: there is compression in cassandra in this system).

This system actually has SSDs, so when throttled a bit the I/O is really
not a problem, but I doubt that an HDD-based system would have managed to
keep up.

I agree, this is hopefully something that does not normally happen, but then
again, some protection against Murphy's law is always good.

Thanks!
Terje

On Tue, Apr 26, 2011 at 4:35 PM, Sylvain Lebresne wrote:

> On Tue, Apr 26, 2011 at 9:01 AM, Terje Marthinussen
>  wrote:
> > Hi,
> > I was testing the multithreaded compactions and with 2x6 cores (24 with
> HT)
> > it does seem a bit crazy with 24 compactions running concurrently.
> > It is probably not very good in terms of random I/O.
>
> It does seems a bit overkill. However, I'm slightly curious how you
> ended up with 24 parallel
> compactions, more precisely, how did you end up with enough sstables
> to trigger 24
> compactions ? Was that done on purpose for testing sake, or did you
> saw that in a real
> situation ?
>
> I'm asking because in 'real' situation, given that compaction are
> triggered only if there is
> some number of files to compact, and provided the cluster is correctly
> provisioned, I wouldn't
> expect the number of parallel compaction to jump to such numbers (one
> of the goal of
> multi_treaded compaction was to make sure we never end up accumulating
> lots of un-compacted
> sstables). Anyway, I get your point, just wondering if that was a real
> situation.
>
> > As such, I think I agree with the argument in 2191 that there should be a
> > config option for this.
> > Probably a default that is dynamic with 1 thread per column family +2 or
> 3
> > thread for parallel compactions outside of that could be good.
> > Any other opinions?
>
> Multi-threaded compaction is optional and compaction throttling is
> supposed to mitigage
> it. However I do agree that too much many compactions may be a bad use
> of resources
> because of random IO even if correctly throttled. I think it's missing
> a configuration option
> "concurrent_compactions" like there is a "concurrent_writes|reads".
> For that, I have created
>  https://issues.apache.org/jira/browse/CASSANDRA-2558
>
> > I guess the compaction thread pool should also show up in tpstats?
>
> Yes it should ... and it will ... eventually :)
>
> Thanks for the feedback.
>
> --
> Sylvain
>


multithreaded compaction

2011-04-26 Thread Terje Marthinussen
Hi,

I was testing multithreaded compaction, and with 2x6 cores (24 with HT)
it does seem a bit crazy to have 24 compactions running concurrently.
It is probably not very good in terms of random I/O.

As such, I think I agree with the argument in 2191 that there should be a
config option for this.
A dynamic default of 1 thread per column family plus 2 or 3 threads for
parallel compactions outside of that could be good.
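
A tiny sketch of what such a dynamic default might look like; the setting name
and formula are hypothetical, just to make the suggestion concrete:

# Hypothetical sketch of a "concurrent_compactions" default as suggested
# above: one thread per column family plus a couple of extra threads,
# capped at the number of available cores.
import os

def default_concurrent_compactions(num_column_families, extra=2):
    cores = os.cpu_count() or 1
    return min(num_column_families + extra, cores)

print(default_concurrent_compactions(4))   # e.g. 6 on a machine with >= 6 cores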

Any other opinions?

I guess the compaction thread pool should also show up in tpstats?

Regards,
Terje


Re: 0.7.4 Bad sstables?

2011-04-25 Thread Terje Marthinussen
The first column in the row has an offset of 190226525 in the file; the last
valid column is at 380293592, about 181MB from the first column to the last.

in_memory_compaction_limit was 128MB, so almost certainly above the limit.

Terje

On Tue, Apr 26, 2011 at 8:53 AM, Terje Marthinussen  wrote:

> In my case, probably yes. From thw rows I have looked at, I think I have
> only seen this on rows with 1 million plus columns/supercolumns.
>
> May very well been larger than in memory limit. I think the compacted row I
> looked closer at was about 200MB and the in memory limit may have been
> 256MB.
>
> I will see if we still got files around to verify.
>
> Regards,
> Terje
>
> On 26 Apr 2011, at 02:08, Jonathan Ellis  wrote:
>
> > Was it on a "large" row?  (> in_memory_compaction_limit?)
> >
> > I'm starting to suspect that LazilyCompactedRow is computing row size
> > incorrectly in some cases.
> >
> > On Mon, Apr 25, 2011 at 11:47 AM, Terje Marthinussen
> >  wrote:
> >> I have been hunting similar looking corruptions, especially in the hints
> >> column family, but I believe it occurs somewhere while  compacting.
> >> I looked in greater detail on one sstable and the row length was longer
> than
> >> the actual data in the row, and as far as I could see, either the length
> was
> >> wrong or the row was missing data as there was was no extra data in the
> row
> >> after the last column.
> >> This was however on a somewhat aging dataset, so suspected it could be
> >> related to 2376.
> >>
> >> Playing around with 0.8 at the moment and not seen it there yet (bet
> it
> >> will show up tomorrow once I wrote that.. :))
> >> Terje
> >>
> >> On Tue, Apr 26, 2011 at 12:44 AM, Sanjeev Kulkarni <
> sanj...@locomatix.com>
> >> wrote:
> >>>
> >>> Hi Sylvain,
> >>> I started it from 0.7.4 with the patch 2376. No upgrade.
> >>> Thanks!
> >>>
> >>> On Mon, Apr 25, 2011 at 7:48 AM, Sylvain Lebresne <
> sylv...@datastax.com>
> >>> wrote:
> >>>>
> >>>> Hi Sanjeev,
> >>>>
> >>>> What's the story of the cluster ? Did you started with 0.7.4, or is it
> >>>> upgraded from
> >>>> some earlier version ?
> >>>>
> >>>> On Mon, Apr 25, 2011 at 5:54 AM, Sanjeev Kulkarni <
> sanj...@locomatix.com>
> >>>> wrote:
> >>>>> Hey guys,
> >>>>> Running a one node cassandra server with version 0.7.4 patched
> >>>>> with https://issues.apache.org/jira/browse/CASSANDRA-2376
> >>>>> The system was running fine for a couple of days when we started
> >>>>> noticing
> >>>>> something strange with cassandra. I stopped all applications and
> >>>>> restarted
> >>>>> cassandra. And then did a scrub. During scrub, I noticed these in the
> >>>>> logs
> >>>>> WARN [CompactionExecutor:1] 2011-04-24 23:37:07,561
> >>>>> CompactionManager.java
> >>>>> (line 607) Non-fatal error reading row (stacktrace follows)
> >>>>> java.io.IOError: java.io.IOException: Impossible row size
> >>>>> 1516029079813320210
> >>>>> at
> >>>>>
> >>>>>
> org.apache.cassandra.db.CompactionManager.doScrub(CompactionManager.java:589)
> >>>>> at
> >>>>>
> >>>>>
> org.apache.cassandra.db.CompactionManager.access$600(CompactionManager.java:56)
> >>>>>at
> >>>>>
> >>>>>
> org.apache.cassandra.db.CompactionManager$3.call(CompactionManager.java:195)
> >>>>> at
> >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>at
> >>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>> at
> >>>>>
> >>>>>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>at
> >>>>>
> >>>>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>at java.lang.Thread.run(Thread.java:662)
> >>>>> Caused by: java.io.IOException: Impossible row size
> 1516029079813320210
> >>>>>... 8 more
> >>>&

Re: 0.7.4 Bad sstables?

2011-04-25 Thread Terje Marthinussen
In my case, probably yes. From the rows I have looked at, I think I have only 
seen this on rows with 1 million plus columns/supercolumns.

It may very well have been larger than the in-memory limit. I think the compacted 
row I looked closer at was about 200MB and the in-memory limit may have been 256MB.

I will see if we still got files around to verify.

Regards,
Terje

On 26 Apr 2011, at 02:08, Jonathan Ellis  wrote:

> Was it on a "large" row?  (> in_memory_compaction_limit?)
> 
> I'm starting to suspect that LazilyCompactedRow is computing row size
> incorrectly in some cases.
> 
> On Mon, Apr 25, 2011 at 11:47 AM, Terje Marthinussen
>  wrote:
>> I have been hunting similar looking corruptions, especially in the hints
>> column family, but I believe it occurs somewhere while  compacting.
>> I looked in greater detail on one sstable and the row length was longer than
>> the actual data in the row, and as far as I could see, either the length was
>> wrong or the row was missing data as there was was no extra data in the row
>> after the last column.
>> This was however on a somewhat aging dataset, so suspected it could be
>> related to 2376.
>> 
>> Playing around with 0.8 at the moment and not seen it there yet (bet it
>> will show up tomorrow once I wrote that.. :))
>> Terje
>> 
>> On Tue, Apr 26, 2011 at 12:44 AM, Sanjeev Kulkarni 
>> wrote:
>>> 
>>> Hi Sylvain,
>>> I started it from 0.7.4 with the patch 2376. No upgrade.
>>> Thanks!
>>> 
>>> On Mon, Apr 25, 2011 at 7:48 AM, Sylvain Lebresne 
>>> wrote:
>>>> 
>>>> Hi Sanjeev,
>>>> 
>>>> What's the story of the cluster ? Did you started with 0.7.4, or is it
>>>> upgraded from
>>>> some earlier version ?
>>>> 
>>>> On Mon, Apr 25, 2011 at 5:54 AM, Sanjeev Kulkarni 
>>>> wrote:
>>>>> Hey guys,
>>>>> Running a one node cassandra server with version 0.7.4 patched
>>>>> with https://issues.apache.org/jira/browse/CASSANDRA-2376
>>>>> The system was running fine for a couple of days when we started
>>>>> noticing
>>>>> something strange with cassandra. I stopped all applications and
>>>>> restarted
>>>>> cassandra. And then did a scrub. During scrub, I noticed these in the
>>>>> logs
>>>>> WARN [CompactionExecutor:1] 2011-04-24 23:37:07,561
>>>>> CompactionManager.java
>>>>> (line 607) Non-fatal error reading row (stacktrace follows)
>>>>> java.io.IOError: java.io.IOException: Impossible row size
>>>>> 1516029079813320210
>>>>> at
>>>>> 
>>>>> org.apache.cassandra.db.CompactionManager.doScrub(CompactionManager.java:589)
>>>>> at
>>>>> 
>>>>> org.apache.cassandra.db.CompactionManager.access$600(CompactionManager.java:56)
>>>>>at
>>>>> 
>>>>> org.apache.cassandra.db.CompactionManager$3.call(CompactionManager.java:195)
>>>>> at
>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>at
>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>> at
>>>>> 
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>at
>>>>> 
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>at java.lang.Thread.run(Thread.java:662)
>>>>> Caused by: java.io.IOException: Impossible row size 1516029079813320210
>>>>>... 8 more
>>>>>  INFO [CompactionExecutor:1] 2011-04-24 23:37:07,640
>>>>> CompactionManager.java
>>>>> (line 613) Retrying from row index; data is -1768177699 bytes starting
>>>>> at
>>>>> 2626524914
>>>>>  WARN [CompactionExecutor:1] 2011-04-24 23:37:07,641
>>>>> CompactionManager.java
>>>>> (line 633) Retry failed too.  Skipping to next row (retry's stacktrace
>>>>> follows)
>>>>> java.io.IOError: java.io.EOFException: bloom filter claims to be
>>>>> 1868982636
>>>>> bytes, longer than entire row size -1768177699at
>>>>> 
>>>>> org.apache.cassandra.io.sstable.SSTableIdentityIterator.(SSTableIdentityIterator.java:117)
>>>>>   

Re: 0.8 losing nodes?

2011-04-25 Thread Terje Marthinussen
Got just enough time to look at this today to verify that:

Sometimes nodes (under pressure) fail to send heartbeats for long
enough to get marked as dead by other nodes (why is a good question,
which I need to check further; it does not seem to be GC).

The node does, however, start sending heartbeats again, and other nodes
log that they receive the heartbeats, but this will not get it marked
as UP again until it is restarted.

So, seems like 2 issues:
- Nodes pausing (may be just node overload)
- Nodes are not marked as UP unless restarted

Regards,
Terje

On 24 Apr 2011, at 23:24, Terje Marthinussen  wrote:

> World as seen from .81 in the below ring
> .81 Up Normal  85.55 GB8.33%   Token(bytes[30])
> .82 Down   Normal  83.23 GB8.33%   Token(bytes[313230])
> .83 Up Normal  70.43 GB8.33%   Token(bytes[313437])
> .84 Up Normal  81.7 GB 8.33%   Token(bytes[313836])
> .85 Up Normal  108.39 GB   8.33%   Token(bytes[323336])
> .86 Up Normal  126.19 GB   8.33%   Token(bytes[333234])
> .87 Up Normal  127.16 GB   8.33%   Token(bytes[333939])
> .88 Up Normal  135.92 GB   8.33%   Token(bytes[343739])
> .89 Up Normal  117.1 GB8.33%   Token(bytes[353730])
> .90 Up Normal  101.67 GB   8.33%   Token(bytes[363635])
> .91 Down   Normal  88.33 GB8.33%   Token(bytes[383036])
> .92 Up Normal  129.95 GB   8.33%   Token(bytes[6a])
>
>
> From .82
> .81 Down   Normal  85.55 GB8.33%   Token(bytes[30])
> .82 Up Normal  83.23 GB8.33%   Token(bytes[313230])
> .83 Up Normal  70.43 GB8.33%   Token(bytes[313437])
> .84 Up Normal  81.7 GB 8.33%   Token(bytes[313836])
> .85 Up Normal  108.39 GB   8.33%   Token(bytes[323336])
> .86 Up Normal  126.19 GB   8.33%   Token(bytes[333234])
> .87 Up Normal  127.16 GB   8.33%   Token(bytes[333939])
> .88 Up Normal  135.92 GB   8.33%   Token(bytes[343739])
> .89 Up Normal  117.1 GB8.33%   Token(bytes[353730])
> .90 Up Normal  101.67 GB   8.33%   Token(bytes[363635])
> .91 Down   Normal  88.33 GB8.33%   Token(bytes[383036])
> .92 Up Normal  129.95 GB   8.33%   Token(bytes[6a])
>
> From .84
> 10.10.42.81 Down   Normal  85.55 GB8.33%   Token(bytes[30])
> 10.10.42.82 Down   Normal  83.23 GB8.33%   Token(bytes[313230])
> 10.10.42.83 Up Normal  70.43 GB8.33%   Token(bytes[313437])
> 10.10.42.84 Up Normal  81.7 GB 8.33%   Token(bytes[313836])
> 10.10.42.85 Up Normal  108.39 GB   8.33%   Token(bytes[323336])
> 10.10.42.86 Up Normal  126.19 GB   8.33%   Token(bytes[333234])
> 10.10.42.87 Up Normal  127.16 GB   8.33%   Token(bytes[333939])
> 10.10.42.88 Up Normal  135.92 GB   8.33%   Token(bytes[343739])
> 10.10.42.89 Up Normal  117.1 GB8.33%   Token(bytes[353730])
> 10.10.42.90 Up Normal  101.67 GB   8.33%   Token(bytes[363635])
> 10.10.42.91 Down   Normal  88.33 GB8.33%   Token(bytes[383036])
> 10.10.42.92 Up Normal  129.95 GB   8.33%   Token(bytes[6a])
>
> All of the nodes seems to be working when looked at individually and I can 
> see on for instance .84 that
>  INFO [ScheduledTasks:1] 2011-04-24 04:51:53,164 Gossiper.java (line 611) 
> InetAddress /.81 is now dead.
>
> but there is no other messages related to the nodes "dissappearing"  as far 
> as I can see in the 18 hours since that message occured.
>
> Restarting seems to recover things, but nodes seems to go away again (0.8 
> also seem to be prone to commit logs being unreadable in some cases?)
>
> This is 0.8 build from trunk last Friday.
>
> I will try to enable some more debugging tomorrow to see if there is 
> something interesting, just curious if anyone else had noticed something like 
> this.
>
> Regards,
> Terje
>
>


Re: 0.7.4 Bad sstables?

2011-04-25 Thread Terje Marthinussen
I have been hunting similar-looking corruptions, especially in the hints
column family, but I believe they occur somewhere while compacting.

I looked in greater detail at one sstable, and the row length was longer than
the actual data in the row; as far as I could see, either the length was wrong
or the row was missing data, as there was no extra data in the row after the
last column.

This was however on a somewhat aging dataset, so suspected it could be
related to 2376.

Playing around with 0.8 at the moment and have not seen it there yet (bet it
will show up tomorrow now that I have written that... :))

Terje

On Tue, Apr 26, 2011 at 12:44 AM, Sanjeev Kulkarni wrote:

> Hi Sylvain,
> I started it from 0.7.4 with the patch 2376. No upgrade.
> Thanks!
>
>
> On Mon, Apr 25, 2011 at 7:48 AM, Sylvain Lebresne wrote:
>
>> Hi Sanjeev,
>>
>> What's the story of the cluster ? Did you started with 0.7.4, or is it
>> upgraded from
>> some earlier version ?
>>
>> On Mon, Apr 25, 2011 at 5:54 AM, Sanjeev Kulkarni 
>> wrote:
>> > Hey guys,
>> > Running a one node cassandra server with version 0.7.4 patched
>> > with https://issues.apache.org/jira/browse/CASSANDRA-2376
>> > The system was running fine for a couple of days when we started
>> noticing
>> > something strange with cassandra. I stopped all applications and
>> restarted
>> > cassandra. And then did a scrub. During scrub, I noticed these in the
>> logs
>> > WARN [CompactionExecutor:1] 2011-04-24 23:37:07,561
>> CompactionManager.java
>> > (line 607) Non-fatal error reading row (stacktrace follows)
>> > java.io.IOError: java.io.IOException: Impossible row size
>> > 1516029079813320210
>> > at
>> >
>> org.apache.cassandra.db.CompactionManager.doScrub(CompactionManager.java:589)
>> > at
>> >
>> org.apache.cassandra.db.CompactionManager.access$600(CompactionManager.java:56)
>> >at
>> >
>> org.apache.cassandra.db.CompactionManager$3.call(CompactionManager.java:195)
>> > at
>> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>  at
>> > java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> > at
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >at
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >at java.lang.Thread.run(Thread.java:662)
>> > Caused by: java.io.IOException: Impossible row size 1516029079813320210
>> >... 8 more
>> >  INFO [CompactionExecutor:1] 2011-04-24 23:37:07,640
>> CompactionManager.java
>> > (line 613) Retrying from row index; data is -1768177699 bytes starting
>> at
>> > 2626524914
>> >  WARN [CompactionExecutor:1] 2011-04-24 23:37:07,641
>> CompactionManager.java
>> > (line 633) Retry failed too.  Skipping to next row (retry's stacktrace
>> > follows)
>> > java.io.IOError: java.io.EOFException: bloom filter claims to be
>> 1868982636
>> > bytes, longer than entire row size -1768177699at
>> >
>> org.apache.cassandra.io.sstable.SSTableIdentityIterator.(SSTableIdentityIterator.java:117)
>> > at
>> >
>> org.apache.cassandra.db.CompactionManager.doScrub(CompactionManager.java:618)
>> >at
>> >
>> org.apache.cassandra.db.CompactionManager.access$600(CompactionManager.java:56)
>> > at
>> >
>> org.apache.cassandra.db.CompactionManager$3.call(CompactionManager.java:195)
>> >at
>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> > at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >  at
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> > at
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >at java.lang.Thread.run(Thread.java:662)
>> > Caused by: java.io.EOFException: bloom filter claims to be 1868982636
>> bytes,
>> > longer than entire row size -1768177699at
>> >
>> org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:116)
>> > at
>> >
>> org.apache.cassandra.io.sstable.SSTableIdentityIterator.(SSTableIdentityIterator.java:87)
>> >... 8 more
>> > WARN [CompactionExecutor:1] 2011-04-24 23:37:16,545
>> CompactionManager.java
>> > (line 607) Non-fatal error reading row (stacktrace follows)
>> > java.io.IOError: java.io.EOFException
>> > at
>> >
>> org.apache.cassandra.io.sstable.SSTableIdentityIterator.next(SSTableIdentityIterator.java:144)
>> > at
>> >
>> org.apache.cassandra.io.sstable.SSTableIdentityIterator.next(SSTableIdentityIterator.java:40)
>> > at
>> >
>> org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIterator.java:284)
>> > at
>> >
>> org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIterator.java:326)
>> > at
>> >
>> org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIterator.java:230)
>> > at
>> >
>> org.apache.cassandra.utils.R

0.8 losing nodes?

2011-04-24 Thread Terje Marthinussen
World as seen from .81 in the below ring
.81  Up     Normal  85.55 GB    8.33%   Token(bytes[30])
.82  Down   Normal  83.23 GB    8.33%   Token(bytes[313230])
.83  Up     Normal  70.43 GB    8.33%   Token(bytes[313437])
.84  Up     Normal  81.7 GB     8.33%   Token(bytes[313836])
.85  Up     Normal  108.39 GB   8.33%   Token(bytes[323336])
.86  Up     Normal  126.19 GB   8.33%   Token(bytes[333234])
.87  Up     Normal  127.16 GB   8.33%   Token(bytes[333939])
.88  Up     Normal  135.92 GB   8.33%   Token(bytes[343739])
.89  Up     Normal  117.1 GB    8.33%   Token(bytes[353730])
.90  Up     Normal  101.67 GB   8.33%   Token(bytes[363635])
.91  Down   Normal  88.33 GB    8.33%   Token(bytes[383036])
.92  Up     Normal  129.95 GB   8.33%   Token(bytes[6a])


From .82
.81  Down   Normal  85.55 GB    8.33%   Token(bytes[30])
.82  Up     Normal  83.23 GB    8.33%   Token(bytes[313230])
.83  Up     Normal  70.43 GB    8.33%   Token(bytes[313437])
.84  Up     Normal  81.7 GB     8.33%   Token(bytes[313836])
.85  Up     Normal  108.39 GB   8.33%   Token(bytes[323336])
.86  Up     Normal  126.19 GB   8.33%   Token(bytes[333234])
.87  Up     Normal  127.16 GB   8.33%   Token(bytes[333939])
.88  Up     Normal  135.92 GB   8.33%   Token(bytes[343739])
.89  Up     Normal  117.1 GB    8.33%   Token(bytes[353730])
.90  Up     Normal  101.67 GB   8.33%   Token(bytes[363635])
.91  Down   Normal  88.33 GB    8.33%   Token(bytes[383036])
.92  Up     Normal  129.95 GB   8.33%   Token(bytes[6a])

From .84
10.10.42.81  Down   Normal  85.55 GB    8.33%   Token(bytes[30])
10.10.42.82  Down   Normal  83.23 GB    8.33%   Token(bytes[313230])
10.10.42.83  Up     Normal  70.43 GB    8.33%   Token(bytes[313437])
10.10.42.84  Up     Normal  81.7 GB     8.33%   Token(bytes[313836])
10.10.42.85  Up     Normal  108.39 GB   8.33%   Token(bytes[323336])
10.10.42.86  Up     Normal  126.19 GB   8.33%   Token(bytes[333234])
10.10.42.87  Up     Normal  127.16 GB   8.33%   Token(bytes[333939])
10.10.42.88  Up     Normal  135.92 GB   8.33%   Token(bytes[343739])
10.10.42.89  Up     Normal  117.1 GB    8.33%   Token(bytes[353730])
10.10.42.90  Up     Normal  101.67 GB   8.33%   Token(bytes[363635])
10.10.42.91  Down   Normal  88.33 GB    8.33%   Token(bytes[383036])
10.10.42.92  Up     Normal  129.95 GB   8.33%   Token(bytes[6a])

All of the nodes seem to be working when looked at individually, and I can
see on, for instance, .84 that
 INFO [ScheduledTasks:1] 2011-04-24 04:51:53,164 Gossiper.java (line 611)
InetAddress /.81 is now dead.

but there are no other messages related to the nodes "disappearing" as far
as I can see in the 18 hours since that message occurred.

Restarting seems to recover things, but nodes seem to go away again (0.8
also seems to be prone to commit logs being unreadable in some cases?).

This is 0.8 build from trunk last Friday.

I will try to enable some more debugging tomorrow to see if there is
anything interesting; just curious if anyone else has noticed something
like this.

Regards,
Terje


multithreaded compaction causes mutation storms?

2011-04-24 Thread Terje Marthinussen
Tested out multithreaded compaction in 0.8 last night.

We had first fed some data with compaction disabled, so there were 1000+
sstables on the nodes, and I decided to enable multithreaded compaction on
one of them to see how it performed vs. nodes that had no compaction at all.

Since this was sort of to see how it could perform, I set the throughput to
128MB/sec (knowing that this was probably a bit more than it could manage).
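
(For reference, the knobs involved are roughly the 0.8 cassandra.yaml settings
below; I am quoting the option names from memory of the 0.8 config, and the
values are just what was used for this test, not recommendations.)

# cassandra.yaml (0.8)
multithreaded_compaction: true
compaction_throughput_mb_per_sec: 128    # later reduced to 6, see below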

It quickly generated 24 tmp files for the main CF (24 compaction threads?),
the CPUs got maxed out at 90% (2x6 cores), and I started seeing these:

 INFO [FlushWriter:1] 2011-04-24 03:07:46,776 Memtable.java (line 238)
Writing Memtable-Test@757483679(23549136/385697094 serialized/live bytes,
32026 ops)
 WARN [ScheduledTasks:1] 2011-04-24 03:07:46,946 MessagingService.java (line
548) Dropped 36506 MUTATION messages in the last 5000ms
 INFO [ScheduledTasks:1] 2011-04-24 03:07:46,947 StatusLogger.java (line 50)
Pool NameActive   Pending
 INFO [ScheduledTasks:1] 2011-04-24 03:07:46,947 StatusLogger.java (line 65)
ReadStage 0 0
 INFO [ScheduledTasks:1] 2011-04-24 03:07:46,948 StatusLogger.java (line 65)
RequestResponseStage  0 3
 INFO [ScheduledTasks:1] 2011-04-24 03:07:46,948 StatusLogger.java (line 65)
ReadRepairStage   0 0
 INFO [ScheduledTasks:1] 2011-04-24 03:07:46,948 StatusLogger.java (line 65)
MutationStage10 39549
 INFO [ScheduledTasks:1] 2011-04-24 03:07:46,949 StatusLogger.java (line 65)
ReplicateOnWriteStage 0 0

That the system is a bit overloaded is not really the question (I wanted to
find out what it could manage), but the curious part is that when checking
tpstats, the Mutation stage was mostly idle; however, at seemingly regular
intervals it would get massive amounts of mutations.

Not sure if it could be related, but the log message always showed up just
before the "StatusLogger" printout (but not necessarily before all of them)

Is there some sort of internal event occurring that causes these mutation
storms, or something which ends up synchronizing the compaction threads in a
way that causes mutation storms like these?

The messages went away a little while after reducing the throughput
significantly to 6MB/sec...

It does not seem to be a problem normally, just when doing something extreme
like enabling multithreaded compaction when you already have hundreds or
thousands of sstables.

Regards,
Terje


Re: Compacting single file forever

2011-04-22 Thread Terje Marthinussen
I think the really interesting part is how this node ended up in this state
in the first place.

There should be somewhere in the area of 340-500GB of data on it when
everything is 100% compacted.
The problem now is that it used (we wiped it last night to test some 0.8
stuff) more than 1TB.

To me, it seems like there are some nasty potential worst cases you can get.

Let's say you are in a fine spot. You have a 1TB disk.
All data is compacted into one sstable and your data uses 300GB.

Now you issue a repair, and disk usage starts increasing, while at the same
time some events update a fairly large amount of non-overlapping data for
this node or the 2 nodes it holds replicated data for. You end up with large
sstables of similar size, but if you try to compact them, you essentially
end up with almost a full dataset.

That is, you end up in a situation where the only sstables it tries to merge
are:
sstable1 which has keys 1,2,3
sstable2 which has keys 4,5,6

That is, in a worst case, we have:
- We need 300GB for the original compacted sstable
- An unknown amount of data from the repair
- An unknown amount of duplicates in smaller sstables
- We then in worst case need space for the sum of the 2 sstables it tries to
merge (which are in the 170GB region each in this case)

I think something like this has happened here, and it has eventually ended up
in a situation where it cannot recover even though the total disk space is
somewhere around 4-6 times the optimally compacted data size.

This is something of a worst case scenario, but it ends up in a situation
that is unrecoverable, which is not good.

The only way I can think of to avoid this is to segment the sstables based on
key range, so you never get sstables that require up to 50% of the disk to
compact, and you have a higher probability that compacted sstables contain
the same keys.

Maybe split this into directories named after token ranges, or just make the
token range a prefix of the sstable file name, so very little overhead is
added to look up data.

Something like
MyCF_00-08_Data.db
MyCF_08-ff_Data.db

where 00-08 is the token range of the keys in that sstable. These ranges
could change as compaction occurs, to keep things balanced and avoid any
single sstable getting very large.
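
As a rough sketch of the lookup side (purely hypothetical code, nothing like
this exists in Cassandra today; the file name format is just the example
above):

// Hypothetical helper: pick the sstables whose token-range prefix may
// contain a given key token. Names like "MyCF_00-08_Data.db" encode the
// (start, end] range of fixed-width hex tokens stored in the file.
import java.util.ArrayList;
import java.util.List;

public class RangePrefixedSSTables {
    static boolean mayContain(String fileName, String keyToken) {
        String[] parts = fileName.split("_");
        if (parts.length < 3) return true;        // not range-prefixed, must check it
        String[] range = parts[1].split("-");
        return keyToken.compareTo(range[0]) > 0   // lexicographic compare works for
            && keyToken.compareTo(range[1]) <= 0; // fixed-width hex tokens
    }

    static List<String> candidates(List<String> sstables, String keyToken) {
        List<String> hits = new ArrayList<String>();
        for (String name : sstables)
            if (mayContain(name, keyToken)) hits.add(name);
        return hits;
    }
}

Compaction would then only ever merge files sharing a prefix, so the temporary
space needed is bounded by the size of one range rather than by half the disk.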

Terje

On Fri, Apr 22, 2011 at 1:56 PM, Jonathan Ellis  wrote:

> I suggest as a workaround making the forceUserDefinedCompaction method
> ignore disk space estimates and attempt the requested compaction even
> if it guesses it will not have enough space. This would allow you to
> submit the 2 sstables you want manually.
>
> On Thu, Apr 21, 2011 at 8:34 AM, Shotaro Kamio 
> wrote:
> > Hi Aaron,
> >
> >
> > Maybe, my previous description was not good. It's not a compaction
> > threshold problem.
> > In fact, Cassandra tries to compact 7 sstables in the minor
> > compaction. But it decreases the number of sstables one by one due to
> > insufficient disk space. At the end, it compacts a single file as in
> > the new log below.
> >
> > Compactionstats on a node says:
> >
> >  compaction type: Minor
> >  column family: foobar
> >  bytes compacted: 133473101929
> >  bytes total in progress: 17743825
> >  pending tasks: 12
> >
> > The disk usage reaches 78%. It's really tough situation. But I guess
> > the data contains a lot of duplicates. because we feed same data again
> > and again and do repair.
> >
> >
> > Another thing I'm wondering is a file selection algorithm.
> > For example, one of disks has 235G free space. It contains sstables of
> > 61G, 159G, 191G, 196G, 197G. The one cassandra trying to compact
> > forever is 159G sstable. But there is smaller sstable. It should try
> > compacting 61G + 159G ideally.
> > A more intelligent algorithm is required to find optimal combination.
> > And if cassandra knows statistics about number of deleted data and old
> > data to be compacted for sstables, it should be useful to find more
> > efficient file combination.
> >
> >
> > Regards,
> > Shotaro
> >
> >
> >
> > * Minor compaction log
> > -
> >  WARN [CompactionExecutor:1] 2011-04-21 21:44:08,554
> > CompactionManager.java (line 405) insufficient space to compact all
> > requested files SSTableReader(path='foobar-f-773-Data.db'),
> > SSTableReader(path='foobar-f-1452-Data.db'),
> > SSTableReader(path='foobar-f-1620-Data.db'),
> > SSTableReader(path='foobar-f-1642-Data.db'),
> > SSTableReader(path='foobar-f-1643-Data.db'),
> > SSTableReader(path='foobar-f-1690-Data.db'),
> > SSTableReader(path='foobar-f-1814-Data.db')
> >  WARN [CompactionExecutor:1] 2011-04-21 21:44:28,565
> > CompactionManager.java (line 405) insufficient space to compact all
> > requested files SSTableReader(path='foobar-f-773-Data.db'),
> > SSTableReader(path='foobar-f-1452-Data.db'),
> > SSTableReader(path='foobar-f-1642-Data.db'),
> > SSTableReader(path='foobar-f-1643-Data.db'),
> > SSTableReader(path='foobar-f-1690-Data.db'),
> > SSTableReader(path='foobar-f-1814-Data.db')
> >  WARN [CompactionE

Re: Multi-DC Deployment

2011-04-20 Thread Terje Marthinussen
Sure, the update queue could just as well replicate problems, but the queue
would be a lot simpler than cassandra, and it would not modify already
acknowledged data the way, for instance, compaction or read-repair/hint
deliveries may. There is a fair bit of re-writing/re-assembling of data even
though it is actually never updated after the original data was acknowledged.
From a statistical viewpoint, there is clearly a noticeable risk of data
getting messed up internally in Cassandra during these operations.

Of course, in a perfectly replicated case, the same error is likely to occur
even in two isolated systems if they have the exact same data in the same
order, but let's add more fun things that replicate well, such as operator
mistakes. Yes, isolated systems increase maintenance work, and an increase in
work increases the risk of mistakes, but it reduces the risk that you make a
mistake that brings down everything.

Bottom line is, I am far from convinced about the benefits of one big
magical system across all datacenters vs. more isolated setups.
It is not a Cassandra specific concern though. It applies to any system.

Backups need to be there anyway of course, but if you have a system that is
somewhat working OK, but compactions have stopped due to a bad sstable, how
do you recover that from backup without taking down the service or going
through the interesting task of splitting up the cassandra setup, recovering
on a limited set of nodes and then re-joining again? (Maybe easier to do with
datacenter replication; I haven't actually thought about how to do it there.)

Yes, I realize that I could do 2 queries (or more) to get the behavior I
want, but doubling the number of queries when the system is already in
trouble is rarely a good idea.
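
For reference, the client side workaround I mean looks roughly like this (a
sketch against the 0.7 Thrift API, error handling trimmed, so treat it as
illustrative only):

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.*;

class FallbackReader {
    // Try LOCAL_QUORUM first; if not enough replicas answer, retry at ONE.
    // This is the "2 queries" approach: it doubles the request count exactly
    // when the cluster is already struggling.
    static List<ColumnOrSuperColumn> read(Cassandra.Client client, ByteBuffer key,
            ColumnParent parent, SlicePredicate predicate) throws Exception {
        try {
            return client.get_slice(key, parent, predicate, ConsistencyLevel.LOCAL_QUORUM);
        } catch (UnavailableException e) {
            return client.get_slice(key, parent, predicate, ConsistencyLevel.ONE);
        } catch (TimedOutException e) {
            return client.get_slice(key, parent, predicate, ConsistencyLevel.ONE);
        }
    }
}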

Terje


On Thu, Apr 21, 2011 at 10:49 AM, Adrian Cockcroft <
adrian.cockcr...@gmail.com> wrote:

> Queues replicate bad data just as well as anything else. The biggest
> source of bad data is broken app code... You will still need to
> implement a reconciliation/repair checker, as queues have their own
> failure modes when they get backed up. We have also looked at using
> queues to bounce data between cassandra clusters for other reasons,
> and they have their place. However it is a lot more work to implement
> than using existing well tested Cassandra functionality to do it for
> us.
>
> I think your code needs to retry a failed local-quorum read with a
> read-one to get the behavior you are asking for.
>
> Our approach to bad data and corruption issues is backups, wind back
> to the last good snapshot. We have figured out incremental backups as
> well as full. Our code has some local dependencies, but could be the
> basis for a generic solution.
>
> Adrian
>
> On Wed, Apr 20, 2011 at 6:08 PM, Terje Marthinussen
>  wrote:
> > Assuming that you generally put an API on top of this, delivering to two
> or
> > more systems then boils down to a message queue issue or some similar
> > mechanism which handles secure delivery of messages. Maybe not trivial,
> but
> > there are many products that can help you with this, and it is a lot
> easier
> > to implement than a fully distributed storage system.
> > Yes, ideally Cassandra will not distribute corruption, but the reason you
> > pay up to have 2 fully redundant setups in 2 different datacenters is
> > because we do not live in an ideal world. Anyone having tested Cassandra
> > since 0.7.0 with any real data will be able to testify how well it can
> mess
> > things up.
> > This is not specific to Cassandra, in fact, I would argue thats this is
> in
> > the blood of any distributed system. You want them to distribute after
> all
> > and the tighter the coupling is between nodes, the better they distribute
> > bad stuff as well as good stuff.
> > There is a bigger risk for a complete failure with 2 tightly coupled
> > redundant systems than with 2 almost completely isolated ones. The logic
> > here is so simple it is really somewhat beyond discussion.
> > There are a few other advantages of isolating the systems. Especially in
> > terms of operation, 2 isolated systems would be much easier as you could
> > relatively risk fee try out a new cassandra in one datacenter or upgrade
> one
> > datacenter at a time if you needed major operational changes such as
> schema
> > changes or other large changes to the data.
> > I see the 2 copies in one datacenters + 1(or maybe 2) in another as a
> "low
> > cost" middleway between 2 full N+2 (RF=3) systems in both data centers.
> > That is, in a traditional design where you need 1 node for normal
> service,
> > you would have 1 extra replicate for redundancy and one replica more (N+2
> > redundancy) so you can do maintenanc

Re: Multi-DC Deployment

2011-04-20 Thread Terje Marthinussen
Assuming that you generally put an API on top of this, delivering to two or
more systems then boils down to a message queue issue or some similar
mechanism which handles secure delivery of messages. Maybe not trivial, but
there are many products that can help you with this, and it is a lot easier
to implement than a fully distributed storage system.

Yes, ideally Cassandra will not distribute corruption, but the reason you
pay up to have 2 fully redundant setups in 2 different datacenters is
because we do not live in an ideal world. Anyone having tested Cassandra
since 0.7.0 with any real data will be able to testify how well it can mess
things up.

This is not specific to Cassandra; in fact, I would argue that this is in
the blood of any distributed system. You want them to distribute, after all,
and the tighter the coupling is between nodes, the better they distribute
bad stuff as well as good stuff.

There is a bigger risk of a complete failure with 2 tightly coupled
redundant systems than with 2 almost completely isolated ones. The logic
here is so simple it is really somewhat beyond discussion.

There are a few other advantages of isolating the systems. Especially in
terms of operation, 2 isolated systems would be much easier, as you could
relatively risk free try out a new cassandra in one datacenter, or upgrade
one datacenter at a time if you needed major operational changes such as
schema changes or other large changes to the data.

I see the 2 copies in one datacenter + 1 (or maybe 2) in another as a "low
cost" middle way between 2 full N+2 (RF=3) systems in both datacenters.

That is, in a traditional design where you need 1 node for normal service,
you would have 1 extra replica for redundancy and one more replica (N+2
redundancy) so you can do maintenance and still be redundant.

If I have redundancy across datacenters, I would probably still want 2
replicas to avoid network traffic between DCs in case of a node recovery,
but N+2 may not be needed, as my risk policy may find it acceptable to run
one datacenter without redundancy for a time-limited period for
maintenance.

That is, if my original requirement is 1 node, I could do with 3x the HW
which is not all that much more than the 3x I need for one DC and a lot less
than the 6x I need for 2 full N+2 systems.

However, all of the above is really beyond the point of my original
suggestion.

Regardless of datacenters, redundancy and distribution of bad or good stuff,
it would be good to have a way to return whatever data is there, but with a
flag or similar stating that the consistency level was not met.

Again, for a lot of services, it is fully acceptable, and a lot better, to
return an almost complete (or maybe even complete, but not verified by
quorum) result than no result at all.

As far as I remember from the code, this just boils down to returning
whatever you collected from the cluster and setting the proper flag or
similar on the resultset rather than returning an error.
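
Something as simple as this (hypothetical, not an existing API) would do on
the client-facing side:

// Hypothetical result type: hand back whatever the live replicas returned,
// plus a flag telling the caller whether the requested consistency level
// was actually met and how many replicas answered.
import java.util.List;

class PartialResult<T> {
    final List<T> rows;
    final boolean consistencyMet;     // false if fewer than CL replicas answered
    final int replicasResponded;      // e.g. 1 of the 2 needed for QUORUM

    PartialResult(List<T> rows, boolean consistencyMet, int replicasResponded) {
        this.rows = rows;
        this.consistencyMet = consistencyMet;
        this.replicasResponded = replicasResponded;
    }
}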

Terje

On Thu, Apr 21, 2011 at 5:01 AM, Adrian Cockcroft <
adrian.cockcr...@gmail.com> wrote:

> Hi Terje,
>
> If you feed data to two rings, you will get inconsistency drift as an
> update to one succeeds and to the other fails from time to time. You
> would have to build your own read repair. This all starts to look like
> "I don't trust Cassandra code to work, so I will write my own buggy
> one off versions of Cassandra functionality". I lean towards using
> Cassandra features rather than rolling my own because there is a large
> community testing, fixing and extending Cassandra, and making sure
> that the algorithms are robust. Distributed systems are very hard to
> get right, I trust lots of users and eyeballs on the code more than
> even the best engineer working alone.
>
> Cassandra doesn't "replicate sstable corruptions". It detects corrupt
> data and only replicates good data. Also data isn't replicated to
> three identical nodes in the way you imply, it's replicated around the
> ring. If you lose three nodes, you don't lose a whole node's worth of
> data.  We configure each replica to be in a different availability
> zone so that we can lose a third of our nodes (a whole zone) and still
> work. On a 300 node system with RF=3 and no zones, losing one or two
> nodes you still have all your data, and can repair the loss quickly.
> With three nodes dead at once you don't lose 1% of the data (3/300) I
> think you lose 1/(300*300*300) of the data (someone check my math?).
>
> If you want to always get a result, then you use "read one", if you
> want to get a highly available better quality result use local quorum.
> That is a per-query option.
>
> Adrian
>
> On Tue, Apr 19, 2011 at 6:46 PM, Terje Marthinussen
>  wrote:
> > If you have RF=3 in both datacenters, it could be discuss

Re: Multi-DC Deployment

2011-04-19 Thread Terje Marthinussen
If you have RF=3 in both datacenters, it could be discussed whether there is
a point in using the built-in replication in Cassandra at all vs. feeding the
data to both datacenters and getting 2 100% isolated cassandra instances that
cannot replicate sstable corruptions between each other.

My point is really a bit more general though.

For a lot of services (especially Internet-based ones), 100% accuracy in
terms of results is not needed (or maybe even expected).
While you want to serve a 100% correct result if you can (using quorum), it
is still much better to serve a partial result than no result at all.

Let's say you have 300 nodes in your ring, and one document manages to
trigger a bug in cassandra that brings down a node along with all its
replicas (3 nodes down).

For many use cases, it would be much better to return the remaining 99% of
the data coming from the 297 working nodes than having a service which
returns nothing at all.

I would, however, like the frontend to realize that this is an incomplete
result, so it can react accordingly, as well as feed into monitoring of the
cassandra ring.

Regards,
Terje


On Tue, Apr 19, 2011 at 6:06 PM, Adrian Cockcroft <
adrian.cockcr...@gmail.com> wrote:

> If you want to use local quorum for a distributed setup, it doesn't
> make sense to have less than RF=3 local and remote. Three copies at
> both ends will give you high availability. Only one copy of the data
> is sent over the wide area link (with recent versions).
>
> There is no need to use mirrored or RAID5 disk in each node in this
> case, since you are using RAIN (N for nodes) to protect your data. So
> the extra disk space to hold three copies at each end shouldn't be a
> big deal. Netflix is using striped internal disks on EC2 nodes for
> this.
>
> Adrian
>
> On Mon, Apr 18, 2011 at 11:16 PM, Terje Marthinussen
>  wrote:
> > Hum...
> > Seems like it could be an idea in a case like this with a mode where
> result
> > is always returned (if possible), but where a flay saying if the
> consistency
> > level was met, or to what level it was met (number of nodes answering for
> > instance).?
> > Terje
> >
> > On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis 
> wrote:
> >>
> >> They will timeout until failure detector realizes the DC1 nodes are
> >> down (~10 seconds). After that they will immediately return
> >> UnavailableException until DC1 comes back up.
> >>
> >> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
> >>  wrote:
> >> > We are planning to deploy Cassandra on two data centers.   Let us say
> >> > that
> >> > we went with three replicas with 2 being in one data center and last
> >> > replica
> >> > in 2nd Data center.
> >> >
> >> > What will happen to Quorum Reads and Writes when DC1 goes down (2 of 3
> >> > replicas are unreachable)?  Will they timeout?
> >> >
> >> >
> >> > Regards,
> >> > Baskar
> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of DataStax, the source for professional Cassandra support
> >> http://www.datastax.com
> >
> >
>


Re: Multi-DC Deployment

2011-04-18 Thread Terje Marthinussen
Hum...

Seems like it could be an idea in a case like this to have a mode where a
result is always returned (if possible), but with a flag saying whether the
consistency level was met, or to what level it was met (number of nodes
answering, for instance)?

Terje

On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis  wrote:

> They will timeout until failure detector realizes the DC1 nodes are
> down (~10 seconds). After that they will immediately return
> UnavailableException until DC1 comes back up.
>
> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
>  wrote:
> > We are planning to deploy Cassandra on two data centers.   Let us say
> that
> > we went with three replicas with 2 being in one data center and last
> replica
> > in 2nd Data center.
> >
> > What will happen to Quorum Reads and Writes when DC1 goes down (2 of 3
> > replicas are unreachable)?  Will they timeout?
> >
> >
> > Regards,
> > Baskar
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Re: raid 0 and ssd

2011-04-14 Thread Terje Marthinussen
Hm...

You should note that unless you have TRIM, which I don't think any OS
supports with any raid functionality yet, then once you have written to the
whole SSD once, it is always full!

That is, when you delete a file, you don't "clear" the blocks on the SSD, so
as far as the SSD is concerned, the data is still there.

The latest SSDs are pretty good at dealing with this though, and some can be
made a lot better by allocating extra spare block area for GC.

Also be careful with raids and things like scrubbing or initialization of
the Raid. This may very well "fill it 100%" :)

Terje

On Thu, Apr 14, 2011 at 2:02 PM, Drew Kutcharian  wrote:

> RAID 0 is the fastest, but you'll lose the whole array if you lose a drive.
> One thing to keep in mind is that SSDs get slower as they get filled up and
> closer to their capacity due to garbage collection.
>
> If you want more info on how SSDs perform in general, Percona guys have
> done extensive tests. (In addition to comparing all the raid levels and etc.
>
> http://www.percona.com/docs/wiki/benchmark:ssd:start
>
> http://www.mysqlperformanceblog.com/2009/05/01/raid-vs-ssd-vs-fusionio/
> (see the "RELATED SEARCHES" on the right side too)
>
> - Drew
>
>
> On Apr 13, 2011, at 9:42 PM, Anurag Gujral wrote:
>
> > Hi All,
> >We are using three ssd disks with cassandra 0.7.3 , should we
> set them as raid0 .What are the advantages and disadvantages of doing this.
> > Please advise.
> >
> > Thanks
> > Anurag
>
>


value of hinted handoff column not really empty...?

2011-04-13 Thread Terje Marthinussen
Hi,

We do see occasional row corruptions now and then, especially in hinted
handoffs.

This may be related to fairly long rows (millions of columns)

I was dumping one corrupted hint .db file and I noticed that they do in fact
have values.

The doc says
Subcolumn values are always empty; instead, we store the row data "normally"

The code does
add(path, ByteBufferUtil.EMPTY_BYTE_BUFFER,
System.currentTimeMillis(), cf.metadata().getGcGraceSeconds());

and if you run sstable2json you will see that columns have values like
 "4d8eb49d",

I guess an EMPTY_BYTE_BUFFER is not entirely an empty value. Not such a big
deal, but it may be that we are wasting 4 bytes per hint here?

Just a curiosity I thought I would mention before I forget it.

Terje


Re: Timeout during stress test

2011-04-11 Thread Terje Marthinussen
I notice you have pending hinted handoffs?

Look for errors related to that. We have seen occasional corruptions in the
hinted handoff sstables,

If you are stressing the system to its limits, you may also consider playing
more with the number of read/write threads (concurrent_reads/writes), as
well as rate limiting the number of requests you can get per node
(throttle limit).

We have seen similar issues when sending large numbers of requests to a
cluster (read/write threads running out, timeouts, nodes marked as down).

Terje


On Tue, Apr 12, 2011 at 9:56 AM, aaron morton wrote:

> It means the cluster is currently overloaded and unable to complete
> requests in time at the CL specified.
>
> Aaron
>
> On 12 Apr 2011, at 11:18, mcasandra wrote:
>
> > It looks like hector did retry on all the nodes and failed. Does this
> then
> > mean cassandra is down for clients in this scenario? That would be bad.
> >
> > --
> > View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263270.html
> > Sent from the cassandra-u...@incubator.apache.org mailing list archive
> at Nabble.com.
>
>


Re: balance between concurrent_[reads|writes] and feeding/reading threads i clients

2011-04-01 Thread Terje Marthinussen
The reason I am asking is obviously that we saw a bunch of stability issues
for a while.
We had some periods with a lot of dropped messages, but also a bunch of
dead/UP messages without drops (followed by hinted handoffs) and loads of
read repairs.

This all seems to work a lot better after increasing to 100 read/mutation
threads per node (12 node cluster).
I am not entirely sure this is a "just forget" matter. I would much prefer
that cassandra was able to somehow tell the client "this is too fast! Please
slow down" instead of having an infinitely large queue. Queuing data rarely
gets you out of trouble (it only gets you into more), so queues should in my
opinion never get longer than what is needed to smooth out peaks in
traffic.

It may also be sane to do some thread pool allocation or priority handling
for reads/writes coming in through internal communication, to avoid
triggering hints or read repairs without good reason.

I guess I could play around with the throttle limit in the round robin
scheduler for this.
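
(For reference, that is these cassandra.yaml settings; the throttle_limit
value below is just an example, not a recommendation.)

request_scheduler: org.apache.cassandra.scheduler.RoundRobinScheduler
request_scheduler_id: keyspace
request_scheduler_options:
    throttle_limit: 80    # requests allowed in flight before new ones are queued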

Regards,
Terje

On Tue, Mar 29, 2011 at 7:42 PM, aaron morton wrote:

> The concurrent_reads and concurrent_writes set the number of threads in the
> relevant thread pools. You can view the number of active and queued tasks
> using nodetool tpstats.
>
> The thread pool uses a blocking linked list for it's work queue with a max
> size of Integer.MAX_VALUE. So it's size is essentially unbounded. When
> (internode) messages are received by a node they are queued into the
> relevant thread pool for processing. When (certain) messages are executed it
> checks the send time of the message and will not process it if it is more
> than rpc_timeout (typically 10 seconds) old. This is where the "dropped
> messages" log messages come from.
>
> The coordinator will wait up to rpc_timeout for the CL number of nodes to
> respond. So if say one node is under severe load and cannot process the read
> in time, but the others are ok a request at QUORUM would probably succeed.
> However if a number of nodes are getting a beating the co-ordinator may time
> out resulting in the client getting a TimedOutException.
>
> For the read path it's a little more touchy. Only the "nearest" node is
> sent a request for the actual data, the others are asked for a digest of the
> data they would return. So if the "nearest" node is the one under load and
> times out the request will time out even if CL nodes returned. Thats what
> the DynamicSnitch is there for, a node under load would less likely to be
> considered the "nearest" node.
>
> The read and write thread pools are really just dealing with reading and
> writing data on the local machine. Your request moves through several other
> threads / thread pools: connection thread, outbound TCP pool, inbound TCP
> pool and message response pool. The SEDA paper referenced on this page was
> the model for using thread pools to manage access to resources
> http://wiki.apache.org/cassandra/ArchitectureInternals
>
> In summary, don't worry about it unless you see the thread pools backing up
> and messages being dropped.
>
> Hope that helps
> Aaron
>
> On 28 Mar 2011, at 19:55, Terje Marthinussen wrote:
>
> Hi,
>
> I was pondering about how the concurrent_read and write settings balances
> towards max read/write threads in clients.
>
> Lets say we have 3 nodes, and concurrent read/write set to 8.
> That is, 8*3=24 threads for reading and writing.
>
> Replication factor is 3.
>
> Lets say we have clients that in total set up 16 connections to each node.
>
> Now all the clients write at the same time. Since the replication factor is
> 3, you could get up to 16*3=48  concurrent write request per node (which
> needs to be handled by 8 threads)?
>
> What is the result if this load continues?
> Could you see that replication of data fails (at least initially) causing
> all kinds of fun timeouts around in the system?
>
> Same on the read side.
> If all clients read at the same time with Consistency level QUORUM, you get
> 16*2 read requests in best case (and more in worst case)?
>
> Could you see that one node answers, but another one times out due to lack
> of read threads, causing read repair which again further degrades?
>
> How does this queue up internally between thrift, gossip and the threads
> doing the actual read and writes?
>
> Regards,
> Terje
>
>
>


Re: How to repair HintsColumnFamily?

2011-04-01 Thread Terje Marthinussen
Seeing similar errors on another system (0.7.4). Maybe something bogus with
the hint columnfamilies.

Terje

On Mon, Mar 28, 2011 at 7:15 PM, Shotaro Kamio  wrote:

> I see. Then, I'll remove the HintsColumnFamily.
>
> Because our cluster has a lot of data, running repair takes much time
> (more than a day). And it's a kind of pain. It often causes disk full,
> creates many sstables and degrades read performance.
> If it's easy to fix the hint, it could be less painful solution. But I
> understand there's no other option in this case.
>
>
> Thanks,
> Shotaro
>
>
> On Sun, Mar 27, 2011 at 11:51 PM, Jonathan Ellis 
> wrote:
> > Why would you try to repair hints?
> >
> > If you run repair on the non-system data then you don't need the hint
> > data and can remove it.
> >
> > On Sun, Mar 27, 2011 at 12:17 AM, Shotaro Kamio 
> wrote:
> >> Hi,
> >>
> >> Our cluster uses cassandra 0.7.4 (upgraded from 0.7.3) with
> >> replication = 3. I found that error occurs on one node during hinted
> >> handoff with following error (log #1 below).
> >> When I tried out "scrub system HintsColumnFamily", I saw an ERROR in
> >> log (log #2 below).
> >> Do you think these errors are critical ?
> >> I tried to "repair system HintsColumnFamily". But, it refuses to run
> >> with "No neighbors". I can understand because hints are not
> >> replicated. But then, is there any way to fix it without data loss?
> >>
> >>  INFO [manual-repair-0996a2ec-26d3-4243-9586-d56daf30f9bd] 2011-03-27
> >> 13:55:05,664 AntiEntropyService.java (line 752) No neighbors to repair
> >> with: manual-repair-0996a2ec-26d3-4243-9586-d56daf30f9bd completed.
> >>
> >>
> >> Best regards,
> >> Shotaro
> >>
> >>
> >>  Log #1: Error on hinted handoff
> >> 
> >>
> >> ERROR [HintedHandoff:1] 2011-03-26 20:04:22,528
> >> DebuggableThreadPoolExecutor.java (line 103) Error in
> >> ThreadPoolExecutor
> >> java.lang.RuntimeException: java.lang.RuntimeException: error reading
> >> 4976040 of 4976067
> >>at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> >>at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>at java.lang.Thread.run(Thread.java:662)
> >> Caused by: java.lang.RuntimeException: error reading 4976040 of 4976067
> >>at
> org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:83)
> >>at
> org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:40)
> >>at
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
> >>at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
> >>at
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:108)
> >>at
> org.apache.commons.collections.iterators.CollatingIterator.anyHasNext(CollatingIterator.java:364)
> >>at
> org.apache.commons.collections.iterators.CollatingIterator.hasNext(CollatingIterator.java:217)
> >>at
> org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:63)
> >>at
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
> >>at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
> >>at
> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:116)
> >>at
> org.apache.cassandra.db.filter.QueryFilter.collectCollatedColumns(QueryFilter.java:130)
> >>at
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1368)
> >>at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1245)
> >>at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1173)
> >>at
> org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:321)
> >>at
> org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:88)
> >>at
> org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:409)
> >>at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> >>... 3 more
> >> Caused by: java.io.EOFException
> >>at java.io.RandomAccessFile.readByte(RandomAccessFile.java:591)
> >>at
> org.apache.cassandra.utils.ByteBufferUtil.readShortLength(ByteBufferUtil.java:324)
> >>at
> org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:335)
> >>at
> org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:351)
> >>at
> org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:311)
> >

balance between concurrent_[reads|writes] and feeding/reading threads i clients

2011-03-28 Thread Terje Marthinussen
Hi,

I was pondering about how the concurrent_read and write settings balances
towards max read/write threads in clients.

Lets say we have 3 nodes, and concurrent read/write set to 8.
That is, 8*3=24 threads for reading and writing.

Replication factor is 3.

Lets say we have clients that in total set up 16 connections to each node.

Now all the clients write at the same time. Since the replication factor is
3, you could get up to 16*3=48 concurrent write requests per node (which
need to be handled by 8 threads)?

What is the result if this load continues?
Could you see that replication of data fails (at least initially) causing
all kinds of fun timeouts around in the system?

Same on the read side.
If all clients read at the same time with Consistency level QUORUM, you get
16*2 read requests in best case (and more in worst case)?

Could you see that one node answers, but another one times out due to lack
of read threads, causing read repair which again further degrades?

How does this queue up internally between thrift, gossip and the threads
doing the actual read and writes?
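
(For concreteness, the two server-side settings in question are these in
cassandra.yaml; the numbers below are only the commonly quoted rules of
thumb, not recommendations.)

concurrent_reads: 32     # often sized around 16 * number of data disks
concurrent_writes: 32    # often sized around 8 * number of CPU cores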

Regards,
Terje


secondary indexes on data imported by json2sstable

2011-03-14 Thread Terje Marthinussen
Hi,

Should it be expected that secondary indexes are automatically regenerated
when importing data using json2sstable?
Or is there some manual procedure that needs to be done to generate them?

Regards,
Terje


Re: 0.7.3 nodetool scrub exceptions

2011-03-08 Thread Terje Marthinussen
I had similar errors in late 0.7.3 releases related to testing I did for the
mails with subject "Argh: Data Corruption (LOST DATA) (0.7.0)".

I do not see these corruptions or the above error anymore with the 0.7.3
release, as long as the dataset is created from scratch. The patch (2104)
mentioned in the "Argh" mail was already in the code I used though, so I am
not entirely sure what has fixed it, if it is fixed.

We made one change to our data at the same time though, as we broke up a
very long row into smaller rows. This could be related as well.

Terje


On Wed, Mar 9, 2011 at 5:45 AM, Sylvain Lebresne wrote:

> Did you run scrub as soon as you updated to 0.7.3 ?
>
> And did you had problems/exceptions before running scrub ?
> If yes, did you had problems with only 0.7.3 or also with 0.7.2 ?
>
> If the problems started with running scrub, since it takes a snapshot
> before running, can you try restarting a test cluster with this snapshot
> and see if a simple compaction work for instance.
>
> --
> Sylvain
>
>
> On Tue, Mar 8, 2011 at 5:31 PM, Karl Hiramoto  wrote:
>
>> On 08/03/2011 17:09, Jonathan Ellis wrote:
>>
>>> No.
>>>
>>> What is the history of your cluster?
>>>
>> It started out as 0.7.0 - RC3 And I've upgraded 0.7.0, 0.7.1, 0.7.2,
>> 0.7.3  within a few days after each was released.
>>
>> I have 6 nodes about 10GB of data each RF=2.   Only one CF every
>> row/column has a TTL of 24 hours.
>> I do a staggered  repair/compact/cleanup across every node in a cronjob.
>>
>>
>> After upgrading to 0.7.3  I had a lot of nodes crashing due to OOM. I
>> reduced the key cache from the default 20 to 1000 and increased the heap
>> size from 8GB to 12GB and the OOM crashes went away.
>>
>>
>> Anyway to fix this without throwing away all the data?
>>
>> Since i only keep data 24 hours,  I could insert into two CF for the next
>> 24 hours than after only valid data in new CF remove the old CF.
>>
>>
>>
>>
>>  On Tue, Mar 8, 2011 at 5:34 AM, Karl Hiramoto  wrote:
>>>
 I have 1000's of these in the log  is this normal?

 java.io.IOError: java.io.EOFException: bloom filter claims to be longer
 than
 entire row size
at

org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:117)
at

 org.apache.cassandra.db.CompactionManager.doScrub(CompactionManager.java:590)
at

 org.apache.cassandra.db.CompactionManager.access$600(CompactionManager.java:56)
at

 org.apache.cassandra.db.CompactionManager$3.call(CompactionManager.java:195)
at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
 Caused by: java.io.EOFException: bloom filter claims to be longer than
 entire row size
at

 org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:113)
at

org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:87)
... 8 more
  WARN [CompactionExecutor:1] 2011-03-08 11:32:35,615
 CompactionManager.java
 (line 625) Row is unreadable; skipping to next
  WARN [CompactionExecutor:1] 2011-03-08 11:32:35,615
 CompactionManager.java
 (line 599) Non-fatal error reading row (stacktrace follows)
 java.io.IOError: java.io.EOFException: bloom filter claims to be longer
 than
 entire row size
at

org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:117)
at

 org.apache.cassandra.db.CompactionManager.doScrub(CompactionManager.java:590)
at

 org.apache.cassandra.db.CompactionManager.access$600(CompactionManager.java:56)
at

 org.apache.cassandra.db.CompactionManager$3.call(CompactionManager.java:195)
at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
 Caused by: java.io.EOFException: bloom filter claims to be longer than
 entire row size
at

 org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:113)
at

org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:87)
... 8 more
  WARN [CompactionExecutor:1] 2011

Re: Argh: Data Corruption (LOST DATA) (0.7.0)

2011-03-05 Thread Terje Marthinussen
Hi,

Unfortunately, this patch is already included in the build I have.

Thanks for the suggestion though!
Terje

On Sat, Mar 5, 2011 at 7:47 PM, Sylvain Lebresne wrote:

> Also, if you can, please be sure to try the new 0.7.3 release. We had a bug
> with the compaction of superColumns for instance that is fixed there (
> https://issues.apache.org/jira/browse/CASSANDRA-2104). It also ships with
> a new scrub command that tries to find if your sstables are corrupted and
> repair them in that event. I can be worth trying it too.
>
> --
> Sylvain
>
>
> On Fri, Mar 4, 2011 at 7:04 PM, Benjamin Coverston <
> ben.covers...@datastax.com> wrote:
>
>>  Hi Terje,
>>
>> Can you attach the portion of your logs that shows the exceptions
>> indicating corruption? Which version are you on right now?
>>
>> Ben
>>
>>
>> On 3/4/11 10:42 AM, Terje Marthinussen wrote:
>>
>> We are seeing various other messages as well related to deserialization,
>> so this seems to be some random corruption somewhere, but so far it may seem
>> to be limited to supercolumns.
>>
>>  Terje
>>
>>  On Sat, Mar 5, 2011 at 2:26 AM, Terje Marthinussen <
>> tmarthinus...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>  Did you get anywhere on this problem?
>>>
>>>  I am seeing similar errors unfortunately :(
>>>
>>>  I tried to add some quick error checking to the serialization, and it
>>> seems like the data is ok there.
>>>
>>>  Some indication that this occurs in compaction and maybe in hinted
>>> handoff, but no indication that it occurs for sstables with newly written
>>> data.
>>>
>>>  Terje
>>>
>>
>>
>> --
>> Ben Coverston
>> DataStax -- The Apache Cassandra Companyhttp://www.datastax.com/
>>
>>
>


Re: Argh: Data Corruption (LOST DATA) (0.7.0)

2011-03-04 Thread Terje Marthinussen
We are seeing various other messages as well related to deserialization, so
this seems to be some random corruption somewhere, but so far it may seem to
be limited to supercolumns.

Terje

On Sat, Mar 5, 2011 at 2:26 AM, Terje Marthinussen
wrote:

> Hi,
>
> Did you get anywhere on this problem?
>
> I am seeing similar errors unfortunately :(
>
> I tried to add some quick error checking to the serialization, and it seems
> like the data is ok there.
>
> Some indication that this occurs in compaction and maybe in hinted handoff,
> but no indication that it occurs for sstables with newly written data.
>
> Terje
>


Re: Argh: Data Corruption (LOST DATA) (0.7.0)

2011-03-04 Thread Terje Marthinussen
Hi,

Did you get anywhere on this problem?

I am seeing similar errors unfortunately :(

I tried to add some quick error checking to the serialization, and it seems
like the data is ok there.

Some indication that this occurs in compaction and maybe in hinted handoff,
but no indication that it occurs for sstables with newly written data.

Terje


Re: 2x storage

2011-02-25 Thread Terje Marthinussen
Cassandra never compacts more than one column family at a time?

Regards,
Terje

On 26 Feb 2011, at 02:40, Robert Coli  wrote:

> On Fri, Feb 25, 2011 at 9:22 AM, A J  wrote:
>> I read in some cassandra notes that each node should be allocated
>> twice the storage capacity you wish it to contain. I think the reason
>> was during compaction another copy of SSTables have to be made before
>> the original ones are discarded.
> 
> This rule of thumb only exactly applies when you have a single CF. It
> is better stated as "your node needs to have enough room to
> successfully compact your largest columnfamily."
> 
> =Rob


Re: Fill disks more than 50%

2011-02-25 Thread Terje Marthinussen
>
>
> @Thibaut Britz
> Caveat:Using simple strategy.
> This works because cassandra scans data at startup and then serves
> what it finds. For a join for example you can rsync all the data from
> the node below/to the right of where the new node is joining. Then
> join without bootstrap then cleanup both nodes. (also you have to
> shutdown the first node so you do not have a lost write scenario in
> the time between rsync and new node startup)
>
>
Rsync all data from the node to the left/right...
Wouldn't that mean that you need 2x the data to recover?

Terje


Re: Fill disks more than 50%

2011-02-25 Thread Terje Marthinussen
> I am suggesting that your probably want to rethink your scheme design

> since partitioning by year is going to be bad performance since the
> old servers are going to be nothing more then expensive tape drives.
>

You fail to see the obvious

It is just the fact that most of the data is stale that makes the question
interesting in the first place, and I would obviously not have asked if
there were an I/O throughput problem in doing this.

Now, that said, we tested a repair on a set of nodes that were 70-80% full,
and no luck. Ran out of disk :(

Terje


Fill disks more than 50%

2011-02-23 Thread Terje Marthinussen
Hi,

Given that you have always-increasing key values (timestamps), never delete,
and hardly ever overwrite data.

Say you want to minimize work on rebalancing and statically assign (new)
token ranges to new nodes as you add them, so they always get the latest
data. Let's say you add a new node each year to handle the next year's data.

In a scenario like this, would you, with 0.7, be able to safely fill disks
significantly more than 50% and still manage things like repair/recovery of
faulty nodes?


Regards,
Terje


Re: Compression in Cassandra

2011-01-20 Thread Terje Marthinussen
Perfectly normal with a 3-7x increase in data size, depending on your data schema.

Regards,
Terje

On 20 Jan 2011, at 23:17, "akshatbakli...@gmail.com"  
wrote:

> I just did a du -h DataDump which showed 40G
> and du -h CassandraDataDump which showed 170G
> 
> am i doing something wrong.
> have you observed some compression in it.
> 
> On Thu, Jan 20, 2011 at 6:57 PM, Javier Canillas  
> wrote:
> How do you calculate your 40g data? When you insert it into Cassandra, you 
> need to convert the data into a Byte[], maybe your problem is there.
> 
> 
> On Thu, Jan 20, 2011 at 10:02 AM, akshatbakli...@gmail.com 
>  wrote:
> Hi all,
> 
> I am experiencing a unique situation. I loaded some data onto Cassandra.
> my data was about 40 GB but when loaded to Cassandra the data directory size 
> is almost 170GB.
> 
> This means the **data got inflated**.
> 
> Is it the case just with me or some else is also facing the inflation or its 
> the general behavior of Cassandra.
> 
> I am using Cassandra 0.6.8. on Ubuntu 10.10
> 
> -- 
> Akshat Bakliwal
> Search Information and Extraction Lab
> IIIT-Hyderabad 
> 09963885762
> WebPage
> 
> 
> 
> 
> 
> -- 
> Akshat Bakliwal
> Search Information and Extraction Lab
> IIIT-Hyderabad 
> 09963885762
> WebPage
> 


Re: Cassandra memtable and GC

2010-11-22 Thread Terje Marthinussen
Look at the graph again. Especially from the first posting.
The number of records/second read (by the client) goes down as disk reads go down.

Looks like something is fishy with the memtables.

Terje

On Tue, Nov 23, 2010 at 1:54 AM, Edward Capriolo wrote:

> On Mon, Nov 22, 2010 at 8:28 AM, Shotaro Kamio 
> wrote:
> > Hi Peter,
> >
> > I've tested again with recording LiveSSTableCount and MemtableDataSize
> > via jmx. I guess this result supports my suspect on memtable
> > performance because I cannot find Full GC this time.
> > This is a result in smaller data size (160million records on
> > cassandra) on different disk configuration from my previous post. But
> > the general picture doesn't change.
> >
> > The attached files:
> > - graph-read-throughput-diskT.png:  read throughput on my client program.
> > - graph-diskT-stat-with-jmx.png: graph of cpu load, LiveSSTableCount
> > and logarithm of MemtableDataSize.
> > - log-gc.20101122-12:41.160M.log.gz: GC log with -XX:+PrintGC
> > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> >
> > As you can see from the second graph, logarithm of MemtableDataSize
> > and cpu load has a clear correlation. When a memtable is flushed and a
> > new SSTable is created (LiveSSTableCount is incremented), read
> > performance will be recovered. But it degrades soon.
> > I couldn't find Full GC in GC log in this test. So, I guess that this
> > performance is not a result of GC activity.
> >
> >
> > Regards,
> > Shotaro
> >
> >
> > On Sat, Nov 20, 2010 at 6:37 PM, Peter Schuller
> >  wrote:
> >>> After a memtable flush, you see minimum cpu and maximum read
> >>> throughput both in term of disk and cassandra records read.
> >>> As memtable increase in size, cpu goes up and read drops.
> >>> If this is because of memtable or GC performance issue, this is the
> >>> big question.
> >>>
> >>> As each memtable is just 128MB when flushed, I don't really expect GC
> >>> problem or caching issues.
> >>
> >> A memtable is basically just a ConcurrentSkipListMap. Unless you are
> >> somehow triggering some kind of degenerate casein the CSLM itself,
> >> which seems unlikely, the only common circumstance where filling the
> >> memtable should be resulting in a very significant performance drop
> >> should be if you're running really close to heap size and causing
> >> additional GC asymptotally as you're growing the memtable.
> >>
> >> But that doesn't seem to be the case. I don't know, maybe I missed
> >> something in your original post, but I'm not sure what to suggest that
> >> I haven't already without further information/hands-on
> >> experimentation/observation.
> >>
> >> But running with verbose GC as I mentioned should at least be a good
> >> start (-Xloggc:path/to/gclog
> >> -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimestamps).
> >>
> >> --
> >> / Peter Schuller
> >>
> >
> >
> >
> > --
> > Shotaro Kamio
> >
>
> "As you can see from the second graph, logarithm of MemtableDataSize
> and cpu load has a clear correlation."
>
> This makes sense.
>
> "You'll see the disk read throughput is periodically going down and up.
> At 17:45:00, it shows zero disk read/sec. " --> This must mean that
> your load is being completely served from cache. If you have a very
> high cache hit rate CPU/Memory are the ONLY factor. If CPU and
> memtables are the only factor then larger memtables will start to
> perform slower then smaller memtables.
>
> Possibly with SSD the conventional thinking on Larger SSTables does
> not apply (at least for your active set)
>


Re: SSD vs. HDD

2010-11-03 Thread Terje Marthinussen
How high is high, and how much data do you have (Cassandra disk usage)?

Regards,
Terje

On 4 Nov 2010, at 04:32, Alaa Zubaidi  wrote:

> Hi,
> we have a continuous high throughput writes, read and delete, and we are 
> trying to find the best hardware.
> Is using SSD for Cassandra improves performance? Did any one compare SSD vs. 
> HDD? and any recommendations on SSDs?
> 
> Thanks,
> Alaa
> 


network configurations in medium to large size installations

2010-10-27 Thread Terje Marthinussen
Hi,

Just curious if anyone has any best practices/experiences/thoughts to share
on network configurations for cassandra setups with tens to hundreds of
nodes and high traffic (thousands of requests/sec)?

For instance:
- Do you just "hook it all together"?
- If you have 2 interfaces, do you prefer separating thrift and gossip or
bond the interfaces to get more redundancy (but less bandwidth and traffic
isolation)?
- Issues with network bandwidth. For instance, if you have major events
occurring such as adding a new replica, or if parts of a ring go offline
and you need to resync when they come up again? Especially considering that
if you have the nodes on several switches, the trunks between these switches
could get quite busy?
- As the ring increases in size, you also need more and more switches (let's
try to avoid big expensive network stuff...) and get potential bottlenecks on
the trunks. Any thoughts on scaling this? (Tree-like structures of switches,
a big single switch with loads of ports, a more flat network layout?)



Terje


Re: about insert benchmark

2010-09-02 Thread Terje Marthinussen
1000 and 10000 records take too short a time to really benchmark anything.
You will use 2 seconds just for stuff like TCP window sizes to adjust to the
level where you get full throughput.

The difference between 100k and 500k is less than 10%. Could be anything.

Filesystem caches, sizes of memtables (the default memtable settings flush a
memtable when it reaches 300k entries)... difficult to say.

You should benchmark something larger than that. You need to at least trigger
some SSTable compactions and proper Java GC work if you really want to know
what your performance is.

Terje

On Thu, Sep 2, 2010 at 4:08 PM, Thorvaldsson Justus <
justus.thorvalds...@svenskaspel.se> wrote:

>  Batchmutate insert? Can be package size that differ if not nr threads
> sending data to Cassandra nodes.
>
>
>
> *Från:* ChingShen [mailto:chingshenc...@gmail.com]
> *Skickat:* den 2 september 2010 08:59
> *Till:* user@cassandra.apache.org
> *Ämne:* Re: about insert benchmark
>
>
>
> Hi Daniel,
>
>I have 4 nodes in my cluster, and run a benchmark on node A in Java.
>   P.S. Replication = 3
>
> Shen
>
> On Thu, Sep 2, 2010 at 2:49 PM, vineet daniel 
> wrote:
>
> Hi Ching
>
> You are inserting using php,perl,python,java or ? and is cassandra
> installed locally or on a network system and is it a single system or you
> have a cluster of nodes. I know I've asked you many questions but the
> answers will help immensely to assess the results.
>
> Anyways congrats on getting better results :-) .
>
>
> ___
> Regards
> Vineet Daniel
> +918106217121
> ___
>
> Let your email find you
>
>On Thu, Sep 2, 2010 at 11:39 AM, ChingShen 
> wrote:
>
> Hi all,
>
>   I run a benchmark with my own code and found that the 100000 inserts
> performance is better than others, Why?
>  Can anyone explain it?
>
> Thanks.
>
> Partitioner = OPP
> CL = ONE
> ==
> 1000 records
> insert one:201 ms
> insert per:0.201 ms
> insert thput:4975.1245 ops/sec
> ==
> 10000 records
> insert one:1950 ms
> insert per:0.195 ms
> insert thput:5128.205 ops/sec
> ==
> 100000 records
> insert one:15576 ms
> insert per:0.15576 ms
> insert thput:6420.134 ops/sec
> ==
> 500000 records
> insert one:82177 ms
> insert per:0.164354 ms
> insert thput:6084.4272 ops/sec
>
> Shen
>
>
>
>
>


order of mutations in batch_mutate

2010-09-01 Thread Terje Marthinussen
Hi,

Just a curiosity. I should probably read some code and write a test to make
sure, but not important enough right now for that :)

   void batch_mutate(string keyspace,
                     map<string, map<string, list<Mutation>>> mutation_map,
                     ConsistencyLevel consistency_level)

Will performance of a batch_mutate be affected by the order of mutations in
the list?
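
For clarity, this is the kind of structure I mean (a rough Java sketch against
the 0.6-era Thrift interface, typed from memory, so treat the exact names as
approximate):

import java.util.*;
import org.apache.cassandra.thrift.*;

class BatchBuilder {
    // Build a mutation_map: row key -> (column family -> list of mutations).
    // The question is whether the ordering inside these lists/maps matters
    // for performance on the server side.
    static Map<String, Map<String, List<Mutation>>> build(long ts) {
        Column col = new Column("name".getBytes(), "value".getBytes(), ts);
        ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
        cosc.setColumn(col);
        Mutation m = new Mutation();
        m.setColumn_or_supercolumn(cosc);

        Map<String, List<Mutation>> perCf = new HashMap<String, List<Mutation>>();
        perCf.put("MyCF", Arrays.asList(m));

        Map<String, Map<String, List<Mutation>>> mutationMap =
            new HashMap<String, Map<String, List<Mutation>>>();
        mutationMap.put("rowkey1", perCf);
        return mutationMap;
    }
}

// and then: client.batch_mutate("MyKeyspace", mutationMap, ConsistencyLevel.ONE);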

Terje


Re: column family names

2010-08-31 Thread Terje Marthinussen
Sure, but as I am likely to have multiple clients (which I may not control) 
accessing a single store, I would prefer to keep such custom mappings out of 
the client for consistency reasons (much bigger problem than any of the 
operational issues highlighted so far).

Terje  

On 31 Aug 2010, at 23:03, David Boxenhorn  wrote:

> It's not so hard to implement your mapping suggestion in your application, 
> rather than in Cassandra, if you really want it. 
> 
> On Tue, Aug 31, 2010 at 1:05 PM, Terje Marthinussen  
> wrote:
> No benefit?
> Making it easier to use column families as part of your data model is a 
> fairly good benefit, at least given the somewhat special data model cassandra 
> offers. Much more of a benefit than the disadvantages I can imagine.
> 
> fileprefix=`sometool -fileprefix tablename`
> is something I would say is a lot more unixy than windows like.
> 
> Sorry, I don't share your concern for large scale operations here, but sure, 
> '_' does the trick for me now so thanks to Aaron for reminding me about that.
> 
> Some day, I am sure, it will be realized that unicode strings/byte arrays
> are useful here, like most other places in Cassandra (\w is a bit limited for
> some of us living in the non-ASCII part of the world...), but "what is the
> XXX way" is not the type of topic I find interesting, so another time.
> 
> Terje
> 
> 
> 
> On Tue, Aug 31, 2010 at 5:30 PM, Benjamin Black  wrote:
> This is not the Unix way for good reason: it creates all manner of
> operational challenges for no benefit.  This is how Windows does
> everything and automation and operations for large-scale online
> services is _hellish_ because of it.  This horse is sufficiently
> beaten, though.
> 
> 
> b
> 
> On Mon, Aug 30, 2010 at 11:55 PM, Terje Marthinussen
>  wrote:
> > Another option would of course be to store a mapping between dir/filenames
> > and Keyspace/column families together with other info related to keyspaces
> > and column families. Just add API/command line tools to look up the
> > filenames and maybe store the values in the files as well for recovery
> > purposes.
> >
> > Terje
> >
> > On Tue, Aug 31, 2010 at 3:39 PM, Janne Jalkanen 
> > wrote:
> >>
> >> I've been doing it for years with no technical problems. However, using
> >> "%" as the escape char tends to, in some cases, confuse a certain operating
> >> system whose name may or may not begin with "W", so using something else
> >> makes sense.
> >> However, it does require an extra cognitive step for the maintainer, since
> >> the mapping between filenames and logical names is no longer immediately
> >> obvious. Especially with multiple files this can be a pain (e.g. Chinese
> >> logical names which map to pretty incomprehensible sequences that are
> >> laborious to look up).
> >> So my experience suggests to avoid it for ops reasons, and just go with
> >> simplicity.
> >> /Janne
> >> On Aug 31, 2010, at 08:39 , Terje Marthinussen wrote:
> >>
> >> Beyond aesthetics, specific reasons?
> >>
> >> Terje
> >>
> >> On Tue, Aug 31, 2010 at 11:54 AM, Benjamin Black  wrote:
> >>>
> >>> URL encoding.
> >>>
> >>
> >
> >
> 
> 


Re: column family names

2010-08-31 Thread Terje Marthinussen
No benefit?
Making it easier to use column families as part of your data model is a
fairly good benefit, at least given the somewhat special data model
cassandra offers. Much more of a benefit than the disadvantages I can
imagine.

fileprefix=`sometool -fileprefix tablename`
is something I would say is a lot more unixy than windows like.

Sorry, I don't share your concern for large scale operations here, but sure,
'_' does the trick for me now so thanks to Aaron for reminding me about
that.

Some day, I am sure, it will be realized that unicode strings/byte arrays
are useful here, like most other places in Cassandra (\w is a bit limited for
some of us living in the non-ASCII part of the world...), but "what is the
XXX way" is not the type of topic I find interesting, so another time.

Terje


On Tue, Aug 31, 2010 at 5:30 PM, Benjamin Black  wrote:

> This is not the Unix way for good reason: it creates all manner of
> operational challenges for no benefit.  This is how Windows does
> everything and automation and operations for large-scale online
> services is _hellish_ because of it.  This horse is sufficiently
> beaten, though.
>
>
> b
>
> On Mon, Aug 30, 2010 at 11:55 PM, Terje Marthinussen
>  wrote:
> > Another option would of course be to store a mapping between
> dir/filenames
> > and Keyspace/column families together with other info related to
> keyspaces
> > and column families. Just add API/command line tools to look up the
> > filenames and maybe store the values in the files as well for recovery
> > purposes.
> >
> > Terje
> >
> > On Tue, Aug 31, 2010 at 3:39 PM, Janne Jalkanen <
> janne.jalka...@ecyrd.com>
> > wrote:
> >>
> >> I've been doing it for years with no technical problems. However, using
> >> "%" as the escape char tends to, in some cases, confuse a certain
> operating
> >> system whose name may or may not begin with "W", so using something else
> >> makes sense.
> >> However, it does require an extra cognitive step for the maintainer,
> since
> >> the mapping between filenames and logical names is no longer immediately
> >> obvious. Especially with multiple files this can be a pain (e.g. Chinese
> >> logical names which map to pretty incomprehensible sequences that are
> >> laborious to look up).
> >> So my experience suggests to avoid it for ops reasons, and just go with
> >> simplicity.
> >> /Janne
> >> On Aug 31, 2010, at 08:39 , Terje Marthinussen wrote:
> >>
> >> Beyond aesthetics, specific reasons?
> >>
> >> Terje
> >>
> >> On Tue, Aug 31, 2010 at 11:54 AM, Benjamin Black  wrote:
> >>>
> >>> URL encoding.
> >>>
> >>
> >
> >
>


Re: column family names

2010-08-30 Thread Terje Marthinussen
Another option would of course be to store a mapping between dir/filenames
and Keyspace/column families together with other info related to keyspaces
and column families. Just add API/command line tools to look up the
filenames and maybe store the values in the files as well for recovery
purposes.

Terje

On Tue, Aug 31, 2010 at 3:39 PM, Janne Jalkanen wrote:

>
> I've been doing it for years with no technical problems. However, using "%"
> as the escape char tends to, in some cases, confuse a certain operating
> system whose name may or may not begin with "W", so using something else
> makes sense.
>
> However, it does require an extra cognitive step for the maintainer, since
> the mapping between filenames and logical names is no longer immediately
> obvious. Especially with multiple files this can be a pain (e.g. Chinese
> logical names which map to pretty incomprehensible sequences that are
> laborious to look up).
>
> So my experience suggests to avoid it for ops reasons, and just go with
> simplicity.
>
> /Janne
>
> On Aug 31, 2010, at 08:39 , Terje Marthinussen wrote:
>
> Beyond aesthetics, specific reasons?
>
> Terje
>
> On Tue, Aug 31, 2010 at 11:54 AM, Benjamin Black  wrote:
>
>> URL encoding.
>>
>>
>

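A hypothetical sketch of the lookup tool suggested above, assuming the name
mapping is kept in a simple properties file maintained alongside the keyspace
metadata (the file name and format are invented here purely for illustration):

    // Hypothetical: map a logical column family name to its on-disk file
    // prefix, so scripts can do: fileprefix=`java FilePrefixLookup MyCF`
    import java.io.FileInputStream;
    import java.util.Properties;

    public class FilePrefixLookup {
        public static void main(String[] args) throws Exception {
            String columnFamily = args[0];
            Properties mapping = new Properties();
            // e.g. entries like: MyCF=cf_0003
            mapping.load(new FileInputStream("cf-file-mapping.properties"));
            String prefix = mapping.getProperty(columnFamily);
            if (prefix == null) {
                System.err.println("unknown column family: " + columnFamily);
                System.exit(1);
            }
            System.out.println(prefix);
        }
    }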

Re: column family names

2010-08-30 Thread Terje Marthinussen
Beyond aesthetics, specific reasons?

Terje

On Tue, Aug 31, 2010 at 11:54 AM, Benjamin Black  wrote:

> URL encoding.
>
>

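For context on the URL-encoding suggestion: a quick sketch of what it would
produce if column family names were encoded for use as file name prefixes.
Plain ASCII names pass through unchanged, while anything else turns into %XX
escape sequences, which is exactly the readability concern (and the
'%'-on-Windows concern) raised elsewhere in the thread:

    // Sketch: URL-encode column family names for use as file name prefixes.
    import java.net.URLEncoder;

    public class CfNameEncoding {
        public static void main(String[] args) throws Exception {
            System.out.println(URLEncoder.encode("Users", "UTF-8"));
            // prints: Users
            System.out.println(URLEncoder.encode("用户档案", "UTF-8"));
            // prints: %E7%94%A8%E6%88%B7%E6%A1%A3%E6%A1%88 (hard to map back by eye)
        }
    }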

Re: cassandra disk usage

2010-08-30 Thread Terje Marthinussen
On Mon, Aug 30, 2010 at 10:10 PM, Jonathan Ellis  wrote:

> column names are stored per cell
>
> (moving to user@)
>


I think that is already accommodated for in my numbers?

What I listed was measured from the actual SSTable file (using the output
from the "strings" command on the file), so the repeated copies of the
supercolumn and column names are already part of the strings output.

Typically, you get something like this as output from strings:
20100629
20100629
20100629

java.util.BitSetn
bitst
[Jxpur
[Jx

repeating.

I am not entirely sure why I get those repeating supercolumn names there
(there are more supercolumn names in this file than column names, which is
not logical; it should be the other way around!), but I will have a closer
look at that one.

These strings make up about half of the total data, the remainder being
binary and tons of null bytes.

The strings command (which will of course give me some binary noise) returns
14,943,928 bytes (or rather characters) of data.
If we ignore the binary noise for a second and also count the number of null
bytes in this file, we get:

Text: 14,943,928 bytes (as mentioned in my previous posting, 9.4MB of this
is column headers)
Null Bytes: 14,634,412 bytes
Other (binary): 8,580,188 bytes
Total size: 38,158,528

Yes yes yes, doing this is ugly and lots of null bytes can occur for many
reasons (no need to tell me that), but chew on that number for a second
and take a look at an SSTable near you: there is a heck of a lot of nothing
there.

Should be noted that this is 0.7 beta 1.

I realize that this code will change dramatically by 0.8, so this is probably
not worth spending too much time on, but the expansion of data is pretty
excessive in many scenarios, so I just looked briefly at an actual file to try
to understand it a bit better.

Terje

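The text/null/other accounting above is straightforward to reproduce. A minimal
sketch, noting that the printable-ASCII test is only a rough stand-in for
strings(1), which counts runs of four or more printable characters:

    // Classify the bytes of an SSTable data file into printable text, NUL
    // bytes, and everything else, roughly reproducing the breakdown above.
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    public class SSTableByteBreakdown {
        public static void main(String[] args) throws Exception {
            long text = 0, nul = 0, other = 0;
            InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
            int b;
            while ((b = in.read()) != -1) {
                if (b == 0) nul++;
                else if (b >= 0x20 && b < 0x7f) text++;  // printable ASCII
                else other++;
            }
            in.close();
            System.out.println("text:  " + text);
            System.out.println("nul:   " + nul);
            System.out.println("other: " + other);
            System.out.println("total: " + (text + nul + other));
        }
    }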
