Re: Best Approaches for Developer Integration

2011-02-08 Thread Stephen Connolly
On 8 February 2011 06:40, Paul Brown  wrote:
>
> On Feb 7, 2011, at 10:28 PM, Paul Querna wrote:
>> So, I guess this is coming down to:
>>  1) Has anyone built any easy to install packages of Cassandra?
>
> I didn't find it necessary.  I implemented a simple embedding wrapper for 
> Cassandra so that it could be started as part of a web application lifecycle 
> (Spring-managed).  Developers just start up a personal copy of the service as 
> part of "mvn -Pembedded-cassandra jetty:run" being none the wiser about the 
> Cassandra underneath unless they care to be.
>

Mojo's Cassandra Maven Plugin makes this even easier

mvn cassandra:start jetty:run

If you don't want to modify your pom.xml to move either Jetty or
Cassandra off port 8080, then you'll end up with

mvn cassandra:start jetty:run -Dcassandra.jmxPort=_

Note: the plugin has not been released yet... 48hr left on the release vote
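
For reference, once it is released, wiring it into the pom.xml looks
something like this (a sketch; I'm assuming the usual Mojo coordinates,
and that the <jmxPort> parameter matches the -Dcassandra.jmxPort
property above -- check the plugin site once the vote closes):

  <plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>cassandra-maven-plugin</artifactId>
    <configuration>
      <!-- move Cassandra's JMX port so jetty:run can keep 8080 -->
      <jmxPort>7199</jmxPort>
    </configuration>
  </plugin>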

>>  2) Can anyone explain their experience with getting non-Cassandra
>> developers up and running in a development environment? What worked?
>> What was hard?
>
> I've had technically savvy non-developer resources perfectly happy to work 
> with the system via Ruby, PHP, or even the Cassandra CLI.  "Just do mvn 
> -Pembedded-cassandra jetty:run" was too much in that case, but "Here are some 
> useful libraries and here's where the prod/staging clusters are" was fine.
>
> -- Paul B
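
For the curious, the guts of such a wrapper can be tiny, since Cassandra
ships an embedded service class. A minimal sketch (assuming 0.7, with
cassandra.yaml reachable via the usual system properties; the bean name
and the Spring wiring are illustrative):

import java.io.IOException;
import org.apache.cassandra.service.EmbeddedCassandraService;

// Lifecycle bean: start() boots a single Cassandra node inside this
// JVM, so developers need no separate install.
public class DevCassandraBean
{
    private EmbeddedCassandraService cassandra;

    public void start() throws IOException  // e.g. a Spring init-method
    {
        cassandra = new EmbeddedCassandraService();
        cassandra.start();
    }
}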


Re: Do supercolumns have a purpose?

2011-02-08 Thread David Boxenhorn
Shaun, I agree with you, but marking them as deprecated is not good enough
for me. I can't easily stop using supercolumns. I need an upgrade path.

On Tue, Feb 8, 2011 at 3:53 AM, Shaun Cutts  wrote:

>
> I'm a newbie here, but, with apologies for my presumptuousness, I think you
> should deprecate SuperColumns. They are already distracting you, and as the
> years go by the cost of supporting them as you add more and more
> functionality is only likely to get worse. It would be better to concentrate
> on making the "core" column families better (and I'm sure we can all think
> of lots of things we'd like).
>
> Just dropping SuperColumns would be bad for your reputation -- and for
> users like David who are currently using them. But if you mark them clearly
> as deprecated and explain why and what to do instead (perhaps putting a bit
> of effort into migration tools... or even a "virtual" layer supporting
> arbitrary hierarchical data), then you can drop them in a few years (when
> you get to 1.0, say), without people feeling betrayed.
>
> -- Shaun
>
> On Feb 6, 2011, at 3:48 AM, David Boxenhorn wrote:
>
> "My main point was to say that it's think it is better to create tickets
> for what you want, rather than for something else completely different that
> would, as a by-product, give you what you want."
>
> Then let me say what I want: I want supercolumn families to have any
> feature that regular column families have.
>
> My data model is full of supercolumns. I used them, even though I knew I
> didn't *have to*, "because they were there", which implied to me that I was
> supposed to use them for some good reason. Now I suspect that they will
> gradually become less and less functional, as features are added to regular
> column families and not supported for supercolumn families.
>
>
> On Fri, Feb 4, 2011 at 10:58 AM, Sylvain Lebresne wrote:
>
>> On Fri, Feb 4, 2011 at 12:35 AM, Mike Malone  wrote:
>>
>>> On Thu, Feb 3, 2011 at 6:44 AM, Sylvain Lebresne 
>>> wrote:
>>>
 On Thu, Feb 3, 2011 at 3:00 PM, David Boxenhorn wrote:

> The advantage would be to enable secondary indexes on supercolumn
> families.
>

 Then I suggest opening a ticket for adding secondary indexes to
 supercolumn families and voting on it. This will be 1 or 2 orders of
 magnitude less work than getting rid of super columns internally, and
 probably a much better solution anyway.

>>>
>>> I realize that this is largely subjective, and on such matters code
>>> speaks louder than words, but I don't think I agree with you on the issue of
>>> which alternative is less work, or even which is a better solution.
>>>
>>
>> You are right, I probably put too much emphasis in that sentence. My main
>> point was to say that I think it is better to create tickets for what you
>> want, rather than for something else completely different that would, as a
>> by-product, give you what you want.
>> Then I suspect that *if* the only goal is to get secondary indexes on
>> super columns, then there is a good chance this would be less work than
>> getting rid of super columns. But to be fair, secondary indexes on super
>> columns may not make too much sense without #598, which itself would require
>> quite some work, so clearly I spoke a bit quickly.
>>
>>
>>> If the goal is to have a hierarchical model, limiting the depth to two
>>> seems arbitrary. Why not go all the way and allow an arbitrarily deep
>>> hierarchy?
>>>
>>> If a more sophisticated hierarchical model is deemed unnecessary, or
>>> impractical, allowing a depth of two seems inconsistent and
>>> unnecessary. It's pretty trivial to overlay a hierarchical model on top of
>>> the map-of-sorted-maps model that Cassandra implements. Ed Anuff has
>>> implemented a custom comparator that does the job [1]. Google's Megastore
>>> has a similar architecture and goes even further [2].
>>>
>>> It seems to me that super columns are a historical artifact from
>>> Cassandra's early life as Facebook's inbox storage system. They needed
>>> posting lists of messages, sharded by user. So that's what they built. In my
>>> dealings with the Cassandra code, super columns end up making a mess all
>>> over the place when algorithms need to be special cased and branch based on
>>> the column/supercolumn distinction.
>>>
>>> I won't even mention what it does to the thrift interface.
>>>
>>
>> Actually, I agree with you, more than you know. If I were to start coding
>> Cassandra now, I wouldn't include super columns (and I would probably not go
>> for a depth unlimited hierarchical model either). But it's there and I'm not
>> sure getting rid of them fully (meaning, including in thrift) is an option
>> (it would be a big compatibility breakage). And (even though I certainly
>> thought about this more than once :)) I'm slightly less enthusiastic about
>> keeping them in thrift but encoding them in regular column family
>> internally: it would still be a lot of work but we would 

cassandra-cli (output) broken for super columns

2011-02-08 Thread Timo Nentwig
This is not what it's supposed to be like, is it?

[default@foo] get foo[page-field];  
  
=> (super_column=20110208,
 (column=82f4c650-2d53-11e0-a08b-58b035f3f60d, value=msg1, 
timestamp=1297159430471000)
 (column=82f4c650-2d53-11e0-a08b-58b035f3f60e, value=msg2, 
timestamp=1297159437423000)
 (column=82f4c650-2d53-11e0-a08b-58b035f3f60f, value=msg3, 
timestamp=1297159439855000))
Returned 1 results.

[default@foo] get foo[page-field][20110208];
  
, value=msg1, timestamp=1297159430471000)
=> (column=???P-S???X?5??, value=msg2, timestamp=1297159437423000)
=> (column=???P-S???X?5??, value=msg3, timestamp=1297159439855000)
Returned 3 results.

[default@foo] get 
foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60d];
, value=msg1, timestamp=1297159430471000)

[default@foo] get 
foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60e];
=> (column=???P-S???X?5??, value=msg2, timestamp=1297159437423000)


- name: foo
  column_type: Super
  compare_with: AsciiType
  compare_subcolumns_with: TimeUUIDType
  default_validation_class: AsciiType

Re: OOM during batch_mutate

2011-02-08 Thread Patrik Modesto
On Tue, Feb 8, 2011 at 00:05, Jonathan Ellis  wrote:
> Sounds like the keyspace was created on the 32GB machine, so it
> guessed memtable sizes that are too large when run on the 16GB one.
> Use "update column family" from the cli to cut the throughput and
> operations thresholds in half, or to 1/4 to be cautious.

That was exactly the problem, thanks a lot Jonathan!
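
For anyone else hitting this, the change is one CLI statement per column
family, e.g. (CF name and numbers illustrative -- halve whatever
"describe keyspace" reports for yours):

update column family MyCF with memtable_throughput = 32 and memtable_operations = 0.15;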

Patrik


Can serialized objects in columns serve as ersatz superCFs?

2011-02-08 Thread buddhasystem

Seeing the discussion here about indexes not being supported in superCFs, and
the less than clear future of superCFs altogether, I was thinking about getting
a modicum of the same functionality with serialized objects inside columns.
This way the column key becomes a sort of analog of the supercolumn key, and I
handle the dictionaries I receive in the client.

Does this sound OK?
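
Concretely, something like this is what I mean (Java sketch using the
json-simple jar that already ships in Cassandra's lib; the names are
made up):

import org.json.simple.JSONObject;

public class ErsatzSuper
{
    // The column name plays the role of the supercolumn key; the
    // would-be subcolumns get packed into the column value as JSON.
    public static byte[] packSubcolumns() throws Exception
    {
        JSONObject sub = new JSONObject();
        sub.put("street", "350 5th Ave");
        sub.put("city", "New York");
        return sub.toJSONString().getBytes("UTF-8");
    }
}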



Re: cassandra-cli (output) broken for super columns

2011-02-08 Thread Stephen Connolly
On 8 February 2011 10:38, Timo Nentwig  wrote:
> This is not what it's supposed to be like, is it?
>
> [default@foo] get foo[page-field];
> => (super_column=20110208,
>     (column=82f4c650-2d53-11e0-a08b-58b035f3f60d, value=msg1, 
> timestamp=1297159430471000)
>     (column=82f4c650-2d53-11e0-a08b-58b035f3f60e, value=msg2, 
> timestamp=1297159437423000)
>     (column=82f4c650-2d53-11e0-a08b-58b035f3f60f, value=msg3, 
> timestamp=1297159439855000))
> Returned 1 results.
>
> [default@foo] get foo[page-field][20110208];
> , value=msg1, timestamp=1297159430471000)
> => (column=???P-S???X?5??, value=msg2, timestamp=1297159437423000)
> => (column=???P-S???X?5??, value=msg3, timestamp=1297159439855000)
> Returned 3 results.
>
> [default@foo] get 
> foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60d];
> , value=msg1, timestamp=1297159430471000)
>
> [default@foo] get 
> foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60e];
> => (column=???P-S???X?5??, value=msg2, timestamp=1297159437423000)
>
>
>        - name: foo
>          column_type: Super
>          compare_with: AsciiType
>          compare_subcolumns_with: TimeUUIDType
>          default_validation_class: AsciiType

Is it the ?'s that you are complaining about or is it something else?

If it is the ?'s, have you got a mismatch between the character
encoding in your shell and UTF-8?


Re: cassandra-cli (output) broken for super columns

2011-02-08 Thread Timo Nentwig

On Feb 8, 2011, at 13:41, Stephen Connolly wrote:

> On 8 February 2011 10:38, Timo Nentwig  wrote:
>> This is not what it's supposed to be like, is it?

Looks alright:

>> [default@foo] get foo[page-field];
>> => (super_column=20110208,
>> (column=82f4c650-2d53-11e0-a08b-58b035f3f60d, value=msg1, 
>> timestamp=1297159430471000)
>> (column=82f4c650-2d53-11e0-a08b-58b035f3f60e, value=msg2, 
>> timestamp=1297159437423000)
>> (column=82f4c650-2d53-11e0-a08b-58b035f3f60f, value=msg3, 
>> timestamp=1297159439855000))
>> Returned 1 results.

The first half of column 1 is missing, and the UUID is not printed correctly anymore:

>> [default@foo] get foo[page-field][20110208];
>> , value=msg1, timestamp=1297159430471000)
>> => (column=???P-S???X?5??, value=msg2, timestamp=1297159437423000)
>> => (column=???P-S???X?5??, value=msg3, timestamp=1297159439855000)
>> Returned 3 results.

Still prints only half of the column:

>> [default@foo] get 
>> foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60d];
>> , value=msg1, timestamp=1297159430471000)

Applies only to first column?!

>> [default@foo] get 
>> foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60e];
>> => (column=???P-S???X?5??, value=msg2, timestamp=1297159437423000)
>> 
>> 
>>- name: foo
>>  column_type: Super
>>  compare_with: AsciiType
>>  compare_subcolumns_with: TimeUUIDType
>>  default_validation_class: AsciiType
> 
> Is it the ?'s that you are complaining about or is it something else?
> 
> If it is the ?'s, have you got a mismatch between the character
> encoding in your shell and UTF-8?

Nope. See above :) Esp. that the first column isn't printed completely.

Re: Best way to detect/fix bitrot today?

2011-02-08 Thread Anand Somani
I should have clarified: we have 3 copies, so in that case, as long as 2 match,
we should be OK?

Even if there were checksumming at the SSTable level, I assume it has to
check and report these errors on compaction (or node repair)?

I have seen some JIRA issues open on this (47 and 1717), but if I need
something today, is a read repair (or a node repair) the only viable
option?



On Mon, Feb 7, 2011 at 12:09 PM, Peter Schuller  wrote:

> > Our application space is such that there is data that might not be read
> for
> > a long time. The data is mostly immutable. How should I approach
> > detecting/solving the bitrot problem? One approach is read data and let
> read
> > repair do the detection, but given the size of data, that does not look
> very
> > efficient.
>
> Note that read-repair is not really intended to repair arbitrary
> corruptions. Unless I'm mistaken, with arbitrary corruption (unless it
> triggers a serialization failure that causes row skipping) it's a
> toss-up which version of the data is retained (or both, if the
> corruption is in the key). Given the same key and column timestamp,
> the tie breaker is the column value. So depending on whether
> corruption results in a "lesser" or "greater" value, you might get the
> corrupt or non-corrupt data.
>
> > Has anybody solved/workaround this or has any other suggestions to detect
> > and fix bitrot?
>
> My feel/tentative opinion is that the clean fix is for Cassandra to
> support strong checksumming at the sstable level.
>
> Deploying on e.g. ZFS would help a lot with this, but that's a problem
> for deployment on Linux (which is the recommended platform for
> Cassandra).
>
> --
> / Peter Schuller
>


Re: Best way to detect/fix bitrot today?

2011-02-08 Thread Shaun Cutts
One thing that we're doing for (guaranteed) immutable data is to use MD5
signatures as keys... this will also prevent duplication, and it will make
detection (if not correction) of bitrot at the app level easy.
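
Something like this (sketch):

import java.security.MessageDigest;

public class Md5Keys
{
    // Key = MD5 of the (immutable) value. Identical payloads collapse
    // to one row, and a reader can re-hash the value and compare it
    // with the key to spot bitrot.
    public static String keyFor(byte[] value) throws Exception
    {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(value))
            hex.append(String.format("%02x", b));
        return hex.toString();
    }
}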

On Feb 8, 2011, at 9:23 AM, Anand Somani wrote:

> I should have clarified: we have 3 copies, so in that case, as long as 2 match,
> we should be OK?
>
> Even if there were checksumming at the SSTable level, I assume it has to
> check and report these errors on compaction (or node repair)?
>
> I have seen some JIRA issues open on this (47 and 1717), but if I need
> something today, is a read repair (or a node repair) the only viable option?
> 
>  
> 
> On Mon, Feb 7, 2011 at 12:09 PM, Peter Schuller  
> wrote:
> > Our application space is such that there is data that might not be read for
> > a long time. The data is mostly immutable. How should I approach
> > detecting/solving the bitrot problem? One approach is read data and let read
> > repair do the detection, but given the size of data, that does not look very
> > efficient.
> 
> Note that read-repair is not really intended to repair arbitrary
> corruptions. Unless I'm mistaken, with arbitrary corruption (unless it
> triggers a serialization failure that causes row skipping) it's a
> toss-up which version of the data is retained (or both, if the
> corruption is in the key). Given the same key and column timestamp,
> the tie breaker is the column value. So depending on whether
> corruption results in a "lesser" or "greater" value, you might get the
> corrupt or non-corrupt data.
> 
> > Has anybody solved/workaround this or has any other suggestions to detect
> > and fix bitrot?
> 
> My feel/tentative opinion is that the clean fix is for Cassandra to
> support strong checksumming at the sstable level.
> 
> Deploying on e.g. ZFS would help a lot with this, but that's a problem
> for deployment on Linux (which is the recommended platform for
> Cassandra).
> 
> --
> / Peter Schuller
> 



Re: time to live rows

2011-02-08 Thread Kallin Nagelberg
So the empty row will be ultimately removed then? Is there a way to
force the GC to verify this?

Thanks,
-Kal

On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
> The expired columns were converted into tombstones, which will live for the
> GC timeout. The "empty" row will be cleaned up when those tombstones are
> removed.
> Returning the empty row is unfortunate... we'd love to find a more
> appropriate solution that might not involve endless scanning.
> See
> http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
> http://wiki.apache.org/cassandra/FAQ#range_ghosts
>
> On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
>  wrote:
>>
>> I also tried forcing a major compaction on the column family using JMX
>> but the row remains.
>>
>> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
>>  wrote:
>> > I tried that but I still see the row coming back on a list
>> >  in the CLI. My concern is that there will be a pointer
>> > to an empty row for all eternity.
>> >
>> > -Kal
>> >
>> > On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton 
>> > wrote:
>> >> Deleting all the columns in a row via TTL has the same effect as
>> >> deleting the
>> >> row; the data will physically be removed during compaction.
>> >>
>> >> Aaron
>> >>
>> >>
>> >> On 08 Feb, 2011, at 10:24 AM, Bill Speirs  wrote:
>> >>
>> >> I don't think this is supported (but I could be completely wrong).
>> >> However, I'd love to see this functionality as well.
>> >>
>> >> How would one go about requesting such a feature?
>> >>
>> >> Bill-
>> >>
>> >> On Mon, Feb 7, 2011 at 4:15 PM, Kallin Nagelberg
>> >>  wrote:
>> >>> Hey,
>> >>>
>> >>> I have read about the new TTL columns in Cassandra 0.7. In my case I'd
>> >>> like to expire an entire row automatically after a certain amount of
>> >>> time. Is this possible as well?
>> >>>
>> >>> Thanks,
>> >>> -Kal
>> >>>
>> >>
>> >
>
>


Re: time to live rows

2011-02-08 Thread David Boxenhorn
I hope you don't consider this a hijack of the thread...

What I'd like to know is the following:

The GC removes TTL rows some time after they expire, at its convenience. But
will they stop being returned as soon as they expire? (This is the expected
behavior...)

On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg  wrote:

> So the empty row will be ultimately removed then? Is there a way to
> force the GC to verify this?
>
> Thanks,
> -Kal
>
> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
> > The expired columns were converted into tombstones, which will live for
> the
> > GC timeout. The "empty" row will be cleaned up when those tombstones are
> > removed.
> > Returning the empty row is unfortunate... we'd love to find a more
> > appropriate solution that might not involve endless scanning.
> > See
> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
> >
> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
> >  wrote:
> >>
> >> I also tried forcing a major compaction on the column family using JMX
> >> but the row remains.
> >>
> >> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
> >>  wrote:
> >> > I tried that but I still see the row coming back on a list
> >> >  in the CLI. My concern is that there will be a pointer
> >> > to an empty row for all eternity.
> >> >
> >> > -Kal
> >> >
> >> > On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton  >
> >> > wrote:
> >> >> Deleting all the columns in a row via TTL has the same effect as
> >> >> deleting the
> >> >> row; the data will physically be removed during compaction.
> >> >>
> >> >> Aaron
> >> >>
> >> >>
> >> >> On 08 Feb, 2011, at 10:24 AM, Bill Speirs 
> wrote:
> >> >>
> >> >> I don't think this is supported (but I could be completely wrong).
> >> >> However, I'd love to see this functionality as well.
> >> >>
> >> >> How would one go about requesting such a feature?
> >> >>
> >> >> Bill-
> >> >>
> >> >> On Mon, Feb 7, 2011 at 4:15 PM, Kallin Nagelberg
> >> >>  wrote:
> >> >>> Hey,
> >> >>>
> >> >>> I have read about the new TTL columns in Cassandra 0.7. In my case
> I'd
> >> >>> like to expire an entire row automatically after a certain amount of
> >> >>> time. Is this possible as well?
> >> >>>
> >> >>> Thanks,
> >> >>> -Kal
> >> >>>
> >> >>
> >> >
> >
> >
>


Subcolumn Indexing

2011-02-08 Thread Jeremy.Truelove
I had a question about a sentence on the data model and how things are stored
and retrieved that I came across in the Data Model chapter of the O'Reilly book.

"Cassandra does not index subcolumns, so when you load a super column into 
memory, all of its columns are loaded as well."

Does this just mean the exhaustive list of the column names, not all the
values? So if I have a super column that has a map of keys that each contain
only two columns max, this shouldn't really be a performance concern, correct?
It becomes an issue when you have lots of subcolumns, if I'm reading this
correctly? I'm looking at using the super column as a good way to cluster data;
say I was storing home addresses, I might use the zipcode as the super column
if I cared mostly about accessing data by logical area, for instance. Thanks for
the help.
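
In other words, a layout like this (sketch; keys and values made up):

Addresses (Super CF)
  row key: some-user-or-region
    super column: 10001              <- zipcode
      column: addr1 = "350 5th Ave"
      column: addr2 = "1 Penn Plaza"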

jt



Re: Cassandra memory consumption

2011-02-08 Thread Victor Kabdebon
It is really weird that I am the only one to have this issue.
I restarted Cassandra today and already the memory consumption is over the
limit:

root  1739  4.0 24.5 664968 *494996* pts/4   SLl  15:51   0:12
/usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dstorage-config=bin/../conf -cp
bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
org.apache.cassandra.thrift.CassandraDaemon

It is really an annoying problem if we cannot really foresee memory
consumption.

Best regards,
Victor K

2011/2/8 Victor Kabdebon 

> Dear all,
>
> Sorry to come back again to this point but I am really worried about
> Cassandra memory consumption. I have a single machine that runs one
> Cassandra server. There is almost no data on it but I see a crazy memory
> consumption and it doesn't care at all about the instructions...
> Note that I am not using mmap but "Standard", I also use JNA (inside the lib
> folder), and I am running on Debian 5 64-bit, so a pretty normal configuration.
> I also use Cassandra 0.6.8.
>
>
> Here is the information I gathered on Cassandra:
>
> 105  16765  0.1 34.1 1089424* 687476* ?  Sl   Feb02  14:58
> /usr/bin/java -ea* -Xms128M* *-Xmx256M* -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
> -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError
> -Dcom.sun.management.jmxremote.port=8081
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
> org.apache.cassandra.thrift.CassandraDaemon
>
> result of nodetool info :
>
> 116024732779488843382476400091948985708
> *Load : 1,94 MB*
> Generation No: 1296673772
> Uptime (seconds) : 467550
> *Heap Memory (MB) : 120,26 / 253,94*
>
>
> I have about 21 column families, and none of them holds a lot of information
> (as you can see, I have 2 MB of text, which is really small). Even if I set
> Xmx at 256 there is 687M of memory used. Where does this memory come from? Bad
> garbage collection? Something I am not aware of?
> Thank you for your help, I really need to get rid of this problem.
>
> Best regards,
> Victor Kabdebon
>


Re: time to live rows

2011-02-08 Thread Sylvain Lebresne
> So the empty row will be ultimately removed then? Is there a way to
> force the GC to verify this?
>

Set a GcGraceSecond very low and force a major compaction.
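
For example, assuming a CF named Session (gc_grace being the CLI
spelling of GCGraceSeconds):

update column family Session with gc_grace = 60;

and then, from a shell:

nodetool -h localhost compact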


>
> Thanks,
> -Kal
>
> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
> > The expired columns were converted into tombstones, which will live for
> the
> > GC timeout. The "empty" row will be cleaned up when those tombstones are
> > removed.
> > Returning the empty row is unfortunate... we'd love to find a more
> > appropriate solution that might not involve endless scanning.
> > See
> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
> >
> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
> >  wrote:
> >>
> >> I also tried forcing a major compaction on the column family using JMX
> >> but the row remains.
> >>
> >> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
> >>  wrote:
> >> > I tried that but I still see the row coming back on a list
> >> >  in the CLI. My concern is that there will be a pointer
> >> > to an empty row for all eternity.
> >> >
> >> > -Kal
> >> >
> >> > On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton  >
> >> > wrote:
> >> >> Deleting all the columns in a row via TTL has the same effect as
> >> >> deleting the
> >> >> row; the data will physically be removed during compaction.
> >> >>
> >> >> Aaron
> >> >>
> >> >>
> >> >> On 08 Feb, 2011, at 10:24 AM, Bill Speirs 
> wrote:
> >> >>
> >> >> I don't think this is supported (but I could be completely wrong).
> >> >> However, I'd love to see this functionality as well.
> >> >>
> >> >> How would one go about requesting such a feature?
> >> >>
> >> >> Bill-
> >> >>
> >> >> On Mon, Feb 7, 2011 at 4:15 PM, Kallin Nagelberg
> >> >>  wrote:
> >> >>> Hey,
> >> >>>
> >> >>> I have read about the new TTL columns in Cassandra 0.7. In my case
> I'd
> >> >>> like to expire an entire row automatically after a certain amount of
> >> >>> time. Is this possible as well?
> >> >>>
> >> >>> Thanks,
> >> >>> -Kal
> >> >>>
> >> >>
> >> >
> >
> >
>


Re: time to live rows

2011-02-08 Thread Sylvain Lebresne
>
> I hope you don't consider this a hijack of the thread...
>
> What I'd like to know is the following:
>
> The GC removes TTL rows some time after they expire, at its convenience.
> But will they stop being returned as soon as they expire? (This is the
> expected behavior...)
>

It is the individual columns that have TTLs. When a column expires, it becomes
a deletion tombstone. Now, a row with tombstones (even only tombstones) will
show up during range requests. The explanation is here:
http://wiki.apache.org/cassandra/FAQ#range_ghosts


>
> On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg <
> kallin.nagelb...@gmail.com> wrote:
>
>> So the empty row will be ultimately removed then? Is there a way to
>> force the GC to verify this?
>>
>> Thanks,
>> -Kal
>>
>> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
>> > The expired columns were converted into tombstones, which will live for
>> the
>> > GC timeout. The "empty" row will be cleaned up when those tombstones are
>> > removed.
>> > Returning the empty row is unfortunate... we'd love to find a more
>> > appropriate solution that might not involve endless scanning.
>> > See
>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
>> >
>> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
>> >  wrote:
>> >>
>> >> I also tried forcing a major compaction on the column family using JMX
>> >> but the row remains.
>> >>
>> >> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
>> >>  wrote:
>> >> > I tried that but I still see the row coming back on a list
>> >> >  in the CLI. My concern is that there will be a pointer
>> >> > to an empty row for all eternity.
>> >> >
>> >> > -Kal
>> >> >
>> >> > On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton <
>> aa...@thelastpickle.com>
>> >> > wrote:
>> >> >> Deleting all the columns in a row via TTL has the same effect as
>> >> >> deleting the
>> >> >> row; the data will physically be removed during compaction.
>> >> >>
>> >> >> Aaron
>> >> >>
>> >> >>
>> >> >> On 08 Feb, 2011, at 10:24 AM, Bill Speirs 
>> wrote:
>> >> >>
>> >> >> I don't think this is supported (but I could be completely wrong).
>> >> >> However, I'd love to see this functionality as well.
>> >> >>
>> >> >> How would one go about requesting such a feature?
>> >> >>
>> >> >> Bill-
>> >> >>
>> >> >> On Mon, Feb 7, 2011 at 4:15 PM, Kallin Nagelberg
>> >> >>  wrote:
>> >> >>> Hey,
>> >> >>>
>> >> >>> I have read about the new TTL columns in Cassandra 0.7. In my case
>> I'd
>> >> >>> like to expire an entire row automatically after a certain amount
>> of
>> >> >>> time. Is this possible as well?
>> >> >>>
>> >> >>> Thanks,
>> >> >>> -Kal
>> >> >>>
>> >> >>
>> >> >
>> >
>> >
>>
>
>


Re: Finding the intersection results of column sets of two rows

2011-02-08 Thread Aklin_81
Amongst the two rows where I need to find the common columns, I will not
have more than 200 columns (in 99% of cases) in the 1st row. But the 2nd
row, where I need to find these columns, may have around a million
valueless columns.

A point to note: these calculations are all done for **writing data
collected from the presentation layer to the database**, not while
presenting data.

I am using the results of such an intersection to find the rows (pointed
to by the names of the common columns) that I should write to. The
calculations are done after a post is submitted by a user in a
discussion forum. This is used to find the mutual connections in a
group and write to the rows pointed to by the common columns.
The 1st row represents the connection list of a user, which is not going
to be more than 100-250 columns in my case, and the 2nd row represents
the members of a group, which may contain a million columns as I said.
I find the mutual connections in a group (the common columns of the
above two rows) and then write to the rows of those users.

Can't I run a batch query asking, in the 2nd row, for all the column
names that I picked up from the 1st row?

Is there any better way?
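
What I have in mind is roughly this (Java/Thrift sketch; the CF name is
illustrative):

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.*;

public class Intersect
{
    // Ask row 2 only for the column names collected from row 1;
    // whatever comes back is the intersection.
    public static List<ColumnOrSuperColumn> common(
            Cassandra.Client client, ByteBuffer row2Key,
            List<ByteBuffer> namesFromRow1) throws Exception
    {
        SlicePredicate names = new SlicePredicate();
        names.setColumn_names(namesFromRow1); // at most ~250 names
        return client.get_slice(row2Key, new ColumnParent("GroupMembers"),
                                names, ConsistencyLevel.QUORUM);
    }
}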

Asil


>
> On Feb 7, 2011, at 12:30 AM, Aklin_81 wrote:
>
>> Thanks Aaron & Shaun,
>>
>> **
>> I think my question might have been unclear to some of you. So I would
>> again explain my problem(& solution which I thought of) for the sake
>> of clarity:-
>>
>> Consider I have 2 rows.  The 1st row contains 60-70 columns and the 2nd row
>> contains hundreds of thousands of columns. Both column sets
>> are valueless. I just need to find out the **common column names**
>> in the two rows. **These two rows are known to me**. So what I plan to
>> do is, I just pick up all **columns (names)** of 1st row (60 -70
>> columns) and just ask for them in 2nd row, whatever column names I get
>> back is my result.
>> Would there be any problem with this solution ? This is how I am
>> expecting to get common column names.
>>
>> Please do not consider it as a JOIN case as it leads to unnecessary
>> confusions, I just need common column names from valueless columns in
>> the two rows.
>>
>> 
>>
>> Aaron, actually the intersection data is very much context based. So
>> say if there are 10 million rows in CF A & 1 million in CF B, then
>> the intersection data would contain 10 million * 1 million rows. This
>> would involve very huge & unaffordable amounts of denormalization.
>> And finding columns in client would require pulling unnecessary
>> columns like pulling 100,000 columns from a row of which only 60-70
>> are required .
>>
>> Shaun, I hope my above clarification has clarified things a bit. Yes,
>> the rows, of which I need to find common columns are known to me.
>>
>>
>> Thank you all,
>> Asil
>>
>>
>> On Mon, Feb 7, 2011 at 3:53 AM, Shaun Cutts  wrote:
>>> In theory, you should be able to do joins by creating an extra column in 
>>> one column family, holding the "foreign key" of the matching row in the 
>>> other family.
>>>
>>> This assumes that the info you are joining on is available in both CFs (is 
>>> not some sort of functional transformation).
>>>
>>> I have just found that the implementation for secondary indexes is not yet 
>>> very close to optimal for more complex "joins" involving multiple indexes;
>>> I'm not sure if that affects you as you didn't say what you are joining on.
>>>
>>> -- Shaun
>>>
>>>
>>> On Feb 6, 2011, at 4:22 PM, Aaron Morton wrote:
>>>
 Is it possible for you to denormalise and write all the intersection
 values? It will depend on how many, I guess.

 The other alternative is to pull back more data than you need and do the
 intersection in code in the client.


 Hope that helps.
 Aaron
 On 7/02/2011, at 7:11 AM, Aklin_81  wrote:

> Hi,
>
> @buddhasystem: yes, that's a well known solution. But obviously, since
> MySQL couldn't satisfy my needs, I am here. My question is in the context
> of Cassandra: is it possible to achieve the intersection result set of
> columns in two rows, in the way I spoke about?
>
> @Edward: yes, that I know, but how does that fit here for obtaining the
> common columns among two rows?
>
> Thanks for your comments..
>
> -Asil
>
>
> On Sun, Feb 6, 2011 at 9:55 PM, Edward Capriolo  
> wrote:
>> On Sun, Feb 6, 2011 at 10:15 AM, buddhasystem  wrote:
>>>
>>> Hello,
>>>
>>> If the amount of data is _that_ small, you'll have a much easier life 
>>> with
>>> MySQL, which supports the "join" procedure -- because that's exactly 
>>> what
>>> you want to achieve.
>>>
>>>
>>> asil klin wrote:

 Hi all,

 I want to procure the intersection of the column sets of two rows (from 2
 different column families).

 To achieve 

Re: time to live rows

2011-02-08 Thread Kallin Nagelberg
I'm trying to set the gc_grace_seconds column family parameter but no
luck.. I got the name of it from the comment in cassandra.yaml:

# - gc_grace_seconds: specifies the time to wait before garbage
#collecting tombstones (deletion markers). defaults to 864000 (10
#days). See http://wiki.apache.org/cassandra/DistributedDeletes

create column family Session
with comparator = UTF8Type
and keys_cached = 1
and memtable_flush_after = 1440
and memtable_throughput = 32
and gc_grace_seconds = 60;

error is 'No enum const class
org.apache.cassandra.cli.CliUserHelp$ColumnFamilyArgument.GC_GRACE_SECONDS'.

Thanks,
-Kal

On Tue, Feb 8, 2011 at 11:02 AM, Sylvain Lebresne  wrote:
>> I hope you don't consider this a hijack of the thread...
>>
>> What I'd like to know is the following:
>>
>> The GC removes TTL rows some time after they expire, at its convenience.
>> But will they stop being returned as soon as they expire? (This is the
>> expected behavior...)
>
> It is the individual columns that have TTLs. When a column expires, it becomes
> a deletion tombstone. Now, a row with tombstones (even only tombstones) will
> show up during range requests. The explanation is
> here: http://wiki.apache.org/cassandra/FAQ#range_ghosts
>
>>
>> On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg
>>  wrote:
>>>
>>> So the empty row will be ultimately removed then? Is there a way to
>>> force the GC to verify this?
>>>
>>> Thanks,
>>> -Kal
>>>
>>> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
>>> > The expired columns were converted into tombstones, which will live for
>>> > the
>>> > GC timeout. The "empty" row will be cleaned up when those tombstones
>>> > are
>>> > removed.
>>> > Returning the empty row is unfortunate... we'd love to find a more
>>> > appropriate solution that might not involve endless scanning.
>>> > See
>>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
>>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
>>> >
>>> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
>>> >  wrote:
>>> >>
>>> >> I also tried forcing a major compaction on the column family using JMX
>>> >> but the row remains.
>>> >>
>>> >> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
>>> >>  wrote:
>>> >> > I tried that but I still see the row coming back on a list
>>> >> >  in the CLI. My concern is that there will be a
>>> >> > pointer
>>> >> > to an empty row for all eternity.
>>> >> >
>>> >> > -Kal
>>> >> >
>>> >> > On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton
>>> >> > 
>>> >> > wrote:
>>> >> >> Deleting all the columns in a row via TTL has the same effect as
>>> >> >> deleting the
>>> >> >> row; the data will physically be removed during compaction.
>>> >> >>
>>> >> >> Aaron
>>> >> >>
>>> >> >>
>>> >> >> On 08 Feb, 2011, at 10:24 AM, Bill Speirs 
>>> >> >> wrote:
>>> >> >>
>>> >> >> I don't think this is supported (but I could be completely wrong).
>>> >> >> However, I'd love to see this functionality as well.
>>> >> >>
>>> >> >> How would one go about requesting such a feature?
>>> >> >>
>>> >> >> Bill-
>>> >> >>
>>> >> >> On Mon, Feb 7, 2011 at 4:15 PM, Kallin Nagelberg
>>> >> >>  wrote:
>>> >> >>> Hey,
>>> >> >>>
>>> >> >>> I have read about the new TTL columns in Cassandra 0.7. In my case
>>> >> >>> I'd
>>> >> >>> like to expire an entire row automatically after a certain amount
>>> >> >>> of
>>> >> >>> time. Is this possible as well?
>>> >> >>>
>>> >> >>> Thanks,
>>> >> >>> -Kal
>>> >> >>>
>>> >> >>
>>> >> >
>>> >
>>> >
>>
>
>


Re: Best Approaches for Developer Integration

2011-02-08 Thread Eric Evans
On Mon, 2011-02-07 at 22:28 -0800, Paul Querna wrote:
> For example, CouchDB has CouchDBX,
> which at least on OSX presents a very easy to use installer, data
> browser, and GUI.  You just run CouchDBX.app, and then your
> application can build out the rest of your data as needed for
> development.
> 
> So, I guess this is coming down to:
>   1) Has anyone built any easy to install packages of Cassandra? 

I'm sure you already know this, but for the benefit of others, users of
Debian-based systems (yes, some of us do develop on Linux :) can apt-get
a package from the project's repository[1].

Installing the package is enough to get a completely functional node
suitable to develop against.
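
For example, with the line from [1] added to /etc/apt/sources.list (and
the repository's signing key imported):

sudo apt-get update
sudo apt-get install cassandra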

[1]: deb http://www.apache.org/dist/cassandra/debian unstable main

-- 
Eric Evans
eev...@rackspace.com



Re: time to live rows

2011-02-08 Thread Sylvain Lebresne
Not very logically, it's actually gc_grace, not gc_grace_seconds, in the CLI.
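
So the statement from your message becomes:

create column family Session
    with comparator = UTF8Type
    and keys_cached = 1
    and memtable_flush_after = 1440
    and memtable_throughput = 32
    and gc_grace = 60;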


On Tue, Feb 8, 2011 at 5:34 PM, Kallin Nagelberg  wrote:

> I'm trying to set the gc_grace_seconds column family parameter but no
> luck.. I got the name of it from the comment in cassandra.yaml:
>
> # - gc_grace_seconds: specifies the time to wait before garbage
> #collecting tombstones (deletion markers). defaults to 864000 (10
> #days). See http://wiki.apache.org/cassandra/DistributedDeletes
>
> create column family Session
>with comparator = UTF8Type
>and keys_cached = 1
>and memtable_flush_after = 1440
>and memtable_throughput = 32
>and gc_grace_seconds = 60;
>
> error is 'No enum const class
>
> org.apache.cassandra.cli.CliUserHelp$ColumnFamilyArgument.GC_GRACE_SECONDS'.
>
> Thanks,
> -Kal
>
> On Tue, Feb 8, 2011 at 11:02 AM, Sylvain Lebresne 
> wrote:
> >> I hope you don't consider this a hijack of the thread...
> >>
> >> What I'd like to know is the following:
> >>
> >> The GC removes TTL rows some time after they expire, at its convenience.
> >> But will they stop being returned as soon as they expire? (This is the
> >> expected behavior...)
> >
> > It is the individual column that have TTL. When a column expires, it
> becomes
> > a delete tombstone. Now, a row with tombstones (even only them) will show
> > during range request. But the explanation is
> > here: http://wiki.apache.org/cassandra/FAQ#range_ghosts
> >
> >>
> >> On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg
> >>  wrote:
> >>>
> >>> So the empty row will be ultimately removed then? Is there a way to
> >>> force the GC to verify this?
> >>>
> >>> Thanks,
> >>> -Kal
> >>>
> >>> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
> >>> > The expired columns were converted into tombstones, which will live
> for
> >>> > the
> >>> > GC timeout. The "empty" row will be cleaned up when those tombstones
> >>> > are
> >>> > removed.
> >>> > Returning the empty row is unfortunate... we'd love to find a more
> >>> > appropriate solution that might not involve endless scanning.
> >>> > See
> >>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
> >>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
> >>> >
> >>> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
> >>> >  wrote:
> >>> >>
> >>> >> I also tried forcing a major compaction on the column family using
> JMX
> >>> >> but the row remains.
> >>> >>
> >>> >> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
> >>> >>  wrote:
> >>> >> > I tried that but I still see the row coming back on a list
> >>> >> >  in the CLI. My concern is that there will be a
> >>> >> > pointer
> >>> >> > to an empty row for all eternity.
> >>> >> >
> >>> >> > -Kal
> >>> >> >
> >>> >> > On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton
> >>> >> > 
> >>> >> > wrote:
> >>> >> >> Deleting all the columns in a row via TTL has the same effect as
> >>> >> >> deleting the
> >>> >> >> row; the data will physically be removed during compaction.
> >>> >> >>
> >>> >> >> Aaron
> >>> >> >>
> >>> >> >>
> >>> >> >> On 08 Feb, 2011, at 10:24 AM, Bill Speirs 
> >>> >> >> wrote:
> >>> >> >>
> >>> >> >> I don't think this is supported (but I could be completely
> wrong).
> >>> >> >> However, I'd love to see this functionality as well.
> >>> >> >>
> >>> >> >> How would one go about requesting such a feature?
> >>> >> >>
> >>> >> >> Bill-
> >>> >> >>
> >>> >> >> On Mon, Feb 7, 2011 at 4:15 PM, Kallin Nagelberg
> >>> >> >>  wrote:
> >>> >> >>> Hey,
> >>> >> >>>
> >>> >> >>> I have read about the new TTL columns in Cassandra 0.7. In my
> case
> >>> >> >>> I'd
> >>> >> >>> like to expire an entire row automatically after a certain
> amount
> >>> >> >>> of
> >>> >> >>> time. Is this possible as well?
> >>> >> >>>
> >>> >> >>> Thanks,
> >>> >> >>> -Kal
> >>> >> >>>
> >>> >> >>
> >>> >> >
> >>> >
> >>> >
> >>
> >
> >
>


Re: OOM during batch_mutate

2011-02-08 Thread Chris Burroughs
On 02/07/2011 06:05 PM, Jonathan Ellis wrote:
> Sounds like the keyspace was created on the 32GB machine, so it
> guessed memtable sizes that are too large when run on the 16GB one.
> Use "update column family" from the cli to cut the throughput and
> operations thresholds in half, or to 1/4 to be cautious.


This guessing is new in 0.7.x, right?  On 0.6.x, a storage-conf.xml +
sstables can be moved among machines with different amounts of RAM
without needing to change anything through the cli?



Re: time to live rows

2011-02-08 Thread Kallin Nagelberg
Thanks, gc_grace works in the CLI.

However, I'm not observing the desired effect. I am setting TTL on a
single column in my column family, and I see the columns disappear
when using 'list Session' (my columnfamily) in the CLI. I created the
column family with gc_grace = 60, and after observing for a few
minutes I am still seeing all the rows come back, none of them with
columns. I was hoping the GC would delete the empty rows.

-Kal

On Tue, Feb 8, 2011 at 11:39 AM, Sylvain Lebresne  wrote:
> Not very logically, it's actually gc_grace, not gc_grace_seconds, in the CLI.
>
> On Tue, Feb 8, 2011 at 5:34 PM, Kallin Nagelberg
>  wrote:
>>
>> I'm trying to set the gc_grace_seconds column family parameter but no
>> luck.. I got the name of it from the comment in cassandra.yaml:
>>
>> #     - gc_grace_seconds: specifies the time to wait before garbage
>> #        collecting tombstones (deletion markers). defaults to 864000 (10
>> #        days). See http://wiki.apache.org/cassandra/DistributedDeletes
>>
>> create column family Session
>>    with comparator = UTF8Type
>>    and keys_cached = 1
>>    and memtable_flush_after = 1440
>>    and memtable_throughput = 32
>>        and gc_grace_seconds = 60;
>>
>> error is 'No enum const class
>>
>> org.apache.cassandra.cli.CliUserHelp$ColumnFamilyArgument.GC_GRACE_SECONDS'.
>>
>> Thanks,
>> -Kal
>>
>> On Tue, Feb 8, 2011 at 11:02 AM, Sylvain Lebresne 
>> wrote:
>> >> I hope you don't consider this a hijack of the thread...
>> >>
>> >> What I'd like to know is the following:
>> >>
>> >> The GC removes TTL rows some time after they expire, at its
>> >> convenience.
>> >> But will they stop being returned as soon as they expire? (This is the
>> >> expected behavior...)
>> >
>> > It is the individual columns that have TTLs. When a column expires, it
>> > becomes
>> > a deletion tombstone. Now, a row with tombstones (even only tombstones) will
>> > show
>> > up during range requests. The explanation is
>> > here: http://wiki.apache.org/cassandra/FAQ#range_ghosts
>> >
>> >>
>> >> On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg
>> >>  wrote:
>> >>>
>> >>> So the empty row will be ultimately removed then? Is there a way to
>> >>> force the GC to verify this?
>> >>>
>> >>> Thanks,
>> >>> -Kal
>> >>>
>> >>> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
>> >>> > The expired columns were converted into tombstones, which will live
>> >>> > for
>> >>> > the
>> >>> > GC timeout. The "empty" row will be cleaned up when those tombstones
>> >>> > are
>> >>> > removed.
>> >>> > Returning the empty row is unfortunate... we'd love to find a more
>> >>> > appropriate solution that might not involve endless scanning.
>> >>> > See
>> >>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
>> >>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
>> >>> >
>> >>> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
>> >>> >  wrote:
>> >>> >>
>> >>> >> I also tried forcing a major compaction on the column family using
>> >>> >> JMX
>> >>> >> but the row remains.
>> >>> >>
>> >>> >> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
>> >>> >>  wrote:
>> >>> >> > I tried that but I still see the row coming back on a list
>> >>> >> >  in the CLI. My concern is that there will be a
>> >>> >> > pointer
>> >>> >> > to an empty row for all eternity.
>> >>> >> >
>> >>> >> > -Kal
>> >>> >> >
>> >>> >> > On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton
>> >>> >> > 
>> >>> >> > wrote:
>> >>> >> >> Deleting all the columns in a row via TTL has the same effect as
>> >>> >> >> deleting the
>> >>> >> >> row; the data will physically be removed during compaction.
>> >>> >> >>
>> >>> >> >> Aaron
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On 08 Feb, 2011, at 10:24 AM, Bill Speirs 
>> >>> >> >> wrote:
>> >>> >> >>
>> >>> >> >> I don't think this is supported (but I could be completely
>> >>> >> >> wrong).
>> >>> >> >> However, I'd love to see this functionality as well.
>> >>> >> >>
>> >>> >> >> How would one go about requesting such a feature?
>> >>> >> >>
>> >>> >> >> Bill-
>> >>> >> >>
>> >>> >> >> On Mon, Feb 7, 2011 at 4:15 PM, Kallin Nagelberg
>> >>> >> >>  wrote:
>> >>> >> >>> Hey,
>> >>> >> >>>
>> >>> >> >>> I have read about the new TTL columns in Cassandra 0.7. In my
>> >>> >> >>> case
>> >>> >> >>> I'd
>> >>> >> >>> like to expire an entire row automatically after a certain
>> >>> >> >>> amount
>> >>> >> >>> of
>> >>> >> >>> time. Is this possible as well?
>> >>> >> >>>
>> >>> >> >>> Thanks,
>> >>> >> >>> -Kal
>> >>> >> >>>
>> >>> >> >>
>> >>> >> >
>> >>> >
>> >>> >
>> >>
>> >
>> >
>
>


Re: Cassandra memory consumption

2011-02-08 Thread Jonathan Ellis
I missed the part where you explained where you're getting your numbers from.
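
If it's the RSS column from ps: that's in kilobytes and counts more than
the heap -- permgen, thread stacks, and native/NIO buffers all sit
outside -Xmx. On a Sun JDK, something like

jmap -heap 1739

(using the pid from the ps line) shows what the heap itself is doing.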

On Tue, Feb 8, 2011 at 9:32 AM, Victor Kabdebon
 wrote:
> It is really weird that I am the only one to have this issue.
> I restarted Cassandra today and already the memory consumption is over the
> limit:
>
> root  1739  4.0 24.5 664968 494996 pts/4   SLl  15:51   0:12
> /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dstorage-config=bin/../conf -cp
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
> org.apache.cassandra.thrift.CassandraDaemon
>
> It is really an annoying problem if we cannot really foresee memory
> consumption.
>
> Best regards,
> Victor K
>
> 2011/2/8 Victor Kabdebon 
>>
>> Dear all,
>>
>> Sorry to come back again to this point but I am really worried about
>> Cassandra memory consumption. I have a single machine that runs one
>> Cassandra server. There is almost no data on it but I see a crazy memory
>> consumption and it doesn't care at all about the instructions...
>> Note that I am not using mmap but "Standard", I also use JNA (inside the lib
>> folder), and I am running on Debian 5 64-bit, so a pretty normal configuration.
>> I also use Cassandra 0.6.8.
>>
>>
>> Here is the information I gathered on Cassandra:
>>
>> 105  16765  0.1 34.1 1089424 687476 ?  Sl   Feb02  14:58
>> /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
>> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
>> org.apache.cassandra.thrift.CassandraDaemon
>>
>> result of nodetool info :
>>
>> 116024732779488843382476400091948985708
>> Load : 1,94 MB
>> Generation No    : 1296673772
>> Uptime (seconds) : 467550
>> Heap Memory (MB) : 120,26 / 253,94
>>
>>
>> I have about 21 column families, and none of them holds a lot of information
>> (as you can see, I have 2 MB of text, which is really small). Even if I set
>> Xmx at 256 there is 687M of memory used. Where does this memory come from? Bad
>> garbage collection? Something I am not aware of?
>> Thank you for your help, I really need to get rid of this problem.
>>
>> Best regards,
>> Victor Kabdebon
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Best Approaches for Developer Integration

2011-02-08 Thread Jonathan Ellis
On Tue, Feb 8, 2011 at 10:38 AM, Eric Evans  wrote:
> I'm sure you already know this, but for the benefit of others, users of
> Debian-based systems (yes, some of us do develop on Linux :) can apt-get
> a package from the projects repository[1].
>
> Installing the package is enough to get a completely functional node
> suitable to develop against.
>
> [1]: deb http://www.apache.org/dist/cassandra/debian unstable main

And rpms are available from http://rpm.riptano.com/

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Cassandra memory consumption

2011-02-08 Thread Ryan King
Which jvm and version are you using?

-ryan

On Tue, Feb 8, 2011 at 7:32 AM, Victor Kabdebon
 wrote:
> It is really weird that I am the only one to have this issue.
> I restarted Cassandra today and already the memory consumption is over the
> limit:
>
> root  1739  4.0 24.5 664968 494996 pts/4   SLl  15:51   0:12
> /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dstorage-config=bin/../conf -cp
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
> org.apache.cassandra.thrift.CassandraDaemon
>
> It is really an annoying problem if we cannot really foresee memory
> consumption.
>
> Best regards,
> Victor K
>
> 2011/2/8 Victor Kabdebon 
>>
>> Dear all,
>>
>> Sorry to come back again to this point but I am really worried about
>> Cassandra memory consumption. I have a single machine that runs one
>> Cassandra server. There is almost no data on it but I see a crazy memory
>> consumption and it doesn't care at all about the instructions...
>> Note that I am not using mmap, but "Standard", I use also JNA (inside lib
>> folder), i am running on debian 5 64 bits, so a pretty normal configuration.
>> I also use Cassandra 0.6.8.
>>
>>
>> Here are the informations I gathered on Cassandra :
>>
>> 105  16765  0.1 34.1 1089424 687476 ?  Sl   Feb02  14:58
>> /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
>> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
>> org.apache.cassandra.thrift.CassandraDaemon
>>
>> result of nodetool info :
>>
>> 116024732779488843382476400091948985708
>> Load : 1,94 MB
>> Generation No    : 1296673772
>> Uptime (seconds) : 467550
>> Heap Memory (MB) : 120,26 / 253,94
>>
>>
>> I have about 21 column families, none of them have a lot of information (
>> as you see I have 2 Mb of text which is really small). Even if I set Xmx at
>> 256 there is 687M of memory used. Where does this memory come from ? Bad
>> garbage collection ? Something that I ignore ?
>> Thank you for your help I really need to get rid of that problem.
>>
>> Best regards,
>> Victor Kabdebon
>
>



-- 
@rk


Re: OOM during batch_mutate

2011-02-08 Thread Jonathan Ellis
No, on 0.6 copying settings for a 32GB machine to a 16GB machine would
also be a great way to OOM.  The difference is that you had to set
memtable thresholds globally in the xml file in 0.6, instead of being
able to do it per-columnfamily from the cli.
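
For example, with illustrative values (the CF name is hypothetical; the
attribute names are the same 0.7 CLI attributes used elsewhere in this
digest):

    update column family Data
        with memtable_throughput = 16
        and memtable_operations = 0.15;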

On Tue, Feb 8, 2011 at 10:40 AM, Chris Burroughs
 wrote:
> On 02/07/2011 06:05 PM, Jonathan Ellis wrote:
>> Sounds like the keyspace was created on the 32GB machine, so it
>> guessed memtable sizes that are too large when run on the 16GB one.
>> Use "update column family" from the cli to cut the throughput and
>> operations thresholds in half, or to 1/4 to be cautious.
>
>
> This guessing is new in 0.7.x right?  On a 0.6.x storage-conf.xml +
> sstables can be moved among machines with different amounts of RAM
> without needing to change anything through the cli?
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Cassandra memory consumption

2011-02-08 Thread Victor Kabdebon
Sorry Jonathan:

Most of this information was taken using the command:

sudo ps aux | grep cassandra

For the nodetool information it is:

/bin/nodetool --host localhost --port 8081 info


Regards,

Victor K.


2011/2/8 Jonathan Ellis 

> I missed the part where you explained where you're getting your numbers
> from.
>
> On Tue, Feb 8, 2011 at 9:32 AM, Victor Kabdebon
>  wrote:
> > It is really weird that I am the only one to have this issue.
> > I restarted Cassandra today and already the memory consumption is over the
> > limit :
> >
> > root  1739  4.0 24.5 664968 494996 pts/4   SLl  15:51   0:12
> > /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC
> > -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1
> > -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> > -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
> > -Dcom.sun.management.jmxremote.ssl=false
> > -Dcom.sun.management.jmxremote.authenticate=false
> > -Dstorage-config=bin/../conf -cp
> >
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
> > org.apache.cassandra.thrift.CassandraDaemon
> >
> > It is really an annoying problem if we cannot really foresee memory
> > consumption.
> >
> > Best regards,
> > Victor K
> >
> > 2011/2/8 Victor Kabdebon 
> >>
> >> Dear all,
> >>
> >> Sorry to come back again to this point but I am really worried about
> >> Cassandra memory consumption. I have a single machine that runs one
> >> Cassandra server. There is almost no data on it but I see a crazy memory
> >> consumption and it doesn't care at all about the instructions...
> >> Note that I am not using mmap, but "Standard", I use also JNA (inside
> lib
> >> folder), i am running on debian 5 64 bits, so a pretty normal
> configuration.
> >> I also use Cassandra 0.6.8.
> >>
> >>
> >> Here are the informations I gathered on Cassandra :
> >>
> >> 105  16765  0.1 34.1 1089424 687476 ?  Sl   Feb02  14:58
> >> /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC
> >> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1
> >> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> >> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
> >> -Dcom.sun.management.jmxremote.ssl=false
> >> -Dcom.sun.management.jmxremote.authenticate=false
> >> -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
> >>
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
> >> org.apache.cassandra.thrift.CassandraDaemon
> >>
> >> result of nodetool info :
> >>
> >> 116024732779488843382476400091948985708
> >> Load : 1,94 MB
> >> Generation No: 1296673772
> >> Uptime (seconds) : 467550
> >> Heap Memory (MB) : 120,26 / 253,94
> >>
> >>
> >> I have about 21 column families, none of them have a lot of information
> (
> >> as you see I have 2 Mb of text which is really small). Even if I set Xmx
> at
> >> 256 there is 687M of memory used. Where does this memory come from ? Bad
> >> garbage collection ? Something that I ignore ?
> >> Thank you for your help I really need to get rid of that problem.
> >>
> >> Best regards,
> >> Victor Kabdebon
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder

Re: Cassandra memory consumption

2011-02-08 Thread Victor Kabdebon
Information on the system :

*Debian 5*
*Jvm :*
victor@testhost:~/database/apache-cassandra-0.6.6$ java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)

*RAM:* 2 GB


2011/2/8 Victor Kabdebon 

> Sorry Jonathan :
>
> Most of this information was taken using the command:
>
> sudo ps aux | grep cassandra
>
> For the nodetool information it is :
>
> /bin/nodetool --host localhost --port 8081 info
>
>
> Regards,
>
> Victor K.
>
>
> 2011/2/8 Jonathan Ellis 
>
> I missed the part where you explained where you're getting your numbers
>> from.
>>
>> On Tue, Feb 8, 2011 at 9:32 AM, Victor Kabdebon
>>  wrote:
>> > It is really weird that I am the only one to have this issue.
>> > I restarted Cassandra today and already the memory consumption is over the
>> > limit :
>> >
>> > root  1739  4.0 24.5 664968 494996 pts/4   SLl  15:51   0:12
>> > /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
>> -XX:+UseConcMarkSweepGC
>> > -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
>> -XX:MaxTenuringThreshold=1
>> > -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>> > -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
>> > -Dcom.sun.management.jmxremote.ssl=false
>> > -Dcom.sun.management.jmxremote.authenticate=false
>> > -Dstorage-config=bin/../conf -cp
>> >
>> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
>> > org.apache.cassandra.thrift.CassandraDaemon
>> >
>> > It is really an annoying problem if we cannot really foresee memory
>> > consumption.
>> >
>> > Best regards,
>> > Victor K
>> >
>> > 2011/2/8 Victor Kabdebon 
>> >>
>> >> Dear all,
>> >>
>> >> Sorry to come back again to this point but I am really worried about
>> >> Cassandra memory consumption. I have a single machine that runs one
>> >> Cassandra server. There is almost no data on it but I see a crazy
>> memory
>> >> consumption and it doesn't care at all about the instructions...
>> >> Note that I am not using mmap, but "Standard", I use also JNA (inside
>> lib
>> >> folder), i am running on debian 5 64 bits, so a pretty normal
>> configuration.
>> >> I also use Cassandra 0.6.8.
>> >>
>> >>
>> >> Here are the informations I gathered on Cassandra :
>> >>
>> >> 105  16765  0.1 34.1 1089424 687476 ?  Sl   Feb02  14:58
>> >> /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
>> -XX:+UseConcMarkSweepGC
>> >> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
>> -XX:MaxTenuringThreshold=1
>> >> -XX:CMSInitiatingOccupancyFraction=75
>> -XX:+UseCMSInitiatingOccupancyOnly
>> >> -XX:+HeapDumpOnOutOfMemoryError
>> -Dcom.sun.management.jmxremote.port=8081
>> >> -Dcom.sun.management.jmxremote.ssl=false
>> >> -Dcom.sun.management.jmxremote.authenticate=false
>> >> -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
>> >>
>> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
>> >> org.apache.cassandra.thrift.CassandraDaemon
>> >>
>> >> result of nodetool info :
>> >>
>> >> 116024732779488843382476400091948985708
>> >> Load : 1,94 MB
>> >> Generation No: 1296673772
>> >> Uptime (seconds) : 467550
>> >> Heap Memory (MB) : 120,26 / 253,94
>> >>
>> >>
>> >> I have about 21 column families, none of them have a lot of information
>> (
>> >> as yo

Re: time to live rows

2011-02-08 Thread Sylvain Lebresne
Did you force a major compaction (with jconsole or nodetool) after gc_grace
has elapsed?

On Tue, Feb 8, 2011 at 5:46 PM, Kallin Nagelberg  wrote:

> Thanks, gc_grace works in the CLI.
>
> However, I'm not observing the desired effect. I am setting TTL on a
> single column in my column family, and I see the columns disappear
> when using 'list Session' (my columnfamily) in the CLI. I created the
> column family with gc_grace = 60, and after observing for a few
> minutes I am still seeing all the rows come back, none of them with
> columns. I was hoping the GC would delete the empty rows.
>
> -Kal
>
> On Tue, Feb 8, 2011 at 11:39 AM, Sylvain Lebresne 
> wrote:
> > Not very logically, it's actually gc_grace, not gc_grace_seconds, in the
> CLI.
> >
> > On Tue, Feb 8, 2011 at 5:34 PM, Kallin Nagelberg
> >  wrote:
> >>
> >> I'm trying to set the gc_grace_seconds column family parameter but no
> >> luck.. I got the name of it from the comment in cassandra.yaml:
> >>
> >> # - gc_grace_seconds: specifies the time to wait before garbage
> >> #collecting tombstones (deletion markers). defaults to 864000
> (10
> >> #days). See http://wiki.apache.org/cassandra/DistributedDeletes
> >>
> >> create column family Session
> >>with comparator = UTF8Type
> >>and keys_cached = 1
> >>and memtable_flush_after = 1440
> >>and memtable_throughput = 32
> >>and gc_grace_seconds = 60;
> >>
> >> error is 'No enum const class
> >>
> >>
> org.apache.cassandra.cli.CliUserHelp$ColumnFamilyArgument.GC_GRACE_SECONDS'.
> >>
> >> Thanks,
> >> -Kal
> >>
> >> On Tue, Feb 8, 2011 at 11:02 AM, Sylvain Lebresne  >
> >> wrote:
> >> >> I hope you don't consider this a hijack of the thread...
> >> >>
> >> >> What I'd like to know is the following:
> >> >>
> >> >> The GC removes TTL rows some time after they expire, at its
> >> >> convenience.
> >> >> But will they stop being returned as soon as they expire? (This is
> the
> >> >> expected behavior...)
> >> >
> >> > It is the individual column that has the TTL. When a column expires, it
> >> > becomes
> >> > a delete tombstone. Now, a row with tombstones (even only them) will
> >> > show
> >> > during range request. But the explanation is
> >> > here: http://wiki.apache.org/cassandra/FAQ#range_ghosts
> >> >
> >> >>
> >> >> On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg
> >> >>  wrote:
> >> >>>
> >> >>> So the empty row will be ultimately removed then? Is there a way to
> >> >>> for the GC to verify this?
> >> >>>
> >> >>> Thanks,
> >> >>> -Kal
> >> >>>
> >> >>> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
> >> >>> > The expired columns were converted into tombstones, which will
> live
> >> >>> > for
> >> >>> > the
> >> >>> > GC timeout. The "empty" row will be cleaned up when those
> tombstones
> >> >>> > are
> >> >>> > removed.
> >> >>> > Returning the empty row is unfortunate... we'd love to find a more
> >> >>> > appropriate solution that might not involve endless scanning.
> >> >>> > See
> >> >>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
> >> >>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
> >> >>> >
> >> >>> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
> >> >>> >  wrote:
> >> >>> >>
> >> >>> >> I also tried forcing a major compaction on the column family
> using
> >> >>> >> JMX
> >> >>> >> but the row remains.
> >> >>> >>
> >> >>> >> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
> >> >>> >>  wrote:
> >> >>> >> > I tried that but I still see the row coming back on a list
> >> >>> >> >  in the CLI. My concern is that there will be a
> >> >>> >> > pointer
> >> >>> >> > to an empty row for all eternity.
> >> >>> >> >
> >> >>> >> > -Kal
> >> >>> >> >
> >> >>> >> > On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton
> >> >>> >> > 
> >> >>> >> > wrote:
> >> >>> >> >> Deleting all the columns in a row via TTL has the same effect
> >> >>> >> >> as
> >> >>> >> >> deleting the
> >> >>> >> >> row, the data will physically be removed during compaction.
> >> >>> >> >>
> >> >>> >> >> Aaron
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> On 08 Feb, 2011,at 10:24 AM, Bill Speirs <
> bill.spe...@gmail.com>
> >> >>> >> >> wrote:
> >> >>> >> >>
> >> >>> >> >> I don't think this is supported (but I could be completely
> >> >>> >> >> wrong).
> >> >>> >> >> However, I'd love to see this functionality as well.
> >> >>> >> >>
> >> >>> >> >> How would one go about requesting such a feature?
> >> >>> >> >>
> >> >>> >> >> Bill-
> >> >>> >> >>
> >> >>> >> >> On Mon, Feb 7, 2011 at 4:15 PM, Kallin Nagelberg
> >> >>> >> >>  wrote:
> >> >>> >> >>> Hey,
> >> >>> >> >>>
> >> >>> >> >>> I have read about the new TTL columns in Cassandra 0.7. In my
> >> >>> >> >>> case
> >> >>> >> >>> I'd
> >> >>> >> >>> like to expire an entire row automatically after a certain
> >> >>> >> >>> amount
> >> >>> >> >>> of
> >> >>> >> >>> time. Is this possible as well?
> >> >>> >> >>>
> >> >>> >> >>> Thanks,
> >> >>> >> >>> -Kal
> >> >>> >> >>>
> >> >>> >> >>
> >> >>> >> >
> >> >>> >
> >> 

Re: time to live rows

2011-02-08 Thread Kallin Nagelberg
Yes I did, on the org.apache.cassandra.db.ColumnFamilies.Main.Session object.

-Kal

On Tue, Feb 8, 2011 at 12:00 PM, Sylvain Lebresne  wrote:
> Did you force a major compaction (with jconsole or nodetool) after gc_grace
> has elapsed ?
> On Tue, Feb 8, 2011 at 5:46 PM, Kallin Nagelberg
>  wrote:
>>
>> Thanks, gc_grace works in the CLI.
>>
>> However, I'm not observing the desired effect. I am setting TTL on a
>> single column in my column family, and I see the columns disappear
>> when using 'list Session' (my columnfamily) in the CLI. I created the
>> column family with gc_grace = 60, and after observing for a few
>> minutes I am still seeing all the rows come back, none of them with
>> columns. I was hoping the GC would delete the empty rows.
>>
>> -Kal
>>
>> On Tue, Feb 8, 2011 at 11:39 AM, Sylvain Lebresne 
>> wrote:
>> > Not very logically, It's actually gc_grace, not gc_grace_seconds in the
>> > CLI.
>> >
>> > On Tue, Feb 8, 2011 at 5:34 PM, Kallin Nagelberg
>> >  wrote:
>> >>
>> >> I'm trying to set the gc_grace_seconds column family parameter but no
>> >> luck.. I got the name of it from the comment in cassandra.yaml:
>> >>
>> >> #     - gc_grace_seconds: specifies the time to wait before garbage
>> >> #        collecting tombstones (deletion markers). defaults to 864000
>> >> (10
>> >> #        days). See http://wiki.apache.org/cassandra/DistributedDeletes
>> >>
>> >> create column family Session
>> >>    with comparator = UTF8Type
>> >>    and keys_cached = 1
>> >>    and memtable_flush_after = 1440
>> >>    and memtable_throughput = 32
>> >>        and gc_grace_seconds = 60;
>> >>
>> >> error is 'No enum const class
>> >>
>> >>
>> >> org.apache.cassandra.cli.CliUserHelp$ColumnFamilyArgument.GC_GRACE_SECONDS'.
>> >>
>> >> Thanks,
>> >> -Kal
>> >>
>> >> On Tue, Feb 8, 2011 at 11:02 AM, Sylvain Lebresne
>> >> 
>> >> wrote:
>> >> >> I hope you don't consider this a hijack of the thread...
>> >> >>
>> >> >> What I'd like to know is the following:
>> >> >>
>> >> >> The GC removes TTL rows some time after they expire, at its
>> >> >> convenience.
>> >> >> But will they stop being returned as soon as they expire? (This is
>> >> >> the
>> >> >> expected behavior...)
>> >> >
>> >> > It is the individual column that have TTL. When a column expires, it
>> >> > becomes
>> >> > a delete tombstone. Now, a row with tombstones (even only them) will
>> >> > show
>> >> > during range request. But the explanation is
>> >> > here: http://wiki.apache.org/cassandra/FAQ#range_ghosts
>> >> >
>> >> >>
>> >> >> On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg
>> >> >>  wrote:
>> >> >>>
>> >> >>> So the empty row will be ultimately removed then? Is there a way to
>> >> >>> for the GC to verify this?
>> >> >>>
>> >> >>> Thanks,
>> >> >>> -Kal
>> >> >>>
>> >> >>> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
>> >> >>> > The expired columns were converted into tombstones, which will
>> >> >>> > live
>> >> >>> > for
>> >> >>> > the
>> >> >>> > GC timeout. The "empty" row will be cleaned up when those
>> >> >>> > tombstones
>> >> >>> > are
>> >> >>> > removed.
>> >> >>> > Returning the empty row is unfortunate... we'd love to find a
>> >> >>> > more
>> >> >>> > appropriate solution that might not involve endless scanning.
>> >> >>> > See
>> >> >>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
>> >> >>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
>> >> >>> >
>> >> >>> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
>> >> >>> >  wrote:
>> >> >>> >>
>> >> >>> >> I also tried forcing a major compaction on the column family
>> >> >>> >> using
>> >> >>> >> JMX
>> >> >>> >> but the row remains.
>> >> >>> >>
>> >> >>> >> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
>> >> >>> >>  wrote:
>> >> >>> >> > I tried that but I still see the row coming back on a list
>> >> >>> >> >  in the CLI. My concern is that there will be a
>> >> >>> >> > pointer
>> >> >>> >> > to an empty row for all eternity.
>> >> >>> >> >
>> >> >>> >> > -Kal
>> >> >>> >> >
>> >> >>> >> > On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton
>> >> >>> >> > 
>> >> >>> >> > wrote:
>> >> >>> >> >> Deleting all the columns in a row via TTL has the same effect
>> >> >>> >> >> as
>> >> >>> >> >> deleting the
>> >> >>> >> >> row, the data will physically be removed during compaction.
>> >> >>> >> >>
>> >> >>> >> >> Aaron
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> On 08 Feb, 2011,at 10:24 AM, Bill Speirs
>> >> >>> >> >> 
>> >> >>> >> >> wrote:
>> >> >>> >> >>
>> >> >>> >> >> I don't think this is supported (but I could be completely
>> >> >>> >> >> wrong).
>> >> >>> >> >> However, I'd love to see this functionality as well.
>> >> >>> >> >>
>> >> >>> >> >> How would one go about requesting such a feature?
>> >> >>> >> >>
>> >> >>> >> >> Bill-
>> >> >>> >> >>
>> >> >>> >> >> On Mon, Feb 7, 2011 at 4:15 PM, Kallin Nagelberg
>> >> >>> >> >>  wrote:
>> >> >>> >> >>> Hey,
>> >> >>> >> >>>
>> >> >>> >> >>> I have read about the new TTL columns in Cassan

Re: Can serialized objects in columns serve as ersatz superCFs?

2011-02-08 Thread Dave Revell
Yes, this works well for me. I have no SCFs but many columns contain JSON.

Depending on your time/space/compatibility tradeoffs you can obviously pick
your own serialization method.

Best,
Dave
On Feb 8, 2011 4:33 AM, "buddhasystem"  wrote:
>
> Seeing that discussion here about indexes not supported in superCFs, and
less
> than clear future of superCFs altogether, I was thinking about getting a
> modicum of same functionality with serialized objects inside columns. This
> way the column key becomes sort of analog of supercolumn key, and I handle
> the dictionaries I receive in the client.
>
> Does this sound OK?
>
> --
> View this message in context:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Can-serialized-objects-in-columns-serve-as-ersatz-superCFs-tp6003775p6003775.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
Nabble.com.
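
A minimal sketch of this pattern with pycassa and the standard json module
(keyspace and column family names are assumptions):

    import json
    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1')     # assumed keyspace
    users = pycassa.ColumnFamily(pool, 'Users')    # assumed CF

    # The column key plays the role of the supercolumn key; the value is a
    # serialized dictionary instead of a set of subcolumns.
    address = {'zip': '10001', 'street': 'Main St'}
    users.insert('user42', {'address': json.dumps(address)})

    stored = json.loads(users.get('user42', columns=['address'])['address'])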


Re: Can serialized objects in columns serve as ersatz superCFs?

2011-02-08 Thread buddhasystem

Thanks for the comment! In my case, I want to store various time slices as
indexes, so the content can be serialized as comma-separated concatenation
of unique object IDs. Example: on 20101204, multiple clouds experienced a
variety of errors in job execution. In addition, multiple users ran (or
failed) on different clouds. If I combine user id, cloud id and error code,
I can relatively easily drill for errors on a particular date. So each CF
maps to a date, and each column in it is a compound index.

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Can-serialized-objects-in-columns-serve-as-ersatz-superCFs-tp6003775p6004834.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.
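
One way the compound column names might look in practice (a sketch that uses
a row per date; every identifier here is an assumption):

    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1')                  # assumed keyspace
    errors_by_day = pycassa.ColumnFamily(pool, 'ErrorsByDay')   # assumed CF

    # One row per date; each column name packs user id, cloud id and error
    # code, and the column value is left empty (a valueless column).
    name = '%s:%s:%s' % ('user42', 'cloudA', 'E17')
    errors_by_day.insert('20101204', {name: ''})

    # Drilling for one user's errors on a date is then a prefix slice:
    cols = errors_by_day.get('20101204', column_start='user42:',
                             column_finish='user42:~')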


Re: Best Approaches for Developer Integration

2011-02-08 Thread Sal Fuentes
Perhaps some of you may already be aware, but for the benefit of others:

1) https://github.com/fauna/cassandra does have a cassandra_helper script
which will download and install Cassandra for development/testing purposes
(although the cassandra_helper script might need to be updated to use 0.7)

2) For people on the Mac, there is a Homebrew (
http://mxcl.github.com/homebrew/) formula/package available for Cassandra
which makes installing simply: brew install cassandra

On Tue, Feb 8, 2011 at 8:49 AM, Jonathan Ellis  wrote:

> On Tue, Feb 8, 2011 at 10:38 AM, Eric Evans  wrote:
> > I'm sure you already know this, but for the benefit of others, users of
> > Debian-based systems (yes, some of us do develop on Linux :) can apt-get
> > a package from the projects repository[1].
> >
> > Installing the package is enough to get a completely functional node
> > suitable to develop against.
> >
> > [1]: deb http://www.apache.org/dist/cassandra/debian unstable main
>
> And rpms are available from http://rpm.riptano.com/
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>



-- 
Salvador Fuentes Jr.


Re: Subcolumn Indexing

2011-02-08 Thread Benjamin Coverston


> Does this just mean the exhaustive list of the column names, not all
> the values?

No, this means the entire supercolumn, names and values. When the client
tries to access any subcolumn in the supercolumn it has to read the
entire supercolumn.

> So if I have a super column that has a map of keys that only contain
> two columns max each, this shouldn't really be a performance concern,
> correct? This becomes an issue when you have lots of subcolumns, if I'm
> reading this correctly? I'm looking at using the super column as a
> good way to cluster data; say I was storing home addresses, I might use
> the zipcode as the super column if I cared mostly about accessing data
> by logical area, for instance. Thanks for the help.

That is one way to logically group the values, but I think that a
simpler solution may be to store the home address as a single row in a
column family, then use a dynamic column family to store references to
those addresses per zip code.

> jt

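A sketch of the two column families Ben describes, in pycassa (keyspace and
CF names are assumptions):

    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1')              # assumed keyspace
    addresses = pycassa.ColumnFamily(pool, 'Addresses')     # one row per address
    by_zip = pycassa.ColumnFamily(pool, 'AddressesByZip')   # one row per zip code

    addresses.insert('addr-1', {'street': '1 Main St', 'city': 'Springfield'})

    # Dynamic CF: each column name is a reference to an address row, valueless.
    by_zip.insert('62704', {'addr-1': ''})

    # All addresses in a zip code: read the references, then the rows.
    refs = list(by_zip.get('62704').keys())
    homes = addresses.multiget(refs)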



Re: time to live rows

2011-02-08 Thread Kallin Nagelberg
I'm thinking if this row expiry notion doesn't pan out then I might
create a 'lastAccessed' column with a secondary index (i think that's
right) on it. Then I can periodically run a query to find all
lastAccessed columns less than a certain value and manually delete
them. Sound reasonable?

-Kal
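
A sketch of that sweep in pycassa, assuming KEYS indexes exist on both
columns; note that a 0.7 index clause needs at least one equality
expression, so this sketch hangs the range predicate off a constant,
hypothetical 'bucket' column:

    import time
    import pycassa
    from pycassa.index import (create_index_clause, create_index_expression,
                               IndexOperator)

    pool = pycassa.ConnectionPool('Keyspace1')     # assumed keyspace
    session = pycassa.ColumnFamily(pool, 'Session')

    cutoff = int(time.time()) - 30 * 60            # e.g. idle for 30 minutes

    clause = create_index_clause(
        [create_index_expression('bucket', 'all'),  # EQ is the default operator
         create_index_expression('lastAccessed', cutoff, IndexOperator.LTE)],
        count=1000)

    # Delete every session that has not been touched since the cutoff.
    for key, _ in session.get_indexed_slices(clause):
        session.remove(key)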

On Tue, Feb 8, 2011 at 12:09 PM, Kallin Nagelberg
 wrote:
> Yes I did, on the org.apache.cassandra.db.ColumnFamilies.Main.Session object.
>
> -Kal
>
> On Tue, Feb 8, 2011 at 12:00 PM, Sylvain Lebresne  
> wrote:
>> Did you force a major compaction (with jconsole or nodetool) after gc_grace
>> has elapsed ?
>> On Tue, Feb 8, 2011 at 5:46 PM, Kallin Nagelberg
>>  wrote:
>>>
>>> Thanks, gc_grace works in the CLI.
>>>
>>> However, I'm not observing the desired effect. I am setting TTL on a
>>> single column in my column family, and I see the columns disappear
>>> when using 'list Session' (my columnfamily) in the CLI. I created the
>>> column family with gc_grace = 60, and after observing for a few
>>> minutes I am still seeing all the rows come back, none of them with
>>> columns. I was hoping the GC would delete the empty rows.
>>>
>>> -Kal
>>>
>>> On Tue, Feb 8, 2011 at 11:39 AM, Sylvain Lebresne 
>>> wrote:
>>> > Not very logically, It's actually gc_grace, not gc_grace_seconds in the
>>> > CLI.
>>> >
>>> > On Tue, Feb 8, 2011 at 5:34 PM, Kallin Nagelberg
>>> >  wrote:
>>> >>
>>> >> I'm trying to set the gc_grace_seconds column family parameter but no
>>> >> luck.. I got the name of it from the comment in cassandra.yaml:
>>> >>
>>> >> #     - gc_grace_seconds: specifies the time to wait before garbage
>>> >> #        collecting tombstones (deletion markers). defaults to 864000
>>> >> (10
>>> >> #        days). See http://wiki.apache.org/cassandra/DistributedDeletes
>>> >>
>>> >> create column family Session
>>> >>    with comparator = UTF8Type
>>> >>    and keys_cached = 1
>>> >>    and memtable_flush_after = 1440
>>> >>    and memtable_throughput = 32
>>> >>        and gc_grace_seconds = 60;
>>> >>
>>> >> error is 'No enum const class
>>> >>
>>> >>
>>> >> org.apache.cassandra.cli.CliUserHelp$ColumnFamilyArgument.GC_GRACE_SECONDS'.
>>> >>
>>> >> Thanks,
>>> >> -Kal
>>> >>
>>> >> On Tue, Feb 8, 2011 at 11:02 AM, Sylvain Lebresne
>>> >> 
>>> >> wrote:
>>> >> >> I hope you don't consider this a hijack of the thread...
>>> >> >>
>>> >> >> What I'd like to know is the following:
>>> >> >>
>>> >> >> The GC removes TTL rows some time after they expire, at its
>>> >> >> convenience.
>>> >> >> But will they stop being returned as soon as they expire? (This is
>>> >> >> the
>>> >> >> expected behavior...)
>>> >> >
>>> >> > It is the individual column that have TTL. When a column expires, it
>>> >> > becomes
>>> >> > a delete tombstone. Now, a row with tombstones (even only them) will
>>> >> > show
>>> >> > during range request. But the explanation is
>>> >> > here: http://wiki.apache.org/cassandra/FAQ#range_ghosts
>>> >> >
>>> >> >>
>>> >> >> On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg
>>> >> >>  wrote:
>>> >> >>>
>>> >> >>> So the empty row will be ultimately removed then? Is there a way to
>>> >> >>> for the GC to verify this?
>>> >> >>>
>>> >> >>> Thanks,
>>> >> >>> -Kal
>>> >> >>>
>>> >> >>> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
>>> >> >>> > The expired columns were converted into tombstones, which will
>>> >> >>> > live
>>> >> >>> > for
>>> >> >>> > the
>>> >> >>> > GC timeout. The "empty" row will be cleaned up when those
>>> >> >>> > tombstones
>>> >> >>> > are
>>> >> >>> > removed.
>>> >> >>> > Returning the empty row is unfortunate... we'd love to find a
>>> >> >>> > more
>>> >> >>> > appropriate solution that might not involve endless scanning.
>>> >> >>> > See
>>> >> >>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
>>> >> >>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
>>> >> >>> >
>>> >> >>> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
>>> >> >>> >  wrote:
>>> >> >>> >>
>>> >> >>> >> I also tried forcing a major compaction on the column family
>>> >> >>> >> using
>>> >> >>> >> JMX
>>> >> >>> >> but the row remains.
>>> >> >>> >>
>>> >> >>> >> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
>>> >> >>> >>  wrote:
>>> >> >>> >> > I tried that but I still see the row coming back on a list
>>> >> >>> >> >  in the CLI. My concern is that there will be a
>>> >> >>> >> > pointer
>>> >> >>> >> > to an empty row for all eternity.
>>> >> >>> >> >
>>> >> >>> >> > -Kal
>>> >> >>> >> >
>>> >> >>> >> > On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton
>>> >> >>> >> > 
>>> >> >>> >> > wrote:
>>> >> >>> >> >> Deleting all the columns in a row via TTL has the same effect
>>> >> >>> >> >> as
>>> >> >>> >> >> deleting the
>>> >> >>> >> >> row, the data will physically be removed during compaction.
>>> >> >>> >> >>
>>> >> >>> >> >> Aaron
>>> >> >>> >> >>
>>> >> >>> >> >>
>>> >> >>> >> >> On 08 Feb, 2011,at 10:24 AM, Bill Speirs
>>> >> >>> >> >> 
>>> >> >>> >> >> wrote:
>>> >> >>> >> >>
>>> 

Re: Does variation in no of columns in rows over the column family has any performance impact ?

2011-02-08 Thread Aaron Morton
For completeness there are a couple of things in the config file that may be
interesting if you run into issues.

- column_index_size_in_kb defines how big a row has to get before an index is
  written for the row. Without an index the entire row must be read to find a
  column.
- in_memory_compaction_limit_in_mb defines the maximum size of row that can be
  compacted in memory; larger rows go through a slower compaction process.
- sliced_buffer_size_in_kb controls the size of the buffer when slicing
  columns.

Aaron

On 08 Feb, 2011, at 08:03 AM, Daniel Doubleday  wrote:

It depends a little on your write pattern:

- Wide rows tend to get distributed over more sstables so more disk reads are necessary. This will become noticeable when you have high io load and reads actually hit the discs.
- If you delete a lot, slice query performance might suffer. Extreme example: create 2M cols, delete the first 1M and then ask for the first 10.


On Feb 7, 2011, at 7:07 AM, Aditya Narayan wrote:

> Does huge variation in no. of columns in rows, over the column family
> has *any* impact on the performance ?
> 
> Can I have like just 100 columns in some rows and like hundred
> thousands of columns in another set of rows, without any downsides ?



RE: time to live rows

2011-02-08 Thread Jeremiah Jordan
You will have the same problem.  You just have to learn to ignore empty rows 
when you query data.  See articles on delete mentioned earlier.

>>> >> >>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
>>> >> >>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts

-Original Message-
From: Kallin Nagelberg [mailto:kallin.nagelb...@gmail.com] 
Sent: Tuesday, February 08, 2011 1:36 PM
To: user@cassandra.apache.org
Subject: Re: time to live rows

I'm thinking if this row expiry notion doesn't pan out then I might
create a 'lastAccessed' column with a secondary index (i think that's
right) on it. Then I can periodically run a query to find all
lastAccessed columns less than a certain value and manually delete
them. Sound reasonable?

-Kal

On Tue, Feb 8, 2011 at 12:09 PM, Kallin Nagelberg
 wrote:
> Yes I did, on the org.apache.cassandra.db.ColumnFamilies.Main.Session object.
>
> -Kal
>
> On Tue, Feb 8, 2011 at 12:00 PM, Sylvain Lebresne  
> wrote:
>> Did you force a major compaction (with jconsole or nodetool) after gc_grace
>> has elapsed ?
>> On Tue, Feb 8, 2011 at 5:46 PM, Kallin Nagelberg
>>  wrote:
>>>
>>> Thanks, gc_grace works in the CLI.
>>>
>>> However, I'm not observing the desired effect. I am setting TTL on a
>>> single column in my column family, and I see the columns disappear
>>> when using 'list Session' (my columnfamily) in the CLI. I created the
>>> column family with gc_grace = 60, and after observing for a few
>>> minutes I am still seeing all the rows come back, none of them with
>>> columns. I was hoping the GC would delete the empty rows.
>>>
>>> -Kal
>>>
>>> On Tue, Feb 8, 2011 at 11:39 AM, Sylvain Lebresne 
>>> wrote:
>>> > Not very logically, It's actually gc_grace, not gc_grace_seconds in the
>>> > CLI.
>>> >
>>> > On Tue, Feb 8, 2011 at 5:34 PM, Kallin Nagelberg
>>> >  wrote:
>>> >>
>>> >> I'm trying to set the gc_grace_seconds column family parameter but no
>>> >> luck.. I got the name of it from the comment in cassandra.yaml:
>>> >>
>>> >> #     - gc_grace_seconds: specifies the time to wait before garbage
>>> >> #        collecting tombstones (deletion markers). defaults to 864000
>>> >> (10
>>> >> #        days). See http://wiki.apache.org/cassandra/DistributedDeletes
>>> >>
>>> >> create column family Session
>>> >>    with comparator = UTF8Type
>>> >>    and keys_cached = 1
>>> >>    and memtable_flush_after = 1440
>>> >>    and memtable_throughput = 32
>>> >>        and gc_grace_seconds = 60;
>>> >>
>>> >> error is 'No enum const class
>>> >>
>>> >>
>>> >> org.apache.cassandra.cli.CliUserHelp$ColumnFamilyArgument.GC_GRACE_SECONDS'.
>>> >>
>>> >> Thanks,
>>> >> -Kal
>>> >>
>>> >> On Tue, Feb 8, 2011 at 11:02 AM, Sylvain Lebresne
>>> >> 
>>> >> wrote:
>>> >> >> I hope you don't consider this a hijack of the thread...
>>> >> >>
>>> >> >> What I'd like to know is the following:
>>> >> >>
>>> >> >> The GC removes TTL rows some time after they expire, at its
>>> >> >> convenience.
>>> >> >> But will they stop being returned as soon as they expire? (This is
>>> >> >> the
>>> >> >> expected behavior...)
>>> >> >
>>> >> > It is the individual column that have TTL. When a column expires, it
>>> >> > becomes
>>> >> > a delete tombstone. Now, a row with tombstones (even only them) will
>>> >> > show
>>> >> > during range request. But the explanation is
>>> >> > here: http://wiki.apache.org/cassandra/FAQ#range_ghosts
>>> >> >
>>> >> >>
>>> >> >> On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg
>>> >> >>  wrote:
>>> >> >>>
>>> >> >>> So the empty row will be ultimately removed then? Is there a way to
>>> >> >>> for the GC to verify this?
>>> >> >>>
>>> >> >>> Thanks,
>>> >> >>> -Kal
>>> >> >>>
>>> >> >>> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
>>> >> >>> > The expired columns were converted into tombstones, which will
>>> >> >>> > live
>>> >> >>> > for
>>> >> >>> > the
>>> >> >>> > GC timeout. The "empty" row will be cleaned up when those
>>> >> >>> > tombstones
>>> >> >>> > are
>>> >> >>> > removed.
>>> >> >>> > Returning the empty row is unfortunate... we'd love to find a
>>> >> >>> > more
>>> >> >>> > appropriate solution that might not involve endless scanning.
>>> >> >>> > See
>>> >> >>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
>>> >> >>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
>>> >> >>> >
>>> >> >>> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
>>> >> >>> >  wrote:
>>> >> >>> >>
>>> >> >>> >> I also tried forcing a major compaction on the column family
>>> >> >>> >> using
>>> >> >>> >> JMX
>>> >> >>> >> but the row remains.
>>> >> >>> >>
>>> >> >>> >> On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg
>>> >> >>> >>  wrote:
>>> >> >>> >> > I tried that but I still see the row coming back on a list
>>> >> >>> >> >  in the CLI. My concern is that there will be a
>>> >> >>> >> > pointer
>>> >> >>> >> > to an empty row for all eternity.
>>> >> >>> >> >
>>> >> >>> >> > -Kal
>>> >> >>> >> >
>>> >> >>> >> > On Mon, Feb 7, 20

Re: Merging the rows of two column families(with similar attributes) into one ??

2011-02-08 Thread Benjamin Coverston



On 2/4/11 11:58 PM, Ertio Lew wrote:

Yes, a disadvantage of more no. of CF in terms of memory utilization
which I see is: -

if some CF is written less often as compared to other CFs, then the
memtable would consume space in the memory until it is flushed, this
memory space could have been much better used by a CF that's heavily
written and read. And if you try to make the thresholds for flush
smaller then more compactions would be needed.


One more disadvantage here is that with CFs that vary widely in write rate
you can also end up with fragmented commit logs, which in some cases we have
seen actually fill up the commit log partition. As a consequence, one thing
to consider would be to lower the commit log flush threshold (in minutes)
for the column families that do not see heavy use.
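
With the 0.7 CLI that threshold can be adjusted per CF in place; a sketch
with a hypothetical CF name and an illustrative value:

    update column family RarelyWritten
        with memtable_flush_after = 60;

Here memtable_flush_after is in minutes, so a lightly used memtable is
flushed at least hourly and its commit log segments can be recycled sooner.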





On Sat, Feb 5, 2011 at 11:58 AM, Ertio Lew  wrote:

Thanks Tyler !

I could not fully understand the reason why more no of column families
would mean more memory.. if you have under control parameters like
memtable_throughput & memtable_operations which are set per column
family basis then you can directly control & adjust by splitting the
memory space between two CFs in proportion to what you would do in
single CF.
Hence there should be no extra memory consumption for multiple CFs
that have been split from single one??

Regarding the compactions, I think even if they are more the size of
the SST files to be compacted is smaller as the data has been split
into two.
Then more compactions but smaller too!!


Then, provided the same amount of data, how could a greater no. of column
families be a bad option (if you split the values of the parameters
for memory consumption proportionately)??

--
Regards,
Ertio





On Sat, Feb 5, 2011 at 10:43 AM, Tyler Hobbs  wrote:

I read somewhere that more no of column families is not a good idea as
it consumes more memory and more compactions to occur

This is primarily true, but not in every case.


But the caching requirements may be different as they cater to two
different features.

This is a great reason to *not* merge them.  Besides the key and row caches,
don't forget about the OS buffer cache.


Is it recommended to merge these two column families into one ?? Thoughts
?

No, this sounds like an anti-pattern to me.  The overhead from having two
separate CFs is not that high.

--
Tyler Hobbs
Software Engineer, DataStax
Maintainer of the pycassa Cassandra Python client library




Re: Merging the rows of two column families(with similar attributes) into one ??

2011-02-08 Thread Ertio Lew
Thanks for adding up Benjamin!

On Wed, Feb 9, 2011 at 1:40 AM, Benjamin Coverston
 wrote:
>
>
> On 2/4/11 11:58 PM, Ertio Lew wrote:
>>
>> Yes, a disadvantage of more no. of CF in terms of memory utilization
>> which I see is: -
>>
>> if some CF is written less often as compared to other CFs, then the
>> memtable would consume space in the memory until it is flushed, this
>> memory space could have been much better used by a CF that's heavily
>> written and read. And if you try to make the thresholds for flush
>> smaller then more compactions would be needed.
>>
>>
> One more disadvantage here is that with CFs that vary widely in the write
> rate you can also end up with fragmented commit logs which in some cases we
> have seen actually fill up the commit log partition. As a consequence one
> thing to consider would be to lower the commit log flush threshold (in
> minutes) to something lower for the column families that do not see heavy
> use.
>
>>
>>
>> On Sat, Feb 5, 2011 at 11:58 AM, Ertio Lew  wrote:
>>>
>>> Thanks Tyler !
>>>
>>> I could not fully understand the reason why more no of column families
>>> would mean more memory.. if you have under control parameters like
>>> memtable_throughput & memtable_operations which are set per column
>>> family basis then you can directly control & adjust by splitting the
>>> memory space between two CFs in proportion to what you would do in
>>> single CF.
>>> Hence there should be no extra memory consumption for multiple CFs
>>> that have been split from single one??
>>>
>>> Regarding the compactions, I think even if they are more the size of
>>> the SST files to be compacted is smaller as the data has been split
>>> into two.
>>> Then more compactions but smaller too!!
>>>
>>>
>>> Then, provided the same amount of data, how can greater no of column
>>> families could be a bad option(if you split the values of parameters
>>> for memory consumption proportionately) ??
>>>
>>> --
>>> Regards,
>>> Ertio
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Feb 5, 2011 at 10:43 AM, Tyler Hobbs  wrote:
>
> I read somewhere that more no of column families is not a good idea as
> it consumes more memory and more compactions to occur

 This is primarily true, but not in every case.

> But the caching requirements may be different as they cater to two
> different features.

 This is a great reason to *not* merge them.  Besides the key and row
 caches,
 don't forget about the OS buffer cache.

> Is it recommended to merge these two column families into one ??
> Thoughts
> ?

 No, this sounds like an anti-pattern to me.  The overhead from having
 two
 separate CFs is not that high.

 --
 Tyler Hobbs
 Software Engineer, DataStax
 Maintainer of the pycassa Cassandra Python client library


>


Re: How do secondary indices work

2011-02-08 Thread Aaron Morton
Moving to the user group.

On 08 Feb, 2011, at 11:39 PM, alta...@ceid.upatras.gr wrote:

Hello,

I'd like some information about how secondary indices work under the hood.

1) Is data stored in some external data structure, or is it stored in an
actual Cassandra table, as columns within column families?
2) Is data stored sorted or not? How is it partitioned?
3) How can I access index data?

Thanks in advance,

Alexander Altanis


Re: cassandra-cli (output) broken for super columns

2011-02-08 Thread Aaron Morton
Can you raise a ticket that includes a script for cassandra-cli to insert data
that reproduces the fault?

Aaron

On 9/02/2011, at 3:11 AM, Timo Nentwig  wrote:

> 
> On Feb 8, 2011, at 13:41, Stephen Connolly wrote:
> 
>> On 8 February 2011 10:38, Timo Nentwig  wrote:
>>> This is not what it's supposed to be like, is it?
> 
> Looks alright:
> 
>>> [default@foo] get foo[page-field];
>>> => (super_column=20110208,
>>>(column=82f4c650-2d53-11e0-a08b-58b035f3f60d, value=msg1, 
>>> timestamp=1297159430471000)
>>>(column=82f4c650-2d53-11e0-a08b-58b035f3f60e, value=msg2, 
>>> timestamp=1297159437423000)
>>>(column=82f4c650-2d53-11e0-a08b-58b035f3f60f, value=msg3, 
>>> timestamp=1297159439855000))
>>> Returned 1 results.
> 
> First half of column 1 is missing and the UUID is not printed correctly anymore:
> 
>>> [default@foo] get foo[page-field][20110208];
>>> , value=msg1, timestamp=1297159430471000)
>>> => (column=???P-S???X?5??, value=msg2, timestamp=1297159437423000)
>>> => (column=???P-S???X?5??, value=msg3, timestamp=1297159439855000)
>>> Returned 3 results.
> 
> Still prints only half of the column:
> 
>>> [default@foo] get 
>>> foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60d];
>>> , value=msg1, timestamp=1297159430471000)
> 
> Applies only to first column?!
> 
>>> [default@foo] get 
>>> foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60e];
>>> => (column=???P-S???X?5??, value=msg2, timestamp=1297159437423000)
>>> 
>>> 
>>>   - name: foo
>>> column_type: Super
>>> compare_with: AsciiType
>>> compare_subcolumns_with: TimeUUIDType
>>> default_validation_class: AsciiType
>> 
>> Is it the ?'s that you are complaining about or is it something else?
>> 
>> If it is the ?'s have you got a mismatch between the character
>> encoding in your shell and UTF-8?
> 
> Nope. See above :) Esp. that the first column isn't printed completely.
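
A cli script along those lines might look like this (a sketch assembled from
the schema and values quoted above):

    set foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60d] = 'msg1';
    set foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60e] = 'msg2';
    set foo[page-field][20110208][82f4c650-2d53-11e0-a08b-58b035f3f60f] = 'msg3';
    get foo[page-field][20110208];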


Re: How do secondary indices work

2011-02-08 Thread Aaron Morton
AFAIK this was the ticket the original work was done under:
https://issues.apache.org/jira/browse/CASSANDRA-1415

Also http://www.datastax.com/docs/0.7/data_model/secondary_indexes and
http://pycassa.github.com/pycassa/tutorial.html#indexes may help.

(sorry, on reflection the email prob did not need to be moved from dev, my bad)

Aaron

On 09 Feb, 2011, at 09:16 AM, Aaron Morton  wrote:

Moving to the user group.

On 08 Feb, 2011, at 11:39 PM, alta...@ceid.upatras.gr wrote:

Hello,

I'd like some information about how secondary indices work under the hood.

1) Is data stored in some external data structure, or is it stored in an
actual Cassandra table, as columns within column families?
2) Is data stored sorted or not? How is it partitioned?
3) How can I access index data?

Thanks in advance,

Alexander Altanis
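
For reference, a KEYS index is declared per column with the 0.7 CLI, along
these lines (CF and column names are assumptions; under the hood each index
is kept in a node-local index column family):

    update column family Users
        with column_metadata = [{column_name: state,
                                 validation_class: UTF8Type,
                                 index_type: KEYS}];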


RE: Subcolumn Indexing

2011-02-08 Thread Jeremy.Truelove
Thanks, I just wanted to make sure I understood how it worked. Sounds like the
additional mapping, rather than the super column method, will work better for
my purposes.

From: Benjamin Coverston [mailto:ben.covers...@datastax.com]
Sent: Tuesday, February 08, 2011 2:22 PM
To: user@cassandra.apache.org
Subject: Re: Subcolumn Indexing



> > Does this just mean the exhaustive list of the column names, not all the
> > values?
>
> No, this means the entire supercolumn, names and values. When the client
> tries to access any subcolumn in the supercolumn it has to read the entire
> supercolumn.
>
> > So if I have a super column that has a map of keys that only contain two
> > columns max each, this shouldn't really be a performance concern, correct?
> > This becomes an issue when you have lots of subcolumns, if I'm reading
> > this correctly? I'm looking at using the super column as a good way to
> > cluster data; say I was storing home addresses, I might use the zipcode as
> > the super column if I cared mostly about accessing data by logical area,
> > for instance. Thanks for the help.
>
> That is one way to logically group the values, but I think that a simpler
> solution may be to store the home address as a single row in a column
> family, then use a dynamic column family to store references to those
> addresses per zip code.



jt


Re: time to live rows

2011-02-08 Thread Kallin Nagelberg
I did read those articles, but I didn't know that deleting all
the columns on a row was equivalent to deleting the row. Like I
mentioned, I did delete all the columns from all my rows and then
forced compaction before and after gc_grace had passed, but all the
rows still exist. If they never disappear, then won't I run out of
resources eventually?

-Kal

On Tue, Feb 8, 2011 at 3:09 PM, Jeremiah Jordan
 wrote:
> You will have the same problem.  You just have to learn to ignore empty rows 
> when you query data.  See articles on delete mentioned earlier.
>
 >> >>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
 >> >>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
>
> -Original Message-
> From: Kallin Nagelberg [mailto:kallin.nagelb...@gmail.com]
> Sent: Tuesday, February 08, 2011 1:36 PM
> To: user@cassandra.apache.org
> Subject: Re: time to live rows
>
> I'm thinking if this row expiry notion doesn't pan out then I might
> create a 'lastAccessed' column with a secondary index (i think that's
> right) on it. Then I can periodically run a query to find all
> lastAccessed columns less than a certain value and manually delete
> them. Sound reasonable?
>
> -Kal
>
> On Tue, Feb 8, 2011 at 12:09 PM, Kallin Nagelberg
>  wrote:
>> Yes I did, on the org.apache.cassandra.db.ColumnFamilies.Main.Session object.
>>
>> -Kal
>>
>> On Tue, Feb 8, 2011 at 12:00 PM, Sylvain Lebresne  
>> wrote:
>>> Did you force a major compaction (with jconsole or nodetool) after gc_grace
>>> has elapsed ?
>>> On Tue, Feb 8, 2011 at 5:46 PM, Kallin Nagelberg
>>>  wrote:

 Thanks, gc_grace works in the CLI.

 However, I'm not observing the desired effect. I am setting TTL on a
 single column in my column family, and I see the columns disappear
 when using 'list Session' (my columnfamily) in the CLI. I created the
 column family with gc_grace = 60, and after observing for a few
 minutes I am still seeing all the rows come back, none of them with
 columns. I was hoping the GC would delete the empty rows.

 -Kal

 On Tue, Feb 8, 2011 at 11:39 AM, Sylvain Lebresne 
 wrote:
 > Not very logically, It's actually gc_grace, not gc_grace_seconds in the
 > CLI.
 >
 > On Tue, Feb 8, 2011 at 5:34 PM, Kallin Nagelberg
 >  wrote:
 >>
 >> I'm trying to set the gc_grace_seconds column family parameter but no
 >> luck.. I got the name of it from the comment in cassandra.yaml:
 >>
 >> #     - gc_grace_seconds: specifies the time to wait before garbage
 >> #        collecting tombstones (deletion markers). defaults to 864000
 >> (10
 >> #        days). See http://wiki.apache.org/cassandra/DistributedDeletes
 >>
 >> create column family Session
 >>    with comparator = UTF8Type
 >>    and keys_cached = 1
 >>    and memtable_flush_after = 1440
 >>    and memtable_throughput = 32
 >>        and gc_grace_seconds = 60;
 >>
 >> error is 'No enum const class
 >>
 >>
 >> org.apache.cassandra.cli.CliUserHelp$ColumnFamilyArgument.GC_GRACE_SECONDS'.
 >>
 >> Thanks,
 >> -Kal
 >>
 >> On Tue, Feb 8, 2011 at 11:02 AM, Sylvain Lebresne
 >> 
 >> wrote:
 >> >> I hope you don't consider this a hijack of the thread...
 >> >>
 >> >> What I'd like to know is the following:
 >> >>
 >> >> The GC removes TTL rows some time after they expire, at its
 >> >> convenience.
 >> >> But will they stop being returned as soon as they expire? (This is
 >> >> the
 >> >> expected behavior...)
 >> >
 >> > It is the individual column that have TTL. When a column expires, it
 >> > becomes
 >> > a delete tombstone. Now, a row with tombstones (even only them) will
 >> > show
 >> > during range request. But the explanation is
 >> > here: http://wiki.apache.org/cassandra/FAQ#range_ghosts
 >> >
 >> >>
 >> >> On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg
 >> >>  wrote:
 >> >>>
 >> >>> So the empty row will be ultimately removed then? Is there a way to
 >> >>> for the GC to verify this?
 >> >>>
 >> >>> Thanks,
 >> >>> -Kal
 >> >>>
 >> >>> On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood  wrote:
 >> >>> > The expired columns were converted into tombstones, which will
 >> >>> > live
 >> >>> > for
 >> >>> > the
 >> >>> > GC timeout. The "empty" row will be cleaned up when those
 >> >>> > tombstones
 >> >>> > are
 >> >>> > removed.
 >> >>> > Returning the empty row is unfortunate... we'd love to find a
 >> >>> > more
 >> >>> > appropriate solution that might not involve endless scanning.
 >> >>> > See
 >> >>> > http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives
 >> >>> > http://wiki.apache.org/cassandra/FAQ#range_ghosts
 >> >>> >
 >> >>> > On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg
 >> >>> >  wrote:
 

Re: Finding the intersection results of column sets of two rows

2011-02-08 Thread Aaron Morton
Makes sense, use a get_slice() against the second row and pass in the column 
names. Should be fine.

If you run into performance issues look at slice_buffer_size and 
column_index_size in the config.

Aaron
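
A minimal sketch of that read pattern with pycassa (a pycassa 1.x-style API is assumed; the keyspace, column family and key names are placeholders, not from this thread):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    connections = pycassa.ColumnFamily(pool, 'UserConnections')
    members = pycassa.ColumnFamily(pool, 'GroupMembers')

    # Fetch the ~100-250 column names from the small row.
    wanted = list(connections.get('user123', column_count=250).keys())

    # Slice by name against the large row: Cassandra returns only the
    # columns that actually exist there, which is exactly the intersection.
    try:
        common = list(members.get('group456', columns=wanted).keys())
    except pycassa.NotFoundException:
        common = []  # none of the requested columns exist in this row

The million-column row is never scanned in full; only the named columns are looked up.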


On 9/02/2011, at 5:16 AM, Aklin_81  wrote:

> Amongst two rows, I need to find the common columns. I will not
> have more than 200 columns(in 99% cases) for the 1st row. But the 2nd
> row where I need to find these columns may have even around a million
> valueless columns.
> 
> A point to note: these calculations are all done when **writing the
> data that has been collected from the presentation layer to the database**,
> & not while presenting the data.
> 
> I am using the results of such an intersection to find the rows (that are
> pointed by names of common columns) that I should write to. The
> calculations are done after a Post is submitted by a user, in a
> discussions forum. Actually this is used to find out the mutual
> connections in a group & write to the rows pointed by common columns.
> 1st row represents the connection list of a user, which is not going
> to be more than 100-250 columns for my case & 2nd row represents the
> members of a group, which may contain a million columns as I said.
> I find the mutual connections in a group(by finding the common columns
> in the above two rows) and then write to the rows of those users.
> 
> Can't I run a batch query to ask for all the columns that I picked up from
> the 1st row and ask for them in the 2nd row?
> 
> Is there any better way ?
> 
> Asil
> 
> 
>> 
>> On Feb 7, 2011, at 12:30 AM, Aklin_81 wrote:
>> 
>>> Thanks Aaron & Shaun,
>>> 
>>> **
>>> I think my question might have been unclear to some of you. So I would
> again explain my problem (& the solution which I thought of) for the sake
>>> of clarity:-
>>> 
>>> Consider I have 2 rows.  1st row contains 60-70 columns and 2nd row
>>> contains hundreds of thousands of columns. Both the column sets
>>> are all valueless. I need to just find out the **common column names**
>>> in the two rows. **These two rows are known to me**. So what I plan to
>>> do is, I just pick up all **columns (names)** of 1st row (60 -70
>>> columns) and just ask for them in 2nd row, whatever column names I get
>>> back is my result.
>>> Would there be any problem with this solution ? This is how I am
>>> expecting to get common column names.
>>> 
>>> Please do not consider it a JOIN case, as that leads to unnecessary
>>> confusion; I just need common column names from valueless columns in
>>> the two rows.
>>> 
>>> 
>>> 
>>> Aaron, actually the intersection data is very much context based. So
>>> say if there are 10 million rows in CF A & 1 million in CF B, then
>>> intersection data would contain 10 million * 1 million rows. This
>>> would involve very huge & unaffordable amounts of denormalization.
>>> And finding columns in the client would require pulling unnecessary
>>> columns, like pulling 100,000 columns from a row of which only 60-70
>>> are required.
>>> 
>>> Shaun, I hope my above clarification has clarified things a bit. Yes,
>>> the rows, of which I need to find common columns are known to me.
>>> 
>>> 
>>> Thank you all,
>>> Asil
>>> 
>>> 
>>> On Mon, Feb 7, 2011 at 3:53 AM, Shaun Cutts  wrote:
 In theory, you should be able to do joins by creating an extra column in 
 one column family, holding the "foreign key" of the matching row in the 
 other family.
 
 This assumes that the info you are joining on is available in both CFs (is 
 not some sort of functional transformation).
 
 I have just found that the implementation for secondary indexes is not yet 
 very close to optimal for more complex "joins" involving multiple indexes; 
 I'm not sure if that affects you as you didn't say what you are joining on.
 
 -- Shaun
 
 
 On Feb 6, 2011, at 4:22 PM, Aaron Morton wrote:
 
> Is it possible for you to denormalise and write all the intersection 
> values? Will depend on how many I guess.
> 
> The other alternative is to pull back more data than you need and do the 
> intersection in code in the client.
> 
> 
> Hope that helps.
> Aaron
> On 7/02/2011, at 7:11 AM, Aklin_81  wrote:
> 
>> Hi,
>> 
>> @buddhasystem : yes, that's a well-known solution. But obviously when
>> MySQL couldn't satisfy my needs, I came here. My question is in the context
>> of Cassandra: is it possible to achieve the intersection result set of
>> columns in two rows, in the way I spoke about?
>> 
>> @Edward: yes that I know but how does that fit here for obtaining the
>> common columns among two rows.
>> 
>> Thanks for your comments..
>> 
>> -Asil
>> 
>> 
>> On Sun, Feb 6, 2011 at 9:55 PM, Edward Capriolo  
>> wrote:
>>> On Sun, Feb 6, 2011 at 10:15 AM, buddhasystem  wrote:
 
 Hello,
 
>>>

Re: time to live rows

2011-02-08 Thread Benjamin Coverston


On 2/8/11 1:23 PM, Kallin Nagelberg wrote:

I did read those articles, but I didn't know that deleting all
the columns on a row was equivalent to deleting the row. Like I
mentioned, I did delete all the columns from all my rows and then
forced compaction before and after gc_grace had passed, but all the
rows still exist. If they never disappear, then won't I run out of
resources eventually?

-Kal

You would, if there weren't a way to get rid of tombstones:

http://wiki.apache.org/cassandra/DistributedDeletes

--
Ben Coverston
DataStax -- The Apache Cassandra Company
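
To make that lifecycle concrete, a minimal pycassa sketch of a TTL column turning into a tombstone (the pool, column family and key names are placeholders; the ttl argument is assumed to be available in your client, as it is in pycassa 1.x against 0.7):

    import time
    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    session = pycassa.ColumnFamily(pool, 'Session')

    # Write a column that expires after 10 seconds.
    session.insert('row1', {'token': 'abc'}, ttl=10)

    time.sleep(11)

    # The expired column no longer comes back from a direct read...
    try:
        session.get('row1', columns=['token'])
    except pycassa.NotFoundException:
        print('column expired')

    # ...but the row key can still show up in range scans as an empty
    # "range ghost" until gc_grace passes and compaction removes the
    # tombstone (see the FAQ links above).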



Re: Finding the intersection results of column sets of two rows

2011-02-08 Thread Aklin_81
Thank you so much Aaron !!

On Wed, Feb 9, 2011 at 2:11 AM, Aaron Morton  wrote:
> Makes sense, use a get_slice() against the second row and pass in the column 
> names. Should be fine.
>
> If you run into performance issues look at slice_buffer_size and 
> column_index_size in the config.
>
> Aaron
>
>
> On 9/02/2011, at 5:16 AM, Aklin_81  wrote:
>
>> Amongst two rows, I need to find the common columns. I will not
>> have more than 200 columns(in 99% cases) for the 1st row. But the 2nd
>> row where I need to find these columns may have even around a million
>> valueless columns.
>>
>> A point to note: these calculations are all done when **writing the
>> data that has been collected from the presentation layer to the database**,
>> & not while presenting the data.
>>
>> I am using the results of such an intersection to find the rows (that are
>> pointed by names of common columns) that I should write to. The
>> calculations are done after a Post is submitted by a user, in a
>> discussions forum. Actually this is used to find out the mutual
>> connections in a group & write to the rows pointed by common columns.
>> 1st row represents the connection list of a user, which is not going
>> to be more than 100-250 columns for my case & 2nd row represents the
>> members of a group, which may contain a million columns as I said.
>> I find the mutual connections in a group(by finding the common columns
>> in the above two rows) and then write to the rows of those users.
>>
>> Can't I run a batch query to ask for all the columns that I picked up from
>> the 1st row and ask for them in the 2nd row?
>>
>> Is there any better way ?
>>
>> Asil
>>
>>
>>>
>>> On Feb 7, 2011, at 12:30 AM, Aklin_81 wrote:
>>>
 Thanks Aaron & Shaun,

 **
 I think my question might have been unclear to some of you. So I would
 again explain my problem (& the solution which I thought of) for the sake
 of clarity:-

 Consider I have 2 rows.  1st row contains 60-70 columns and 2nd row
 contains hundreds of thousands of columns. Both the column sets
 are all valueless. I need to just find out the **common column names**
 in the two rows. **These two rows are known to me**. So what I plan to
 do is, I just pick up all **columns (names)** of 1st row (60 -70
 columns) and just ask for them in 2nd row, whatever column names I get
 back is my result.
 Would there be any problem with this solution ? This is how I am
 expecting to get common column names.

 Please do not consider it a JOIN case, as that leads to unnecessary
 confusion; I just need common column names from valueless columns in
 the two rows.

 

 Aaron, actually the intersection data is very much context based. So
 say if there are 10 million rows in CF A & 1 million in CF B, then
 intersection data would contain 10 million * 1 million rows. This
 would involve very huge & unaffordable amounts of denormalization.
 And finding columns in the client would require pulling unnecessary
 columns, like pulling 100,000 columns from a row of which only 60-70
 are required.

 Shaun, I hope my above clarification has clarified things a bit. Yes,
 the rows, of which I need to find common columns are known to me.


 Thank you all,
 Asil


 On Mon, Feb 7, 2011 at 3:53 AM, Shaun Cutts  wrote:
> In theory, you should be able to do joins by creating an extra column in 
> one column family, holding the "foreign key" of the matching row in the 
> other family.
>
> This assumes that the info you are joining on is available in both CFs 
> (is not some sort of functional transformation).
>
> I have just found that the implementation for secondary indexes is not 
> yet very close to optimal for more complex "joins" involving multiple 
> indexes; I'm not sure if that affects you as you didn't say what you are 
> joining on.
>
> -- Shaun
>
>
> On Feb 6, 2011, at 4:22 PM, Aaron Morton wrote:
>
>> Is it possible for you to denormalise and write all the intersection 
>> values? Will depend on how many I guess.
>>
>> The other alternative is to pull back more data than you need and do the 
>> intersection in code in the client.
>>
>>
>> Hope that helps.
>> Aaron
>> On 7/02/2011, at 7:11 AM, Aklin_81  wrote:
>>
>>> Hi,
>>>
>>> @buddhasystem : yes, that's a well-known solution. But obviously when
>>> MySQL couldn't satisfy my needs, I came here. My question is in the context
>>> of Cassandra: is it possible to achieve the intersection result set of
>>> columns in two rows, in the way I spoke about?
>>>
>>> @Edward: yes that I know but how does that fit here for obtaining the
>>> common columns among two rows.
>>>
>>> Thanks for your comments..
>>>
>>> -Asil
>

Re: time to live rows

2011-02-08 Thread Kallin Nagelberg
What's the secret recipe that I'm missing? I tried forcing compaction
on my column family's JMX bean
(org.apache.cassandra.db.ColumnFamilies.Main.Session) in jconsole,
after gc_grace had passed (i set it to 60).

Thanks,
-Kal
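
For reference, the gc_grace change discussed earlier in this thread can also be made programmatically; a sketch using pycassa's SystemManager (the keyspace name is a placeholder, and note the attribute is spelled gc_grace_seconds at the CfDef level even though the CLI argument is gc_grace):

    from pycassa.system_manager import SystemManager

    sys_mgr = SystemManager('localhost:9160')
    # 'MyKeyspace' stands in for the real keyspace name.
    sys_mgr.alter_column_family('MyKeyspace', 'Session', gc_grace_seconds=60)
    sys_mgr.close()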

On Tue, Feb 8, 2011 at 3:46 PM, Benjamin Coverston
 wrote:
>
> On 2/8/11 1:23 PM, Kallin Nagelberg wrote:
>>
>> I did read those articles, but I didn't know that deleting all
>> the columns on a row was equivalent to deleting the row. Like I
>> mentioned, I did delete all the columns from all my rows and then
>> forced compaction before and after gc_grace had passed, but all the
>> rows still exist. If they never disappear, then won't I run out of
>> resources eventually?
>>
>> -Kal
>
> You would, if there weren't a way to get rid of tombstones:
>
> http://wiki.apache.org/cassandra/DistributedDeletes
>
> --
> Ben Coverston
> DataStax -- The Apache Cassandra Company
>
>


error casting Column to SuperColumn during compaction. ? CASSANDRA-1992 ?

2011-02-08 Thread Aaron Morton
I got the error below on a newish 0.7.0 cluster with the following...

- no schema changes.
- original RF at 1, changed to 3 via cassandra-cli and repair run
- stable node membership, i.e. no nodes added

Was thinking it may have to do with CASSANDRA-1992 (see http://www.mail-archive.com/user@cassandra.apache.org/msg09276.html) but I've not seen this error associated with that issue before. I can apply the patch or run the head on the 0.7 branch if that will help.

May be able to dig into it further later today. (not ruling out me doing stupid things yet)

Aaron

ERROR [CompactionExecutor:1] 2011-02-08 16:59:35,380 AbstractCassandraDaemon.java (line org.apache.cassandra.service.AbstractCassandraDaemon) Fatal exception in thread Thread[CompactionExecutor:1,1,main]
java.lang.ClassCastException: org.apache.cassandra.db.Column cannot be cast to org.apache.cassandra.db.SuperColumn
        at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:329)
        at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:313)
        at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
        at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:106)
        at org.apache.cassandra.io.PrecompactedRow.(PrecompactedRow.java:97)
        at org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:138)
        at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:107)
        at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:42)
        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
        at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
        at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:323)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:122)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:92)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

Then after restart started getting...

ERROR [CompactionExecutor:1] 2011-02-09 10:39:15,496 PrecompactedRow.java (line org.apache.cassandra.io.PrecompactedRow.(PrecompactedRow.java:82)) Skipping row DecoratedKey(71445620198865609512646197760056087250, 2f77686174696674686973776572656d617374657266696c652f746e6e5f73686f74735f3035305f3030352f6d664964) in /local1/cassandra/data/junkbox/ObjectIndex-e-216-Data.db
java.io.IOException: Corrupt (negative) value length encountered
        at org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:274)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:94)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:137)
        at org.apache.cassandra.io.PrecompactedRow.(PrecompactedRow.java:78)
        at org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:138)
        at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:107)
        at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:42)
        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
        at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
        at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:323)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:122)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:92)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at

Re: Cassandra memory consumption

2011-02-08 Thread Aaron Morton
When you attach to the JVM with JConsole how much non heap memory and how much heap memory is reported on the memory tab?

Xmx controls the total size of the heap memory, which excludes the permanent generation. See
http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#generation_sizing
and
http://blogs.sun.com/jonthecollector/entry/presenting_the_permanent_generation

Total non-heap memory on a 0.7 box I have is around 27M. Your numbers seem large but it would be interesting to know what the JVM is reporting.

Aaron

On 09 Feb, 2011, at 05:57 AM, Victor Kabdebon  wrote:

Information on the system :

Debian 5
Jvm :
victor@testhost:~/database/apache-cassandra-0.6.6$ java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)

RAM : 2Go

2011/2/8 Victor Kabdebon 

Sorry Jonathan :

So most of this information was taken using the command :

sudo ps aux | grep cassandra

For the nodetool information it is :

/bin/nodetool --host localhost --port 8081 info

Regards,
Victor K.

2011/2/8 Jonathan Ellis 



I missed the part where you explained where you're getting your numbers from.

On Tue, Feb 8, 2011 at 9:32 AM, Victor Kabdebon
 wrote:
> It is really weird that I am the only one to have this issue.
> I restarted Cassandra today and already the memory consumption is over the
> limit :
>
> root  1739  4.0 24.5 664968 494996 pts/4   SLl  15:51   0:12
> /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dstorage-config=bin/../conf -cp
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar




> org.apache.cassandra.thrift.CassandraDaemon
>
> It is really an annoying problem if we cannot really foresee memory
> consumption.
>
> Best regards,
> Victor K
>
> 2011/2/8 Victor Kabdebon 
>>
>> Dear all,
>>
>> Sorry to come back again to this point but I am really worried about
>> Cassandra memory consumption. I have a single machine that runs one
>> Cassandra server. There is almost no data on it but I see a crazy memory
>> consumption and it doesn't care at all about the instructions...
>> Note that I am not using mmap, but "Standard", I use also JNA (inside lib
>> folder), i am running on debian 5 64 bits, so a pretty normal configuration.
>> I also use Cassandra 0.6.8.
>>
>>
>> Here are the informations I gathered on Cassandra :
>>
>> 105  16765  0.1 34.1 1089424 687476 ?  Sl   Feb02  14:58
>> /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sunmanagement.jmxremote.port=8081
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
>> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4

Re: Cassandra memory consumption

2011-02-08 Thread Victor Kabdebon
I will do that in the future and I will post my results here ( I upgraded
the server to debian 6 to see if there is any change, so memory is back to
normal). I will report in a few days.
In the meantime I am open to any suggestion...

2011/2/8 Aaron Morton 

> When you attach to the JVM with JConsole how much non heap memory and how
> much heap memory is reported on the memory tab?
>
> Xmx controls the total size of the heap memory, which excludes the
> permanent generation.
> see
>
> http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#generation_sizing
> and
>
> http://blogs.sun.com/jonthecollector/entry/presenting_the_permanent_generation
>
> 
> Total non-heap memory on a 0.7 box I have is around 27M. Your numbers seem
> large but it would be interesting to know what the JVM is reporting.
>
> Aaron
>
> On 09 Feb, 2011, at 05:57 AM, Victor Kabdebon 
> wrote:
>
> Information on the system :
>
> *Debian 5*
> *Jvm :*
> victor@testhost:~/database/apache-cassandra-0.6.6$ java -version
> java version "1.6.0_22"
> Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
>
> *RAM :* 2Go
>
>
> 2011/2/8 Victor Kabdebon 
>
>> Sorry Jonathan :
>>
>> So most of this information was taken using the command :
>>
>> sudo ps aux | grep cassandra
>>
>> For the nodetool information it is :
>>
>> /bin/nodetool --host localhost --port 8081 info
>>
>>
>> Regards,
>>
>> Victor K.
>>
>>
>> 2011/2/8 Jonathan Ellis 
>>
>>
>> I missed the part where you explained where you're getting your numbers
>>> from.
>>>
>>>
>>> On Tue, Feb 8, 2011 at 9:32 AM, Victor Kabdebon
>>>  wrote:
>>> > It is really weird that I am the only one to have this issue.
>>> > I restarted Cassandra today and already the memory consumption is over
>>> the
>>> > limit :
>>> >
>>> > root  1739  4.0 24.5 664968 494996 pts/4   SLl  15:51   0:12
>>> > /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC
>>> > -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
>>> -XX:MaxTenuringThreshold=1
>>> > -XX:CMSInitiatingOccupancyFraction=75
>>> -XX:+UseCMSInitiatingOccupancyOnly
>>> > -XX:+HeapDumpOnOutOfMemoryError
>>> -Dcom.sun.management.jmxremote.port=8081
>>> > -Dcom.sun.management.jmxremote.ssl=false
>>>
>>> > -Dcom.sun.management.jmxremote.authenticate=false
>>> > -Dstorage-config=bin/../conf -cp
>>> >
>>> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
>>>
>>> > org.apache.cassandra.thrift.CassandraDaemon
>>> >
>>> > It is really an annoying problem if we cannot really foresee memory
>>> > consumption.
>>> >
>>> > Best regards,
>>> > Victor K
>>> >
>>> > 2011/2/8 Victor Kabdebon 
>>> >>
>>> >> Dear all,
>>> >>
>>> >> Sorry to come back again to this point but I am really worried about
>>> >> Cassandra memory consumption. I have a single machine that runs one
>>> >> Cassandra server. There is almost no data on it but I see a crazy
>>> memory
>>> >> consumption and it doesn't care at all about the instructions...
>>> >> Note that I am not using mmap, but "Standard", I use also JNA (inside
>>> lib
>>> >> folder), i am running on debian 5 64 bits, so a pretty normal
>>> configuration.
>>> >> I also use Cassandra 0.6.8.
>>> >>
>>> >>
>>> >> Here are the informations I gathered on Cassandra :
>>> >>
>>> >> 105  16765  0.1 34.1 1089424 687476 ?  Sl   Feb02  14:58
>>>
>>> >> /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC
>>> >> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
>>> -XX:MaxTenuringThreshold=1
>>> >> -XX:CMSInitiatingOccupancyFraction=75
>>> -XX:+UseCMSInitiatingOccupancyOnly
>>> >> -XX:+HeapDumpOnOutOfMemoryError
>>> -Dcom.sun.management.jmxremote.port=8081
>>>
>>> >> -Dcom.sun.management.jmxremote.ssl=false
>>> >> -Dcom.sun.management.jmxremote.authenticate=false
>>> >> -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
>>> >>
>>> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apach

Re: error casting Column to SuperColumn during compaction. ? CASSANDRA-1992 ?

2011-02-08 Thread Jonathan Ellis
Looks like https://issues.apache.org/jira/browse/CASSANDRA-1992.

On Tue, Feb 8, 2011 at 3:40 PM, Aaron Morton  wrote:
> I got the error below on a newish 0.7.0 cluster with the following...
> - no schema changes.
> - original RF at 1, changed to 3 via cassandra-cli and repair run
> - stable node membership, i.e. no nodes added
> Was thinking it may have to do with  CASSANDRA-1992
> (see http://www.mail-archive.com/user@cassandra.apache.org/msg09276.html)
> but I've not seen this error associated with that issue before. I can apply
> the patch or run the head on the 0.7 branch if that will help.
> May be able to dig into it further later today. (not ruling out me doing
> stupid things yet)
> Aaron
> ERROR [CompactionExecutor:1] 2011-02-08 16:59:35,380
> AbstractCassandraDaemon.java (line
> org.apache.cassandra.service.AbstractCassandraDaemon) Fatal exception in
> thread Thread[CompactionExecutor:1,1,main]
> java.lang.ClassCastException: org.apache.cassandra.db.Column cannot be cast
> to org.apache.cassandra.db.SuperColumn
>         at
> org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:329)
>         at
> org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:313)
>         at
> org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
>         at
> org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:106)
>         at
> org.apache.cassandra.io.PrecompactedRow.(PrecompactedRow.java:97)
>         at
> org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:138)
>         at
> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:107)
>         at
> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:42)
>         at
> org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
>         at
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
>         at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
>         at
> org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
>         at
> org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
>         at
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:323)
>         at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:122)
>         at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:92)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
>
> Then after restart started getting...
>
> ERROR [CompactionExecutor:1] 2011-02-09 10:39:15,496 PrecompactedRow.java
> (line
> org.apache.cassandra.io.PrecompactedRow.(PrecompactedRow.java:82))
> Skipping row DecoratedKey(71445620198865609512646197760056087250,
> 2f77686174696674686973776572656d617374657266696c652f746e6e5f73686f74735f3035305f3030352f6d664964)
> in /local1/cassandra/data/junkbox/ObjectIndex-e-216-Data.db
> java.io.IOException: Corrupt (negative) value length encountered
>         at
> org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:274)
>         at
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:94)
>         at
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
>         at
> org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
>         at
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:137)
>         at
> org.apache.cassandra.io.PrecompactedRow.(PrecompactedRow.java:78)
>         at
> org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:138)
>         at
> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:107)
>         at
> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:42)
>         at
> org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
>         at
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
>         at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
>         at
> org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
>         at
> org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
>         at
> org.apache.cassandra.db.CompactionManager.doCompacti

cassandra memory is huge

2011-02-08 Thread Blaze Dowis
Why is it that when I start Cassandra, it takes up to 1G of memory? And how
can I lessen this? Here is a small portion of the startup dump.

INFO 12:33:45,539 Creating new commitlog segment
/var/lib/cassandra/commitlog/CommitLog-1297208025539.log
 INFO 12:34:00,034 switching in a fresh Memtable for 10Level at
CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1297208025539.log',
position=16679008)
 INFO 12:34:00,034 Enqueuing flush of Memtable-10Level@2030115017(18327149
bytes, 584909 operations)
 INFO 12:34:00,035 Writing Memtable-10Level@2030115017(18327149 bytes,
584909 operations)
 INFO 12:34:04,186 Completed flushing
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-54-Data.db (53616697 bytes)
 INFO 12:34:04,187 Discarding obsolete commit
log:CommitLogSegment(/var/lib/cassandra/commitlog/CommitLog-1297207905966.log)
Killed
bdow009@gl4-499-b4-linux:~$
Desktop/WorkCassandra/apache-cassandra-0.7.0-rc1/bin/cassandra -f
 INFO 12:40:07,820 Heap size: 1987182592/2006843392
 WARN 12:40:08,007 Unable to lock JVM memory (ENOMEM). This can result in
part of the JVM being swapped out, especially with mmapped I/O enabled.
Increase RLIMIT_MEMLOCK or run Cassandra as root.
 INFO 12:40:08,037 Loading settings from
file:/home/bdow009/Desktop/WorkCassandra/apache-cassandra-0.7.0-rc1/conf/cassandra.yaml
 INFO 12:40:08,170 DiskAccessMode 'auto' determined to be mmap,
indexAccessMode is mmap
 INFO 12:40:08,279 Creating new commitlog segment
/var/lib/cassandra/commitlog/CommitLog-1297208408279.log
 INFO 12:40:08,388 Deleted
/var/lib/cassandra/data/system/LocationInfo-e-143-<>
 INFO 12:40:08,398 Deleted
/var/lib/cassandra/data/system/LocationInfo-e-144-<>
 INFO 12:40:08,399 Deleted
/var/lib/cassandra/data/system/LocationInfo-e-145-<>
 INFO 12:40:08,404 Deleted
/var/lib/cassandra/data/system/LocationInfo-e-147-<>
 INFO 12:40:08,405 Deleted
/var/lib/cassandra/data/system/LocationInfo-e-146-<>
 INFO 12:40:08,464 read 1 from saved key cache
 INFO 12:40:08,468 Sampling index for
/var/lib/cassandra/data/system/IndexInfo-e-1-<>
 INFO 12:40:08,507 read 5 from saved key cache
 INFO 12:40:08,509 Sampling index for
/var/lib/cassandra/data/system/Schema-e-1166-<>
 INFO 12:40:08,572 Sampling index for
/var/lib/cassandra/data/system/Schema-e-1167-<>
 INFO 12:40:08,581 Sampling index for
/var/lib/cassandra/data/system/Schema-e-1168-<>
 INFO 12:40:08,596 read 0 from saved key cache
 INFO 12:40:08,597 Sampling index for
/var/lib/cassandra/data/system/Migrations-e-1169-<>
 INFO 12:40:08,604 read 1 from saved key cache
 INFO 12:40:08,605 Sampling index for
/var/lib/cassandra/data/system/LocationInfo-e-148-<>
 INFO 12:40:08,637 read 0 from saved key cache
 INFO 12:40:08,642 loading row cache for LocationInfo of system
 INFO 12:40:08,642 completed loading (0 ms; 0 keys)  row cache for
LocationInfo of system
 INFO 12:40:08,643 loading row cache for HintsColumnFamily of system
 INFO 12:40:08,643 completed loading (0 ms; 0 keys)  row cache for
HintsColumnFamily of system
 INFO 12:40:08,643 loading row cache for Migrations of system
 INFO 12:40:08,644 completed loading (0 ms; 0 keys)  row cache for
Migrations of system
 INFO 12:40:08,644 loading row cache for Schema of system
 INFO 12:40:08,644 completed loading (0 ms; 0 keys)  row cache for Schema of
system
 INFO 12:40:08,645 loading row cache for IndexInfo of system
 INFO 12:40:08,645 completed loading (1 ms; 0 keys)  row cache for IndexInfo
of system
 INFO 12:40:08,724 Loading schema version
b27346aa-33d3-11e0-93bb-e700f669bcfc
 WARN 12:40:08,940 Schema definitions were defined both locally and in
cassandra.yaml. Definitions in cassandra.yaml were ignored.
 INFO 12:40:09,105 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-39-<>
 INFO 12:40:09,106 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-43-<>
 INFO 12:40:09,106 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-40-<>
 INFO 12:40:09,106 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-38-<>
 INFO 12:40:09,107 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-27-<>
 INFO 12:40:09,107 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-37-<>
 INFO 12:40:09,107 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-28-<>
 INFO 12:40:09,108 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-15-<>
 INFO 12:40:09,108 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-48-<>
 INFO 12:40:09,108 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-23-<>
 INFO 12:40:09,109 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-34-<>
 INFO 12:40:09,109 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-49-<>
 INFO 12:40:09,110 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-20-<>
 INFO 12:40:09,110 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-44-<>
 INFO 12:40:09,110 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-33-<>
 INFO 12:40:09,111 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-51-<>
 INFO 12:40:09,111 Deleted
/var/lib/cassandra/data/SUIDDataKsp/10Level-e-21-<>
 INF

Re: cassandra memory is huge

2011-02-08 Thread Joshua Partogi
Do you have loads of data? 1GB is quite reasonable knowing that 8GB is
the recommended RAM size
http://wiki.apache.org/cassandra/CassandraHardware

Kind regards,
Joshua.

On Wed, Feb 9, 2011 at 10:48 AM, Blaze Dowis  wrote:
> Why is it that when I start Cassandra, it takes up to 1G of memory? And
> how can I lessen this? Here is a small portion of the startup dump.




-- 
http://twitter.com/jpartogi


Re: cassandra memory is huge

2011-02-08 Thread Aaron Morton
the JVM heap size is set in conf/cassandra-env.sh. If not set it will use half the system memory.

Aaron

On 09 Feb, 2011, at 01:33 PM, Joshua Partogi  wrote:

Do you have loads of data? 1GB is quite reasonable knowing that 8GB is
the recommended RAM size
http://wiki.apache.org/cassandra/CassandraHardware

Kind regards,
Joshua.

On Wed, Feb 9, 2011 at 10:48 AM, Blaze Dowis  wrote:
> Why is it that when I start Cassandra, it takes up to 1G of memory? And
> how can I lessen this? Here is a small portion of the startup dump.




-- 
http://twitter.com/jpartogi
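
A rough sketch of that "half the system memory" default, for anyone who wants to predict it (this paraphrases what the 0.7 cassandra-env.sh computes on Linux; it is not the actual script):

    # Approximate the MAX_HEAP_SIZE cassandra-env.sh would pick by default:
    # half of the machine's physical memory, read from /proc/meminfo.
    def default_heap_mb():
        with open('/proc/meminfo') as f:
            for line in f:
                if line.startswith('MemTotal:'):
                    total_kb = int(line.split()[1])
                    return total_kb // 1024 // 2

    print('MAX_HEAP_SIZE would be roughly %dMB' % default_heap_mb())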


Re: Cassandra memory consumption

2011-02-08 Thread Edward Capriolo
On Tue, Feb 8, 2011 at 4:56 PM, Victor Kabdebon
 wrote:
> I will do that in the future and I will post my results here ( I upgraded
> the server to debian 6 to see if there is any change, so memory is back to
> normal). I will report in a few days.
> In the meantime I am open to any suggestion...
>
> 2011/2/8 Aaron Morton 
>>
>> When you attach to the JVM with JConsole how much non heap memory and how
>> much heap memory is reported on the memory tab?
>> Xmx controls the total size of the heap memory, which excludes the
>> permanent generation.
>> see
>>
>> http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#generation_sizing
>> and
>>
>> http://blogs.sun.com/jonthecollector/entry/presenting_the_permanent_generation
>> Total non-heap memory on a 0.7 box I have is around 27M. Your numbers seem
>> large but it would be interesting to know what the JVM is reporting.
>> Aaron
>> On 09 Feb, 2011, at 05:57 AM, Victor Kabdebon 
>> wrote:
>>
>> Information on the system :
>>
>> Debian 5
>> Jvm :
>> victor@testhost:~/database/apache-cassandra-0.6.6$ java -version
>> java version "1.6.0_22"
>> Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
>> Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
>>
>> RAM : 2Go
>>
>>
>> 2011/2/8 Victor Kabdebon 
>>>
>>> Sorry Jonathan :
>>>
>>> So most of this information was taken using the command :
>>>
>>> sudo ps aux | grep cassandra
>>>
>>> For the nodetool information it is :
>>>
>>> /bin/nodetool --host localhost --port 8081 info
>>>
>>>
>>> Regards,
>>>
>>> Victor K.
>>>
>>>
>>> 2011/2/8 Jonathan Ellis 
>>>
 I missed the part where you explained where you're getting your numbers
 from.


 On Tue, Feb 8, 2011 at 9:32 AM, Victor Kabdebon
  wrote:
 > It is really weird that I am the only one to have this issue.
 > I restarted Cassandra today and already the memory consumption is over
 > the
 > limit :
 >
 > root  1739  4.0 24.5 664968 494996 pts/4   SLl  15:51   0:12
 > /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
 > -XX:+UseConcMarkSweepGC
 > -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
 > -XX:MaxTenuringThreshold=1
 > -XX:CMSInitiatingOccupancyFraction=75
 > -XX:+UseCMSInitiatingOccupancyOnly
 > -XX:+HeapDumpOnOutOfMemoryError
 > -Dcom.sun.management.jmxremote.port=8081
 > -Dcom.sun.management.jmxremote.ssl=false
 > -Dcom.sun.management.jmxremote.authenticate=false
 > -Dstorage-config=bin/../conf -cp
 >
 > bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
 > org.apache.cassandra.thrift.CassandraDaemon
 >
 > It is really an annoying problem if we cannot really foresee memory
 > consumption.
 >
 > Best regards,
 > Victor K
 >
 > 2011/2/8 Victor Kabdebon 
 >>
 >> Dear all,
 >>
 >> Sorry to come back again to this point but I am really worried about
 >> Cassandra memory consumption. I have a single machine that runs one
 >> Cassandra server. There is almost no data on it but I see a crazy
 >> memory
 >> consumption and it doesn't care at all about the instructions...
 >> Note that I am not using mmap, but "Standard", I use also JNA (inside
 >> lib
 >> folder), i am running on debian 5 64 bits, so a pretty normal
 >> configuration.
 >> I also use Cassandra 0.6.8.
 >>
 >>
 >> Here are the informations I gathered on Cassandra :
 >>
 >> 105  16765  0.1 34.1 1089424 687476 ?  Sl   Feb02  14:58
 >> /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
 >> -XX:+UseConcMarkSweepGC
 >> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
 >> -XX:MaxTenuringThreshold=1
 >> -XX:CMSInitiatingOccupancyFraction=75
 >> -XX:+UseCMSInitiatingOccupancyOnly
 >> -XX:+HeapDumpOnOutOfMemoryError
 >> -Dcom.sun.management.jmxremote.port=8081
 >> -Dcom.sun.management.jmxremote.ssl=false
 >> -Dcom.sun.management.jmxremote.authenticate=false
 >> -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
 >>
 >> bin/../conf:bin/../build/classes:bin/../lib/antl

Re: Cassandra memory consumption

2011-02-08 Thread Victor Kabdebon
Yes I have, but I have to add that this is a server with so little data
(2.0 MB of text, roughly a book) that even if there were an overhead due
to those things it would be minimal.
I don't understand what's eating up all that memory; is it because Linux
has difficulty getting rid of used memory...? I really am puzzled. (By the
way, it is not an Amazon EC2 server; this is a dedicated server.)

Regards,
Victor K.

2011/2/8 Edward Capriolo 

> On Tue, Feb 8, 2011 at 4:56 PM, Victor Kabdebon
>  wrote:
> > I will do that in the future and I will post my results here ( I upgraded
> > the server to debian 6 to see if there is any change, so memory is back
> to
> > normal). I will report in a few days.
> > In the meantime I am open to any suggestion...
> >
> > 2011/2/8 Aaron Morton 
> >>
> >> When you attach to the JVM with JConsole how much non heap memory and
> how
> >> much heap memory is reported on the memory tab?
> >> Xmx controls the total size of the heap memory, which excludes the
> >> permanent generation.
> >> see
> >>
> >>
> http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#generation_sizing
> >> and
> >>
> >>
> http://blogs.sun.com/jonthecollector/entry/presenting_the_permanent_generation
> >> Total non-heap memory on a 0.7 box I have is around 27M. Your numbers
> seem
> >> large but it would be interesting to know what the JVM is reporting.
> >> Aaron
> >> On 09 Feb, 2011, at 05:57 AM, Victor Kabdebon  >
> >> wrote:
> >>
> >> Information on the system :
> >>
> >> Debian 5
> >> Jvm :
> >> victor@testhost:~/database/apache-cassandra-0.6.6$ java -version
> >> java version "1.6.0_22"
> >> Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> >> Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
> >>
> >> RAM : 2Go
> >>
> >>
> >> 2011/2/8 Victor Kabdebon 
> >>>
> >>> Sorry Jonathan :
> >>>
> >>> So most of this information was taken using the command :
> >>>
> >>> sudo ps aux | grep cassandra
> >>>
> >>> For the nodetool information it is :
> >>>
> >>> /bin/nodetool --host localhost --port 8081 info
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Victor K.
> >>>
> >>>
> >>> 2011/2/8 Jonathan Ellis 
> >>>
>  I missed the part where you explained where you're getting your
> numbers
>  from.
> 
> 
>  On Tue, Feb 8, 2011 at 9:32 AM, Victor Kabdebon
>   wrote:
>  > It is really weird that I am the only one to have this issue.
>  > I restarted Cassandra today and already the memory consumption is over
>  > the
>  > limit :
>  >
>  > root  1739  4.0 24.5 664968 494996 pts/4   SLl  15:51   0:12
>  > /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
>  > -XX:+UseConcMarkSweepGC
>  > -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
>  > -XX:MaxTenuringThreshold=1
>  > -XX:CMSInitiatingOccupancyFraction=75
>  > -XX:+UseCMSInitiatingOccupancyOnly
>  > -XX:+HeapDumpOnOutOfMemoryError
>  > -Dcom.sun.management.jmxremote.port=8081
>  > -Dcom.sun.management.jmxremote.ssl=false
>  > -Dcom.sun.management.jmxremote.authenticate=false
>  > -Dstorage-config=bin/../conf -cp
>  >
>  >
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
>  > org.apache.cassandra.thrift.CassandraDaemon
>  >
>  > It is really an annoying problem if we cannot really foresee memory
>  > consumption.
>  >
>  > Best regards,
>  > Victor K
>  >
>  > 2011/2/8 Victor Kabdebon 
>  >>
>  >> Dear all,
>  >>
>  >> Sorry to come back again to this point but I am really worried
> about
>  >> Cassandra memory consumption. I have a single machine that runs one
>  >> Cassandra server. There is almost no data on it but I see a crazy
>  >> memory
>  >> consumption and it doesn't care at all about the instructions...
>  >> Note that I am not using mmap, but "Standard", I use also JNA
> (inside
>  >> lib
>  >> folder), i am running on debian 5 64 bits, so a pretty normal
>  >> configuration.
>  >> I also use Cassandra 0.6.8.
>  >>
>  >>
>  >> Here are the informations I gathered on Cassandra :
>  >>
> >>

Re: Best way to detect/fix bitrot today?

2011-02-08 Thread Peter Schuller
> I should have clarified we have 3 copies, so in that case as long as 2 match
> we should be ok?

As far as I can think of, no. Whatever the reconciliation of two
columns results in, is what the cluster is expected to converge to. So
in the case of identical keys and mismatched values, tie breaking is
the deciding factor. There is no "global" comparison/voting process
between all nodes in the replica set for the row.
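
A simplified model of that per-column reconciliation (a sketch only; the real logic lives in Cassandra's Column.reconcile, and tombstone handling is omitted). The assumption here is that a timestamp tie is broken by comparing the values, so two mismatched replicas still converge deterministically:

    # A column version is modelled as a (timestamp, value) pair.
    def reconcile(a, b):
        # Highest timestamp wins; on a tie, the greater value wins.
        return a if (a[0], a[1]) >= (b[0], b[1]) else b

    print(reconcile((10, 'x'), (10, 'y')))  # (10, 'y'): value breaks the tie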

> Even if there were checksumming at the SSTable level, I assume it has to
> check and report these errors on compaction (or node repair)?

I believe that it would work minimally to just skip it (regular repair
would do the rest). However that said, there may be reasons to want to
do more. For example, if you have a cluster where you are relying on
QUORUM consistency then silently dropping data actively discovered to
be corrupt could be considered a consistency violation.

> I have seen some JIRA open on these issues (47 and 1717), but if I need
> something today, a read repair ( or a node repair) is the only viable
> option?

repair is needed anyway (unless your use case is very unusual; no
deletes, no updates to pre-existing rows). But again to be clear,
neither repair nor read repair is primarily intended to address
arbitrary data corruption but rather to reach eventual consistency in
the cluster (after writes were dropped, a node went down, etc).

-- 
/ Peter Schuller


ApplicationState Schema has drifted from DatabaseDescriptor

2011-02-08 Thread Aaron Morton
I noticed this after I upgraded one node in a 0.7 cluster of 5 to the latest stable 0.7 build "2011-02-08_20-41-25" (the upgraded node was jb-cass1 below). This is a long email; you can jump to the end and help me out by checking something on your 0.7 cluster.

This is the value from o.a.c.gms.FailureDetector.AllEndpointStates on jb-cass05 (192.168.114.67):

/192.168.114.63
  X3:2011-02-08_20-41-25
  SCHEMA:2f555eb0-3332-11e0-9e8d-c4f8bbf76455
  LOAD:2.84182972E8
  STATUS:NORMAL,0
/192.168.114.64
  SCHEMA:2f555eb0-3332-11e0-9e8d-c4f8bbf76455
  LOAD:2.84354156E8
  STATUS:NORMAL,34028236692093846346337460743176821145
/192.168.114.66
  SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455
  LOAD:2.59171601E8
  STATUS:NORMAL,102084710076281539039012382229530463435
/192.168.114.65
  SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455
  LOAD:2.70907168E8
  STATUS:NORMAL,68056473384187692692674921486353642290
jb08.wetafx.co.nz/192.168.114.67
  SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455
  LOAD:1.155260665E9
  STATUS:NORMAL,136112946768375385385349842972707284580

Notice the schema for nodes 63 and 64 starts with 2f55 and for 65, 66 and 67 it starts with 075.

This is the output from pycassa calling describe_versions when connected to both the 63 (jb-cass1) and 67 (jb-cass5) nodes:

In [34]: sys.describe_schema_versions()
Out[34]: {'2f555eb0-3332-11e0-9e8d-c4f8bbf76455': ['192.168.114.63',
                                                   '192.168.114.64',
                                                   '192.168.114.65',
                                                   '192.168.114.66',
                                                   '192.168.114.67']}

It's reporting all nodes on the 2f55 schema. The SchemaCheckVerbHandler is getting the value from DatabaseDescriptor; the FailureDetector MBean is getting them from Gossiper.endpointStateMap. Requests are working though, so the CF ids must be matching up.

Commit https://github.com/apache/cassandra/commit/ecbd71f6b4bb004d26e585ca8a7e642436a5c1a4 added code to the 0.7 branch in the HintedHandOffManager to check the schema versions of nodes it has hints for. This is now failing on the new node as follows...

ERROR [HintedHandoff:1] 2011-02-09 16:11:23,559 AbstractCassandraDaemon.java (line org.apache.cassandra.service.AbstractCassandraDaemon$1.uncaughtException(AbstractCassandraDaemon.java:114)) Fatal exception in thread Thread[HintedHandoff:1,1,main]
java.lang.RuntimeException: java.lang.RuntimeException: Could not reach schema agreement with /192.168.114.64 in 60000ms
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: Could not reach schema agreement with /192.168.114.64 in 60000ms
        at org.apache.cassandra.db.HintedHandOffManager.waitForSchemaAgreement(HintedHandOffManager.java:256)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:267)
        at org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:88)
        at org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:391)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        ... 3 more

(the nodes can all see each other, checked with nodetool during the 60 seconds)

If I restart one of the nodes with the 075 schema (without upgrading it) it reads the schema from the system tables and goes back to the 2f55 schema. E.g. the 64 node was also on the 075 schema; I restarted and it moved to the 2f55 and logged appropriately. While writing this email I checked again with the 65 node, and the schema it was reporting to other nodes changed after a restart from 075 to 2f55:

INFO [main] 2011-02-09 17:17:11,457 DatabaseDescriptor.java (line org.apache.cassandra.config.DatabaseDescriptor) Loading schema version 2f555eb0-3332-11e0-9e8d-c4f8bbf76455

I've been reading the code for migrations and gossip but don't have a theory as to what is going on.

REQUEST FOR HELP: If you have a 0.7 cluster can you please check if this has happened, so I can know whether this is a real problem or just an Aaron problem. You can check by...
- getting the values from the o.a.c.gms.FailureDetector.AllEndPointStates
- running describe_schema_versions via the API; here is how to do it via pycassa: http://pycassa.github.com/pycassa/api/pycassa/system_manager.html?highlight=describe_schema_versions
- checking that the schema ids from the failure detector match the result from describe_schema_versions()
- if they do not match, can you also include some info on what sort of schema changes have happened on the box.

Thanks
Aaron
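
If it helps anyone run the check above, a minimal pycassa sketch (the host is a placeholder; point it at any node in the cluster):

    from pycassa.system_manager import SystemManager

    sys_mgr = SystemManager('192.168.114.63:9160')
    # Maps each schema UUID to the list of nodes reporting it; more than
    # one key means the cluster disagrees about the schema.
    print(sys_mgr.describe_schema_versions())
    sys_mgr.close()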

Re: Best way to detect/fix bitrot today?

2011-02-08 Thread Peter Schuller
> One thing that we're doing for (guaranteed) immutable data is to use MD5
> signatures as keys... this will also prevent duplication, and it will make
> detection (if not correction) of bitrot at the app level easy.

Yes. Another option is to checksum keys and/or values themselves by
effectively encoding each in a self-verifying format. But that makes
the data a lot more opaque to tools/humans.

Also consider that arbitrary data corruption could have other effects
than modifying a value or a key. I'm not sure the row skipping on
deserialization issues is good enough to handle absolutely arbitrary
corruption (anyone?).

-- 
/ Peter Schuller
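
A minimal sketch of that MD5-keyed scheme for immutable values (pycassa assumed; the pool and column family names are placeholders):

    import hashlib
    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    blobs = pycassa.ColumnFamily(pool, 'Blobs')

    def put_blob(value):
        key = hashlib.md5(value).hexdigest()  # content-addressed key
        blobs.insert(key, {'data': value})    # duplicates collapse for free
        return key

    def get_blob(key):
        value = blobs.get(key, columns=['data'])['data']
        # Detect (but not correct) bitrot: the value must hash back to its key.
        if hashlib.md5(value).hexdigest() != key:
            raise IOError('bitrot detected for key %s' % key)
        return value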


regarding space taken by different column families in Cassandra

2011-02-08 Thread abhinav prakash rai
I am using 4 column families in my application. The cfstats results for
space taken by the different CFs are as below:

CF1 - Space used (live):   7196159547
      Space used (total): 14214373706
CF2 - Space used (live):   2456495851
      Space used (total):  9065746112
CF3 - Space used (live):   2864007861
      Space used (total):  6114084611
CF4 - Space used (live):   1531088094
      Space used (total):  3433016989

Whereas I can see the total size of the data directory is 17GB, which is not
equal to the sum of Space used (total) across the above 4 column families. If
I assume Space used (total) is in bytes, the sum comes to about 32 GB, which
is not the space taken by data_file_directories.

Can someone help me find out how much space is used by each CF?

I am using replication_factor= 1.

Regards,
abhinav
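
For what it's worth, summing the figures as posted (assuming they are bytes):

    live = [7196159547, 2456495851, 2864007861, 1531088094]
    total = [14214373706, 9065746112, 6114084611, 3433016989]

    print(sum(live))   # 14047751353 bytes, roughly 13.1 GiB live
    print(sum(total))  # 32827221418 bytes, roughly 30.6 GiB total

So the live sum is below the 17GB seen on disk and the total sum is above it, which matches the observation in the question.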