Re: Storing big objects into columns

2011-01-14 Thread Peter Schuller
 In a project I would like to store big objects in columns, serialized. For
 example entire images (several KB to several MB), flash animations (several
 MB), etc...
 Does anyone use Cassandra with such relatively big columns and if yes, does
 it work well? Are there any drawbacks to using this method?

Not in production, but I've done testing with values on the order of
a few megs. Expect compaction to be entirely disk bound rather than
CPU bound. Make sure latency is acceptable even when data sizes grow
beyond memory size.

-- 
/ Peter Schuller


Re: Usage Pattern : "unique" value of a key.

2011-01-14 Thread Oleg Anastasyev
 
 You're right when you say it's unlikely that 2 threads have the same
 timestamp, but it can happen. So it could work for user creation, but maybe
 not for a more write-intensive problem.

Um, sorry, I thought you were solving the exact case of duplicate user creation. If
you're trying to solve concurrent updates to cassandra in general, consider
using zookeeper. By the way, the lock algorithm in zookeeper is very much like you
described - but zookeeper is the right tool for this job.

 
 Moreover, we cannot rely on fully time-synchronized nodes in the
 cluster (only on nodes synchronized to within a few ms), so a second node could
 theoretically write a smaller timestamp after the first node.

This is not a problem - that node will then lose the race - cassandra will
ignore updates with a timestamp older than the timestamp of the current value.

 An even worse case could be the one illustrated here
 (http://noisette.ch/cassandra/cassandra_unique_key_pattern.png) :
 nodes are synchronized, but something goes wrong (slow) during the
 write, then both nodes think the key belongs to them.
 So my idea of writing a lock is not well suitable

It is, with the following modification - when either user performs
write{K, lock A}, it passes the timestamp recorded earlier - at the
moment of performing the very 1st read of K.

So the scenario for user A is:
1. record current timestamp from machine clock - T1
2. read K; K does not exist
3. write{K, lock A, timestamp = T1}
3.1 cassandra sees no current value in the memtable for K - the write succeeds.
cassandra records the timestamp of the value K,A to be T1
4. read K, check that the lock is A (for your original solution) or that the
returned data timestamp == T1 (for the one proposed by me)

Then user B's scenario would be:
1. record current timestamp from machine clock. Its value is T0, which is < T1.
2. read K; K does not exist
3. slowness on: pause for a couple of (milli)seconds, GCing or drinking
coffee, so user A executes its scenario above
4. slowness off: write{K, lock B, timestamp = T0}
4.1 on the cassandra side, this write will be ignored, because the current
timestamp of K is T1, which is later than T0
5. read K, see lock == A instead of B (in your original solution), or timestamp
!= T0 (in mine).
6. user B understands it has lost the race - do something with it.

Of course this scenario will only work if clocks are synced and the variation
between the machine clocks of users A and B is much less than the duration of
the write-read roundtrip. In practice this means that you'll need to introduce
delays of 50-100ms between the write and the last read for both users. So this
could work if a 100ms delay is acceptable.

Otherwise use zookeeper. At least until cassandra has version vector
support implemented.
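
To make the race above concrete, here is a small self-contained Java sketch of
the timestamp rule; the in-memory map stands in for cassandra's
"highest timestamp wins" handling of a column, and all the names are
hypothetical:

import java.util.concurrent.ConcurrentHashMap;

public class TimestampLockDemo {
    static class Cell {
        final String owner; final long ts;
        Cell(String owner, long ts) { this.owner = owner; this.ts = ts; }
    }

    // one "column": the value with the highest timestamp wins, as in cassandra
    static final ConcurrentHashMap<String, Cell> store = new ConcurrentHashMap<>();

    // write{K, lock owner, timestamp = t}, then read back to see who won
    static boolean tryLock(String key, String owner, long t) {
        store.merge(key, new Cell(owner, t),
                (cur, nw) -> nw.ts > cur.ts ? nw : cur); // older timestamps are ignored
        Cell winner = store.get(key);
        return winner.owner.equals(owner) && winner.ts == t;
    }

    public static void main(String[] args) {
        // both users read K (absent) and record their clocks first: T0 < T1
        long tB = 99, tA = 100;
        // A writes first; slow B writes later with the older timestamp and loses
        System.out.println("A wins: " + tryLock("K", "A", tA)); // true
        System.out.println("B wins: " + tryLock("K", "B", tB)); // false
    }
}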

Umpf, that was a long story ;-)



RE: about the data directory

2011-01-14 Thread raoyixuan (Shandy)
Thanks very much

-Original Message-
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: Friday, January 14, 2011 4:40 PM
To: user@cassandra.apache.org
Subject: Re: about the data directory

 As an administrator, I want to know why I can read the data from any node,
 since the data is only kept as replicas on some of the nodes. Can you tell me? Thanks in advance.

It's part of the point of Cassandra. You talk to the cluster, period.
It's Cassandra's job to keep track of where data lives, and client
applications don't care. This is a fundamental design goal.

-- 
/ Peter Schuller


Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Roshan Dawrani
It's possible that I am misunderstanding the question in some way.

The row keys can be TimeUUIDs, and with those row keys as column names, you
can use the TimeUUIDType comparator to have them sorted by time automatically.
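
For example, a minimal sketch in the 0.7 CLI (the column family name Timeline
is hypothetical):

create column family Timeline with comparator = 'TimeUUIDType'

Columns written to a Timeline row with TimeUUID names then come back
time-ordered from a slice query.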

On Fri, Jan 14, 2011 at 9:18 AM, Aaron Morton aa...@thelastpickle.comwrote:

 You could make the time a fixed-width integer and prefix your row keys
 with it, then set the comparator to ascii or utf8.

 Some issues:
 - Will you have time collisions ?
 - Not sure what you are storing in the super columns, but there are
 limitations: http://wiki.apache.org/cassandra/CassandraLimitations
 - If you are using cassandra 0.7, have you looked at the secondary indexes ?
 http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes

 If you provide some more info on the problem you're trying to solve we may be
 able to help some more.

 Cheers
 Aaron


 On 14 Jan, 2011,at 04:27 PM, Aklin_81 asdk...@gmail.com wrote:

 I would like to keep the reference of other rows as names of super
 column and sort those super columns according to time.
 Is there any way I could implement that ?

 Thanks in advance!




-- 
Roshan
Blog: http://roshandawrani.wordpress.com/
Twitter: @roshandawrani http://twitter.com/roshandawrani
Skype: roshandawrani



Re: Timeout Errors while running Hadoop over Cassandra

2011-01-14 Thread Jairam Chandar
The cassandra logs strangely show no errors at the time of failure.
Changing RpcTimeoutInMillis seemed to help. Though it slowed down the
job considerably, it seems to finish with the timeout value set
to 1 min. Unfortunately, I cannot be sure it will continue to work if
the data increases further. Hopefully we will be upgrading to the recently
released final version of 0.7.0.
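
For reference, a minimal sketch of that knob in 0.6's storage-conf.xml,
using the 1-minute value mentioned above:

<!-- storage-conf.xml: raise the RPC timeout to 1 minute -->
<RpcTimeoutInMillis>60000</RpcTimeoutInMillis>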

Thanks for all the help and suggestions.

Warm regards,
Jairam Chandar

On 13/01/2011 14:47, Jeremy Hanna jeremy.hanna1...@gmail.com wrote:

On Jan 12, 2011, at 12:40 PM, Jairam Chandar wrote:

 Hi folks,
 
 We have a Cassandra 0.6.6 cluster running in production. We want to run
Hadoop (version 0.20.2) jobs over this cluster in order to generate
reports. 
 I modified the word_count example in the contrib folder of the
cassandra distribution. While the program is running fine for small
datasets (in the order of 100-200 MB) on small clusters (2 machines), it
starts to give errors while trying to run on a bigger cluster (5
machines) with much larger dataset (400 GB). Here is the error that we
get - 
 
 java.lang.RuntimeException: TimedOutException()
 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:186)
 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:236)
 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:104)
 at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
 at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:98)
 at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
 at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: TimedOutException()
 at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11094)
 at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:628)
 at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:602)
 at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:164)
 ... 11 more
 

I wonder if messing with RpcTimeoutInMillis in storage-conf.xml would
help. 

 
 
 
 I came across this page on the Cassandra wiki -
http://wiki.apache.org/cassandra/HadoopSupport and tried modifying the
ulimit and changing batch sizes. These did not help. Though the number
of successful map tasks increased, it eventually fails since the total
number of map tasks is huge.
 
 Any idea on what could be causing this? The program we are running is a
very slight modification of the word_count example with respect to
reading from Cassandra. The only change being specific keyspace,
columnfamily and columns. The rest of the code for reading is the same
as the word_count example in the source code for Cassandra 0.6.6.
 
 Thanks and regards,
 Jairam Chandar





Different comparator types for column and supercolumn don't work

2011-01-14 Thread Karin Kirsch
Hello,

I'm new to cassandra. I'm using cassandra release 0.7.0 (local, single node). I
can't perform write operations when the column and supercolumn families have
different comparator types. For example, if I use the code given in issue
https://issues.apache.org/jira/browse/CASSANDRA-1712 by Jonathan Ellis in the
CLI, I get the following output:

[default@Keyspace1] create keyspace KS1
8bb2fc2d-1fcb-11e0-add0-a9c93d38c544
[default@Keyspace1] use KS1
Authenticated to keyspace: KS1
[default@KS1] create column family CFCli with column_type= 'Super' and 
comparator= 'LongType' and subcomparator='UTF8Type'
97742bbe-1fcb-11e0-add0-a9c93d38c544
[default@KS1] set CFCli['newrow'][1234567890]['column'] = 'value'
'column' could not be translated into a LongType.

I also tried a setup with the example keyspace included with the release (loaded via the
StorageService bean's loadSchemaFromYAML method):

   ColumnFamily: Super3 (Super)
   A column family with supercolumns, whose column names are Longs (8 bytes)
 Columns sorted by: 
org.apache.cassandra.db.marshal.LongType/org.apache.cassandra.db.marshal.BytesType
 Subcolumns sorted by: org.apache.cassandra.db.marshal.LongType
 Row cache size / save period: 0.0/0
 Key cache size / save period: 20.0/3600
 Memtable thresholds: 0.2953125/63/60
 GC grace seconds: 864000
 Compaction min/max thresholds: 4/32

CLI output:

[default@Keyspace1] set Super3['account_value']['1:1'][1234567890] = 'value1'   
 
A long is exactly 8 bytes: 3
[default@Keyspace1] set Super3['account_value'][1234567890]['test'] = 'value1'
'test' could not be translated into a LongType.
[default@Keyspace1] set Super3['account_value'][1234567890][1234567890] = 
'value1'
A long is exactly 8 bytes: 10
[default@Keyspace1] set Super3[1234567890][1234567890][1234567890] = 'value1' 
Syntax error at position 11: mismatched input '1234567890' expecting set null
[default@Keyspace1] set Super3['account_value']['test'][1234567890] = 'value1'
A long is exactly 8 bytes: 4
[default@Keyspace1] set Super3[1234567890]['test']['column'] = 'value1'   
Syntax error at position 11: mismatched input '1234567890' expecting set null


According to the CLI help the format is: set cf['key']['super']['col']
= value, so the errors generated seem weird to me. What am I doing wrong?


Thanks in advance,

Kind regards,

Karin

Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Aklin_81
@Roshan
Yes, I thought about that, but then I wouldn't be able to use the
Random Partitioner.

@Aaron

Do you mean like this: 'timeUUID + row_key' as the supercolumn names?
Then, when retrieving the row_key from this column name, will I be
required to parse the name? How do I do that exactly?


Some issues:
- Will you have time collisions ?
Mostly I won't be having time collisions. If they happen in 1% of cases,
I don't mind.

- Not sure what you are storing in the super columns, but there are
limitations.
I would be storing a maximum of 5 subcolumns inside and would be retrieving
them all together.

- If you are using cassandra 0.7, have you looked at the secondary indexes ?
Yes I did, but I think they are not helpful in my case.

This is what I am trying to do :
**
This is from an older post that I made earlier on the mailing list:-
I am working on a project of Questions/answers forum that allows a
user to follow questions on certain topics from his followies.
I want to build user's news-feed that comprises of only those
questions that have been posted by his followies  tagged on the
topics that he is following.
Simple news-feed design that shows all the posts from network would be
easy to design using Cassandra by executing fast writes to all
followers of a user about the post from user. But for my application,
there is an additional filter of 'followed topics', (ie, the user
receives posts created by his followies  on topics user is
following)

I was thinking of implementing it this way:
Initially, write to all followers the postIDs of posts from their
network, by adding a supercolumn to the rows of all followers in the
News-feed supercolumnfamily, with the supercolumn name as a timestamp (for
sorting by time) and 5 sub-columns containing the topic tags of that
post.
At read time, compare the subcolumn values with the topics the user is
following; if they match, then show the post. (I would be required to
fetch the list of the user's followed topics at read time - hence
should I store the topic list as a supercolumn in this Newsfeed
supercolumnfamily only?)

An important point to note is that, often, a post will have zero
subcolumns, which means that the post has to be shown without
validating against the user's list of followed topics.

There is another view for the users which allows them to see all the
posts from their followies (without topic filters). In this case no
checking of subcolumns for topics will be performed.

I got good insights from Tyler on this, but he was recommending an
approach which, although beneficial for read performance, involves
very heavy denormalization, like 70-80x. I currently fear that
approach and would like to test this one.
**
any comments, feedback greatly appreciated..

thanks so much!

On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote:
 It's possible that I am misunderstanding the question in some way.

 The row keys can be TimeUUIDs, and with those row keys as column names, you
 can use the TimeUUIDType comparator to have them sorted by time automatically.

 On Fri, Jan 14, 2011 at 9:18 AM, Aaron Morton
 aa...@thelastpickle.comwrote:

 You could make the time a fixed-width integer and prefix your row keys
 with it, then set the comparator to ascii or utf8.

 Some issues:
 - Will you have time collisions ?
 - Not sure what you are storing in the super columns, but there are
 limitations: http://wiki.apache.org/cassandra/CassandraLimitations
 - If you are using cassandra 0.7, have you looked at the secondary indexes ?
 http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes

 If you provide some more info on the problem you're trying to solve we may be
 able to help some more.

 Cheers
 Aaron


 On 14 Jan, 2011,at 04:27 PM, Aklin_81 asdk...@gmail.com wrote:

 I would like to keep the reference of other rows as names of super
 column and sort those super columns according to time.
 Is there any way I could implement that ?

 Thanks in advance!




 --
 Roshan
 Blog: http://roshandawrani.wordpress.com/
 Twitter: @roshandawrani http://twitter.com/roshandawrani
 Skype: roshandawrani




Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Roshan Dawrani
On Fri, Jan 14, 2011 at 7:15 PM, Aklin_81 asdk...@gmail.com wrote:

 @Roshan
 Yes, I thought about that, but then I wouldn't be able to use the
 Random Partitioner.


Can you please expand a bit on this? What is this restriction? Can you point
me to some relevant documentation on this?

Thanks.


Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Rajkumar Gupta
I am not sure, but I guess it's because all the rows of a certain time range will go
to just one node & will not be evenly distributed, because the timeUUID will
not be random but sequential according to time... I am not sure anyway...

On Fri, Jan 14, 2011 at 7:18 PM, Roshan Dawrani roshandawr...@gmail.comwrote:

 On Fri, Jan 14, 2011 at 7:15 PM, Aklin_81 asdk...@gmail.com wrote:

 @Roshan
 Yes, I thought about that, but then I wouldn't be able to use the
 Random Partitioner.


 Can you please expand a bit on this? What is this restriction? Can you
 point me to some relevant documentation on this?

 Thanks.




Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Aklin_81
I too believed so, but am not totally sure.

On 1/14/11, Rajkumar Gupta rajkumar@gmail.com wrote:
 I am not sure, but I guess it's because all the rows of a certain time range will go
 to just one node & will not be evenly distributed, because the timeUUID will
 not be random but sequential according to time... I am not sure anyway...

 On Fri, Jan 14, 2011 at 7:18 PM, Roshan Dawrani
 roshandawr...@gmail.comwrote:

 On Fri, Jan 14, 2011 at 7:15 PM, Aklin_81 asdk...@gmail.com wrote:

 @Roshan
 Yes, I thought about that, but then I wouldn't be able to use the
 Random Partitioner.


 Can you please expand a bit on this? What is this restriction? Can you
 point me to some relevant documentation on this?

 Thanks.





Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Roshan Dawrani
I am not clear what you guys are trying to do and say :-)

So, let's take some specifics...

Say you want to create rows in some column family (say CF_A), and as you
create them, you want to store their row key in column names in some other
column family (say CF_B) - possibly for filtering keys based on time later,
etc, etc...

Now your rows in CF_A may be keyed on a TimeUUID and if you store these keys
as column names in CF_B that has comparator as TimeUUID, then you get your
column names time sorted automatically.

Now CF_A may be split across nodes - is that of any concern to you?

Are you expecting any storage relationship between column names of CF_B and
rows of CF_A?
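
As a minimal sketch of that layout in the 0.7 CLI (the CF names are the
placeholders from above):

create column family CF_A with comparator = 'BytesType'
create column family CF_B with comparator = 'TimeUUIDType'

Rows go into CF_A keyed by a TimeUUID; each key is also written as a column
name into the relevant CF_B row, where the TimeUUIDType comparator keeps the
column names time-sorted.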

rgds,
Roshan

On Fri, Jan 14, 2011 at 7:58 PM, Aklin_81 asdk...@gmail.com wrote:

 I too believed so!  but not totally sure.

 On 1/14/11, Rajkumar Gupta rajkumar@gmail.com wrote:
  I am not sure, but I guess it's because all the rows of a certain time range
  will go to just one node & will not be evenly distributed, because the
  timeUUID will not be random but sequential according to time... I am not
  sure anyway...
 




Problem starting Cassandra on Ubuntu

2011-01-14 Thread kh jo
Hi,

I just installed Cassandra on Ubuntu using the package manager,

but I cannot start it.

I get the following error in the logs:

 INFO [main] 2011-01-14 15:37:49,758 AbstractCassandraDaemon.java (line 74) 
Heap size: 1051525120/1051525120
 WARN [main] 2011-01-14 15:37:49,826 CLibrary.java (line 73) Obsolete version 
of JNA present; unable to read errno. Upgrade to JNA 3.2.7 or later
 WARN [main] 2011-01-14 15:37:49,827 CLibrary.java (line 73) Obsolete version 
of JNA present; unable to read errno. Upgrade to JNA 3.2.7 or later
 WARN [main] 2011-01-14 15:37:49,827 CLibrary.java (line 105) Unknown mlockall 
error 0
 INFO [main] 2011-01-14 15:37:49,841 DatabaseDescriptor.java (line 121) Loading 
settings from file:/etc/cassandra/cassandra.yaml
ERROR [main] 2011-01-14 15:37:49,965 DatabaseDescriptor.java (line 388) Fatal 
error: null; mapping values are not allowed here




  

Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Aklin_81
I just read that cassandra internally creates an md5 hash that is used
for distributing the load, by sending a key to the node responsible for the
range within which that md5 hash falls. So even when we create
sequential keys, their MD5 hashes are not sequential & hence they are not
sent to the same node. This was my misunderstanding of the concept.
Sorry for creating confusion !

So.. with this, I think I will be able to use a timeUUID as the row key !?
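
A small self-contained Java illustration of why sequential-looking keys still
spread out under the RandomPartitioner - this mimics the idea of the MD5
token, not Cassandra's exact token arithmetic:

import java.math.BigInteger;
import java.security.MessageDigest;

public class Md5TokenDemo {
    public static void main(String[] args) throws Exception {
        // Keys that are sequential as strings hash to wildly different
        // tokens, so they land on different nodes in the ring.
        for (String key : new String[] { "key1", "key2", "key3" }) {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            BigInteger token = new BigInteger(md5.digest(key.getBytes("UTF-8"))).abs();
            System.out.println(key + " -> " + token);
        }
    }
}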

Aaron, if you could kindly share your views on my response to your
queries above.




On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote:
 I am not clear what you guys are trying to do and say :-)

 So, let's take some specifics...

 Say you want to create rows in some column family (say CF_A), and as you
 create them, you want to store their row key in column names in some other
 column family (say CF_B) - possibly for filtering keys based on time later,
 etc, etc...

 Now your rows in CF_A may be keyed on a TimeUUID and if you store these keys
 as column names in CF_B that has comparator as TimeUUID, then you get your
 column names time sorted automatically.

 Now CF_A may be split across nodes - is that of any concern to you?

 Are you expecting any storage relationship between column names of CF_B and
 rows of CF_A?

 rgds,
 Roshan

 On Fri, Jan 14, 2011 at 7:58 PM, Aklin_81 asdk...@gmail.com wrote:

 I too believed so!  but not totally sure.

 On 1/14/11, Rajkumar Gupta rajkumar@gmail.com wrote:
  I am not sure, but I guess it's because all the rows of a certain time range
  will go to just one node & will not be evenly distributed, because the
  timeUUID will not be random but sequential according to time... I am not
  sure anyway...
 





Re: limiting columns in a row

2011-01-14 Thread Sylvain Lebresne
Hi,

 does this seem like a generally useful feature?

I do think this could be a useful feature. If only because I don't think
there is any satisfactory/efficient way to do this client side.

 if so, would it be hard to implement (maybe it could be done at compaction
 time like the TTL feature)?

Out of the top of my hat (aka, I haven't really thought that through but I'll
still give my opinion), I see the following difficulties:
  1) You can only do this limiting during major compactions, or in the same
     cases as CASSANDRA-1074 for minor ones, since you need to make sure the x
     columns you are keeping are not deleted ones. Or you'll want to disable
     deletes altogether on the cf with this 'limit' option (I feel like this
     last option would really simplify things).
  2) Even if the removal of the columns exceeding the limit is eventual (and
     it will be), you'll want queries to only ever return columns inside the
     limit (otherwise the feature would be too unpredictable). But I think
     this will be quite challenging. That is, slice queries from the start of
     the row are easy. Everything else is harder (at least if you want to make
     it efficient).

That was my 2 cents. Anyway, you can always open a JIRA ticket.
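
For concreteness, here is a small self-contained Java sketch of the
"keep only the newest N columns" semantics being discussed, modelled on an
in-memory sorted map; against a real cluster this becomes a read-then-delete
round trip per row, which is exactly what makes a client-side version
unsatisfying:

import java.util.NavigableMap;
import java.util.TreeMap;

public class RowTrimDemo {
    // Keep only the newest n entries of a row sorted by timestamp.
    static void trimToNewest(NavigableMap<Long, String> row, int n) {
        while (row.size() > n) {
            row.pollFirstEntry(); // drop the oldest column
        }
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> row = new TreeMap<>();
        for (long ts = 1; ts <= 10; ts++) row.put(ts, "sample-" + ts);
        trimToNewest(row, 3);
        System.out.println(row); // {8=sample-8, 9=sample-9, 10=sample-10}
    }
}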

--
Sylvain


On Fri, Jan 14, 2011 at 7:38 AM, mike dooley doo...@apple.com wrote:

 hi,

 the time-to-live feature in 0.7 is very nice and it made me want to ask
 about
 a somewhat similar feature.

 i have a stream of data consisting of entities and associated samples.  so
 i create
 a row for each entity and the columns in each row contain the samples for
 that entity.
 when i get around to processing an entity i only care about the most
 recent N samples.
 so i read the most recent N columns and delete all the rest.

 what i would like would be a column family property that allows me to
 specify a maximum number of columns per row.  then i could just keep
 writing
 and not have to do the deletes.

 in my case it would be fine if the limit is only 'eventually' applied (so
 that
 sometimes there might be extra columns).

 does this seem like a generally useful feature?  if so, would it be hard to
 implement (maybe it could be done at compaction time like the TTL feature)?

 thanks,
 -mike


live data migration from mysql to cassandra

2011-01-14 Thread ruslan usifov
Hello

Dear community, please share your experience: how do you make a live (without
stopping) migration from mysql or another RDBMS to cassandra?


Re: Is there any way I could use keys of other rows as column names that could be sorted according to time ?

2011-01-14 Thread Aklin_81
No,  you do not need to shut up, please! :)
you may be clearing up my further misconceptions on the topic!

Anyways, the link b/w the 1st and 2nd paras was that since the distribution of
rows among nodes is not determined by the key itself (as you rightly said)
but by the md5 hash of the key, I can use just any key, including a
timeUUIDType key (which would be helpful in my case), with the
RandomPartitioner.



On 1/14/11, Roshan Dawrani roshandawr...@gmail.com wrote:
 On Fri, Jan 14, 2011 at 8:51 PM, Aklin_81 asdk...@gmail.com wrote:

 I just read that cassandra internally creates a md5 hash that is used
 for distributing the load by sending it to a node reponsible for the
 range within which that md5 hash falls, so even when we create
 sequential keys, their MD5 hash is not the same  hence they are not
 sent to same node. This was my misunderstanding of this concept.
 Sorry for creating confusions !

 So.. with this I think I will be able to use timeUUID as row key !?


 Now, what really is the link between your corrected understanding and the
 conclusion in the 2nd para? :-)

 I miss the link you are using to come from para 1 to para 2.

 Just because you use a time UUID as the row key, there is no storage guarantee
 because of that. Distribution and ordering of rows across nodes is only
 based on what partitioner you are using - it is not (only) related to
 the type of the key.

 Maybe I should just shut up now as I don't seem to be understanding your
 requirement :-)










Re: live data migration from mysql to cassandra

2011-01-14 Thread Edward Capriolo
On Fri, Jan 14, 2011 at 10:40 AM, ruslan usifov ruslan.usi...@gmail.com wrote:
 Hello

 Dear community please share your experience, home you make live(without
 stop) migration from mysql or other RDBM to cassandra


There is no built-in way to do this. I remember hearing at hadoop
world this year that the hbase guys have a system to read mysql slave
logs and replay them into hbase. Since all the nosql community seems to do
this, maybe we can 'borrow' the idea.

Edward


Do you have a site in production environment with Cassandra? What client do you use?

2011-01-14 Thread Ertio Lew
Hey,

If you have a site in a production environment or are considering it, what
is the client that you use to interact with Cassandra? I know that
there are several clients available out there depending on the
language you use, but I would love to know which clients are being used
widely in production environments and are best to work with (supporting
most of the features required for performance).

Also preferably tell about the technology stack for your applications.

Any suggestions, comments appreciated ?

Thanks
Ertio


Re: cassandra row cache

2011-01-14 Thread Mike Malone
Digest reads could be being dropped..?

On Thu, Jan 13, 2011 at 4:11 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Thu, Jan 13, 2011 at 2:00 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
  Is it possible that your are reading at READ.ONE and that READ.ONE
  only warms cache on 1 of your three nodes= 20. 2nd read warms another
  60%, and by the third read all the replicas are warm? 99% ?
 
  This would be true if digest reads were not warming caches.

 Digest reads do go through the cache path.

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com



Re: cassandra row cache

2011-01-14 Thread Jonathan Ellis
That's possible, yes.  He'd want to make sure there aren't any of
those WARN messages in the logs.

On Fri, Jan 14, 2011 at 11:46 AM, Mike Malone m...@simplegeo.com wrote:
 Digest reads could be being dropped..?

 On Thu, Jan 13, 2011 at 4:11 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Thu, Jan 13, 2011 at 2:00 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
  Is it possible that your are reading at READ.ONE and that READ.ONE
  only warms cache on 1 of your three nodes= 20. 2nd read warms another
  60%, and by the third read all the replicas are warm? 99% ?
 
  This would be true if digest reads were not warming caches.

 Digest reads do go through the cache path.

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Do you have a site in production environment with Cassandra? What client do you use?

2011-01-14 Thread Ran Tavory
I use Hector, if that counts...
On Jan 14, 2011 7:25 PM, Ertio Lew ertio...@gmail.com wrote:
 Hey,

 If you have a site in production environment or considering so, what
 is the client that you use to interact with Cassandra. I know that
 there are several clients available out there according to the
 language you use but I would love to know what clients are being used
 widely in production environments and are best to work with(support
 most required features for performance).

 Also preferably tell about the technology stack for your applications.

 Any suggestions, comments appreciated ?

 Thanks
 Ertio


Re: Do you have a site in production environment with Cassandra? What client do you use?

2011-01-14 Thread Ertio Lew
What technology stack do you use?

On 1/14/11, Ran Tavory ran...@gmail.com wrote:
 I use Hector,  if that counts. ..
 On Jan 14, 2011 7:25 PM, Ertio Lew ertio...@gmail.com wrote:
 Hey,

 If you have a site in production environment or considering so, what
 is the client that you use to interact with Cassandra. I know that
 there are several clients available out there according to the
 language you use but I would love to know what clients are being used
 widely in production environments and are best to work with(support
 most required features for performance).

 Also preferably tell about the technology stack for your applications.

 Any suggestions, comments appreciated ?

 Thanks
 Ertio



Re: Do you have a site in production environment with Cassandra? What client do you use?

2011-01-14 Thread Ran Tavory
Java
On Jan 14, 2011 8:25 PM, Ertio Lew ertio...@gmail.com wrote:
 what is the technology stack do you use?

 On 1/14/11, Ran Tavory ran...@gmail.com wrote:
 I use Hector, if that counts. ..
 On Jan 14, 2011 7:25 PM, Ertio Lew ertio...@gmail.com wrote:
 Hey,

 If you have a site in production environment or considering so, what
 is the client that you use to interact with Cassandra. I know that
 there are several clients available out there according to the
 language you use but I would love to know what clients are being used
 widely in production environments and are best to work with(support
 most required features for performance).

 Also preferably tell about the technology stack for your applications.

 Any suggestions, comments appreciated ?

 Thanks
 Ertio



phpcassa never return(infinite loop)?!!!

2011-01-14 Thread kh jo
I am trying to use phpcassa.

I use the following example:

CassandraConn::add_node('localhost', 9160);

$users = new CassandraCF('rhg', 'Users'); // ColumnFamily

$users->insert('1', array('email' => 't...@example.com', 'password' => 'test'));

When I run it, it never returns, and the apache process eats 100% CPU.
I am using cassandra 0.7

any idea why this happens?

thanks



  

Cassandra in less than 1G of memory?

2011-01-14 Thread Rajat Chopra
Hello.

According to the JVM heap size topic at
http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need
at least 1G of memory to run. Is it possible to have a running Cassandra cluster
with machines that have less than that memory... say 512M?
I can live with slow transactions, no compactions etc, but do not want an 
OutOfMemory error. The reason for a smaller bound for Cassandra is that I want 
to leave room for other processes to run.

Please help with specific parameters to tune.

Thanks,
Rajat



Re: Newbie Replication/Cluster Question

2011-01-14 Thread Mark Moseley
On Thu, Jan 13, 2011 at 2:32 PM, Mark Moseley moseleym...@gmail.com wrote:
 On Thu, Jan 13, 2011 at 1:08 PM, Gary Dusbabek gdusba...@gmail.com wrote:
 It is impossible to properly bootstrap a new node into a system where
 there are not enough nodes to satisfy the replication factor.  The
 cluster as it stands doesn't contain all the data you are asking it to
 replicate on the new node.

 Ok, maybe I'm thinking of replication_factor backwards. I took it to
 mean how many nodes would have *full* copies of the whole of the
 keyspace's data, in which case with my keyspace with
 replication_factor=2 the still-alive node would have 100% of the data
 to replicate to the wiped-clean node--in which case all the data would
 be there to bootstrap. I was assuming replication_factor=2 in a 2-node
 cluster == both nodes having a full replica of the data. Do I have
 that wrong?

 What's also confusing is that I did this same test on a clean node
 that wasn't clustered yet (which is interesting in that it doesn't
 complain then about replication_factor > # of nodes), so unless it was
 throwing away data as I was inserting it, it'd all be there.

 Is the general rule then that the max. replication factor must be
 #_of_nodes-1? If replication_factor == #_of_nodes, then if you lost
 a box, it seems like your cluster would be toast.


Perhaps the better question would be, if I have a two node cluster and
I want to be able to lose one box completely and replace it (without
losing the cluster), what settings would I need? Or is that an
impossible scenario? In production, I'd imagine a 3 node cluster being
the minimum but even there I could see each box having a full replica,
but probably not beyond 3.


Re: Do you have a site in production environment with Cassandra? What client do you use?

2011-01-14 Thread Dan Kuebrich
We've done hundreds of gigs in and out of cassandra 0.6.8 with pycassa 0.3.
 Working on upgrading to 0.7 and pycassa 1.03.

I don't know if we're using it wrong, but the constraint that a connection object
is tied to a particular keyspace isn't that awesome - we have a number of
keyspaces used simultaneously. Haven't looked into it yet.

On Fri, Jan 14, 2011 at 1:52 PM, Mike Wynholds m...@carbonfive.com wrote:

 We have one in production with Ruby / fauna Cassandra gem and Cassandra
 0.6.x.  The project is live but is stuck in a sort of private beta, so it
 hasn't really been run through any load scenarios.

 ..mike..

 --
 Michael Wynholds | Carbon Five | 310.821.7125 x13 | m...@carbonfive.com



 On Fri, Jan 14, 2011 at 9:24 AM, Ertio Lew ertio...@gmail.com wrote:

 Hey,

 If you have a site in production environment or considering so, what
 is the client that you use to interact with Cassandra. I know that
 there are several clients available out there according to the
 language you use but I would love to know what clients are being used
 widely in production environments and are best to work with(support
 most required features for performance).

 Also preferably tell about the technology stack for your applications.

 Any suggestions, comments appreciated ?

 Thanks
 Ertio





Re: Cassandra in less than 1G of memory?

2011-01-14 Thread Victor Kabdebon
Dear rajat,

Yes it is possible, I have the same constraints. However I must warn you,
from what I see Cassandra memory consumption is not bounded in 0.6.X on
debian 64 Bit

Here is an example of an instance launch in a node :

root 19093  0.1 28.3 1210696 *570052* ?  Sl   Jan11   9:08
/usr/bin/java -ea -Xms128M *-Xmx512M *-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError
-Dcom.sun.management.jmxremote.port=8081
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
org.apache.cassandra.thrift.CassandraDaemon

Look at the second bold value: Xmx indicates the maximum memory that
cassandra can use; it is set to 512, so it could easily fit into 1 Gb.
Now look at the first one: 570Mb > 512Mb. Moreover, if I come back in one
day the first value will be even higher, probably around 610 Mb. Actually it
increases to the point where I need to restart it, otherwise other programs
are shut down by Linux for cassandra to further expand its memory usage...

By the way it's a call to other cassandra users, am I the only one to
encounter this problem ?

Best regards,

Victor K.

2011/1/14 Rajat Chopra rcho...@makara.com

 Hello.



 According to  JVM heap size topic at
 http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need
 at least 1G of memory to run. Is it possible to have a running Cassandra
 cluster with machines that have less than that memory… say 512M?

 I can live with slow transactions, no compactions etc, but do not want an
 OutOfMemory error. The reason for a smaller bound for Cassandra is that I
 want to leave room for other processes to run.



 Please help with specific parameters to tune.



 Thanks,

 Rajat





Re: Newbie Replication/Cluster Question

2011-01-14 Thread Mark Moseley
 Perhaps the better question would be, if I have a two node cluster and
 I want to be able to lose one box completely and replace it (without
 losing the cluster), what settings would I need? Or is that an
 impossible scenario? In production, I'd imagine a 3 node cluster being
 the minimum but even there I could see each box having a full replica,
 but probably not beyond 3.

Or perhaps, in the case of losing a box completely in a 2-node RF=2
cluster, do I need to lower the replication_factor on the still-alive
box, bootstrap the replaced node back in, and then change the
replication_factor back to 2?


Cassandra-Maven-Plugin

2011-01-14 Thread Stephen Connolly
OK,

I nearly have the Cassandra-Maven-Plugin ready.

It has the following goals:
  run: launches Cassandra in the foreground and blocks until you press
^C at which point Maven terminates. Use-case: Running integration
tests from your IDE. Live development from your IDE.

  start: launches Cassandra in the background. Cassandra will be torn
down when Maven ends or if the stop goal is called. Use-case: Running
integration tests from Maven. Live development from your IDE with e.g.
jetty

  clean: Clears out the Cassandra database directory in
${basedir}/target/cassandra. Use-case: Resetting the dataset.

  load: Runs the cassandra-cli with a file as input.  Use-case:
Creating Keyspaces & pre-populating the dataset

  stop: Shuts down the background Cassandra instance started by start.
Use-case: Running integration tests from Maven.

So for example, if you are developing a web application using Maven
you would use a command like:

mvn cassandra:clean cassandra:start cassandra:load jetty:run

which would start up cassandra with a clean dataset and then start up
jetty (which presumably connects via a client library to cassandra).

Similarly, you can use cassandra-maven-plugin, jetty-maven-plugin,
maven-failsafe-plugin and selenium-maven-plugin to run web integration
tests as part of your build.
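
For the integration-test use-case, the wiring would presumably look something
like the pom.xml sketch below - note the plugin coordinates are hypothetical,
since nothing is published yet; only the goal names listed above are real:

<plugin>
  <groupId>org.apache.cassandra</groupId> <!-- hypothetical coordinates -->
  <artifactId>cassandra-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>start-cassandra</id>
      <phase>pre-integration-test</phase>
      <goals><goal>start</goal><goal>load</goal></goals>
    </execution>
    <execution>
      <id>stop-cassandra</id>
      <phase>post-integration-test</phase>
      <goals><goal>stop</goal></goals>
    </execution>
  </executions>
</plugin>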

So I have some questions:

1. Is there a standard file extension for the scripts that get passed
to cassandra-cli?

2. Is there any other obvious goal I have missed out on?

There is a small bit of tidy-up left and then I just have to add some
integration tests and the site documentation.  Once I have all that in
place I will raise a JIRA with the full source code against CASSANDRA
and hopefully a friendly committer will pick it up and commit it into
the tree. While waiting for a committer testers will be welcome.

If it gets accepted I will then see about getting it released and
published on central.

Expect to see the JIRA sometime Monday or Tuesday.

-Stephen


Re: Newbie Replication/Cluster Question

2011-01-14 Thread Mark Moseley
On Fri, Jan 14, 2011 at 4:29 PM, Aaron Morton aa...@thelastpickle.com wrote:
 Here's some slides I did last year that have a simple explanation of RF 
 http://www.slideshare.net/mobile/aaronmorton/well-railedcassandra24112010-5901169

 Short version is, generally no single node contains all the data in the db.
 Normally the RF is going to be less than the number of nodes, and the higher
 the rf, the more concurrent node failures you can handle (when writing at
 Quorum).

 - at rf3 you can keep reading and writing with 1 node down. If you lose a 
 second node the cluster will appear to be down for a portion of the keys. The 
 portion depends on the total number of nodes.
 - at rf 5 the cluster will be up for all keys if you have 2 nodes down. If 
 you have 3 down the cluster will appear down for only a portion of the keys, 
 again the portion depends on the total number of nodes.

 Its a bit more complicated though, when I say 'node is down' I mean one of 
 the nodes that the key would have been written to is down (the 3 or 5 above). 
 So if you had 10 nodes, rf 5, you could have 4 nodes down and the cluster be 
 available for all keys. So long as there are still 3 natural endpoints for 
 each key.

 Hope that helps.

 Aaron

 On 15/01/2011, at 8:52 AM, Mark Moseley moseleym...@gmail.com wrote:

 Perhaps the better question would be, if I have a two node cluster and
 I want to be able to lose one box completely and replace it (without
 losing the cluster), what settings would I need? Or is that an
 impossible scenario? In production, I'd imagine a 3 node cluster being
 the minimum but even there I could see each box having a full replica,
 but probably not beyond 3.

 Or perhaps, in the case of losing a box completely in a 2-node RF=2
 cluster, do I need to lower the replication_factor on the still-alive
 box, bootstrap the replaced node back in, and then change the
 replication_factor=2?


Excellent, thanks! I'll definitely be checking those out.  I just want
to make sure I've got the hang of DR before we start deploying
Cassandra, and I'd hate to figure all this out later on with angry
customers standing over my shoulder :)


Re: Cassandra in less than 1G of memory?

2011-01-14 Thread Edward Capriolo
On Fri, Jan 14, 2011 at 2:13 PM, Victor Kabdebon
victor.kabde...@gmail.com wrote:
 Dear rajat,

 Yes it is possible, I have the same constraints. However I must warn you,
 from what I see Cassandra memory consumption is not bounded in 0.6.X on
 debian 64 Bit

 Here is an example of an instance launch in a node :

 root 19093  0.1 28.3 1210696 570052 ?  Sl   Jan11   9:08
 /usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
 -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
 -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
 -Dcom.sun.management.jmxremote.ssl=false
 -Dcom.sun.management.jmxremote.authenticate=false
 -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
 bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
 org.apache.cassandra.thrift.CassandraDaemon

 Look at the second bold value, Xmx indicates the maximum memory that
 cassandra can use; it is set to be 512, so it could easily fit into 1 Gb.
 Now look at the first one: 570Mb > 512Mb. Moreover if I come back in one
 day the first value will be even higher. Probably around 610 Mb. Actually it
 increases to the point where I need to restart it otherwise other programs
 are shut down by Linux for cassandra to further expand its memory usage...

 By the way it's a call to other cassandra users, am I the only one to
 encounter this problem ?

 Best regards,

 Victor K.

 2011/1/14 Rajat Chopra rcho...@makara.com

 Hello.



 According to  JVM heap size topic at
 http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need
  at least 1G of memory to run. Is it possible to have a running Cassandra
 cluster with machines that have less than that memory… say 512M?

 I can live with slow transactions, no compactions etc, but do not want an
 OutOfMemory error. The reason for a smaller bound for Cassandra is that I
 want to leave room for other processes to run.



 Please help with specific parameters to tune.



 Thanks,

 Rajat




-Xmx512M is not an overall memory limit. MMAP'ed files also consume
memory. Try setting the disk access mode to standard, not mmap or
mmap_index_only.
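
For reference, that setting lives in 0.6's storage-conf.xml; a minimal sketch:

<!-- storage-conf.xml: avoid mmap so resident memory stays close to the JVM heap -->
<DiskAccessMode>standard</DiskAccessMode>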


is it possible to map one input from a file and one from cassandra?

2011-01-14 Thread 김준영
hi,

cassandra supports hadoop to map & reduce from cassandra.

now I am digging to find out a way to map from a file and from cassandra together.

I mean, if both of them were files on my disk, it would be possible by using splits.

but in this kind of situation, which way is possible?

for example:

in a cassandra)
key1| value1 | value2
key2| value3 | value4
key3| value5 | value6

in a file)
key1| value1 | value2
key2| value7 | value4
key3| value7 | value6


the sizes of both are very huge.
I want to get the result of a diff between them.

which keys are deleted?
which values are changed?

thanks.
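
To make the desired diff concrete, here is a small self-contained Java sketch
using the sample data above; at the sizes described you would run this as a
reduce-side join in MapReduce (group both sources by key), but the per-key
comparison logic stays the same:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DiffDemo {
    public static void main(String[] args) {
        Map<String, List<String>> cassandra = new HashMap<>();
        cassandra.put("key1", Arrays.asList("value1", "value2"));
        cassandra.put("key2", Arrays.asList("value3", "value4"));
        cassandra.put("key3", Arrays.asList("value5", "value6"));

        Map<String, List<String>> file = new HashMap<>();
        file.put("key1", Arrays.asList("value1", "value2"));
        file.put("key2", Arrays.asList("value7", "value4"));
        file.put("key3", Arrays.asList("value7", "value6"));

        // keys present in cassandra but missing from the file were deleted;
        // keys present in both but with different values were changed
        for (Map.Entry<String, List<String>> e : cassandra.entrySet()) {
            List<String> other = file.get(e.getKey());
            if (other == null) {
                System.out.println("deleted: " + e.getKey());
            } else if (!other.equals(e.getValue())) {
                System.out.println("changed: " + e.getKey());
            }
        }
    }
}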


Re: Cassandra in less than 1G of memory?

2011-01-14 Thread Jonathan Ellis
mmapping only consumes memory that the OS can afford to feed it.

On Fri, Jan 14, 2011 at 7:29 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 On Fri, Jan 14, 2011 at 2:13 PM, Victor Kabdebon
 victor.kabde...@gmail.com wrote:
 Dear rajat,

 Yes it is possible, I have the same constraints. However I must warn you,
 from what I see Cassandra memory consumption is not bounded in 0.6.X on
 debian 64 Bit

 Here is an example of an instance launch in a node :

 root 19093  0.1 28.3 1210696 570052 ?  Sl   Jan11   9:08
 /usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
 -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
 -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
 -Dcom.sun.management.jmxremote.ssl=false
 -Dcom.sun.management.jmxremote.authenticate=false
 -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
 bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
 org.apache.cassandra.thrift.CassandraDaemon

 Look at the second bold value, Xmx indicates the maximum memory that
 cassandra can use; it is set to be 512, so it could easily fit into 1 Gb.
  Now look at the first one: 570Mb > 512Mb. Moreover if I come back in one
  day the first value will be even higher. Probably around 610 Mb. Actually it
  increases to the point where I need to restart it otherwise other programs
  are shut down by Linux for cassandra to further expand its memory usage...

 By the way it's a call to other cassandra users, am I the only one to
 encounter this problem ?

 Best regards,

 Victor K.

 2011/1/14 Rajat Chopra rcho...@makara.com

 Hello.



 According to  JVM heap size topic at
 http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need
  at least 1G of memory to run. Is it possible to have a running Cassandra
 cluster with machines that have less than that memory… say 512M?

 I can live with slow transactions, no compactions etc, but do not want an
 OutOfMemory error. The reason for a smaller bound for Cassandra is that I
 want to leave room for other processes to run.



 Please help with specific parameters to tune.



 Thanks,

 Rajat




 -Xmx512M is not an overall memory limit. MMAP'ed files also consume
 memory. Try turning disk access mode to standard not (MMAP or
 MMAP_INDEX_ONLY).




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Cassandra in less than 1G of memory?

2011-01-14 Thread Victor Kabdebon
Hi Jonathan, hi Edward,

Jonathan : but it looks like mmapping wants to consume the entire memory of
my server. It goes up to 1.7 Gb for a ridiculously small amount of data.
Am I doing something wrong, or is there something I should change to prevent
this never-ending increase of memory consumption ?
Edward : I am not sure, I will try to check that tomorrow, but my disk access
mode is standard, not mmap.

Anyway, thank you very much,
Victor K.

PS : here is, some hours later, the result of ps aux | grep cassandra
root 19093  0.1 30.0 1243940 *605060* ?  Sl   Jan11  10:15
/usr/bin/java -ea -Xms128M *-Xmx512M* -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError
-Dcom.sun.management.jmxremote.port=8081
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
org.apache.cassandra.thrift.CassandraDaemon


2011/1/15 Jonathan Ellis jbel...@gmail.com

 mmapping only consumes memory that the OS can afford to feed it.

 On Fri, Jan 14, 2011 at 7:29 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
  On Fri, Jan 14, 2011 at 2:13 PM, Victor Kabdebon
  victor.kabde...@gmail.com wrote:
  Dear rajat,
 
  Yes it is possible, I have the same constraints. However I must warn
 you,
  from what I see Cassandra memory consumption is not bounded in 0.6.X on
  debian 64 Bit
 
  Here is an example of an instance launch in a node :
 
  root 19093  0.1 28.3 1210696 570052 ?  Sl   Jan11   9:08
  /usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC
 -XX:+UseConcMarkSweepGC
  -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
 -XX:MaxTenuringThreshold=1
  -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
  -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
 
 bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
  org.apache.cassandra.thrift.CassandraDaemon
 
  Look at the second bold value, Xmx indicates the maximum memory that
  cassandra can use; it is set to be 512, so it could easily fit into 1 Gb.
  Now look at the first one: 570Mb > 512Mb. Moreover if I come back in one
  day the first value will be even higher. Probably around 610 Mb. Actually it
  increases to the point where I need to restart it otherwise other programs
  are shut down by Linux for cassandra to further expand its memory usage...
 
  By the way it's a call to other cassandra users, am I the only one to
  encounter this problem ?
 
  Best regards,
 
  Victor K.
 
  2011/1/14 Rajat Chopra rcho...@makara.com
 
  Hello.
 
 
 
  According to the JVM heap size topic at
  http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need
  at least 1G of memory to run. Is it possible to have a running Cassandra
  cluster with machines that have less than that memory… say 512M?
 
  I can live with slow transactions, no compactions etc, but do not want
 an
  OutOfMemory error. The reason for a smaller bound for Cassandra is that
 I
  want