Re: CQL 3 and wide rows

2014-05-20 Thread Maciej Miklas
Thank you Nate - now I understand it! This is a real improvement compared 
to the CLI :)

Regards,
Maciej


On 20 May 2014, at 17:16, Nate McCall  wrote:

> Something like this might work:
> 
> 
> cqlsh:my_keyspace> CREATE TABLE my_widerow (
>  ...   id text,
>  ...   my_col timeuuid,
>  ...   PRIMARY KEY (id, my_col)
>  ... ) WITH caching='KEYS_ONLY' AND
>  ...   compaction={'class': 'LeveledCompactionStrategy'};
> cqlsh:my_keyspace> insert into my_widerow (id, my_col) values 
> ('some_key_1',now());
> cqlsh:my_keyspace> insert into my_widerow (id, my_col) values 
> ('some_key_1',now());
> cqlsh:my_keyspace> insert into my_widerow (id, my_col) values 
> ('some_key_1',now());
> cqlsh:my_keyspace> insert into my_widerow (id, my_col) values 
> ('some_key_1',now());
> cqlsh:my_keyspace> insert into my_widerow (id, my_col) values 
> ('some_key_1',now());
> cqlsh:my_keyspace> insert into my_widerow (id, my_col) values 
> ('some_key_1',now());
> cqlsh:my_keyspace> insert into my_widerow (id, my_col) values 
> ('some_key_1',now());
> cqlsh:my_keyspace> insert into my_widerow (id, my_col) values 
> ('some_key_1',now());
> cqlsh:my_keyspace> insert into my_widerow (id, my_col) values 
> ('some_key_1',now());
> cqlsh:my_keyspace> insert into my_widerow (id, my_col) values 
> ('some_key_1',now());
> cqlsh:my_keyspace> select * from my_widerow;
> 
>  id         | my_col
> ------------+--------------------------------------
>  some_key_1 | 7266d240-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 73ba0630-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 74404d30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 74defe30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 75569f30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 75bf9a30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 76227ab0-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 76cfd1b0-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 777364b0-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 7aa061b0-e030-11e3-a50d-8b2f9bfbfa10
> 
> cqlsh:my_keyspace> select * from my_widerow where id = 'some_key_1' and 
> my_col > 73ba0630-e030-11e3-a50d-8b2f9bfbfa10;
> 
>  id         | my_col
> ------------+--------------------------------------
>  some_key_1 | 74404d30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 74defe30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 75569f30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 75bf9a30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 76227ab0-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 76cfd1b0-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 777364b0-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 7aa061b0-e030-11e3-a50d-8b2f9bfbfa10
> 
> cqlsh:my_keyspace> select * from my_widerow where id = 'some_key_1' and 
> my_col > 73ba0630-e030-11e3-a50d-8b2f9bfbfa10 and my_col < 
> 76227ab0-e030-11e3-a50d-8b2f9bfbfa10;
> 
>  id         | my_col
> ------------+--------------------------------------
>  some_key_1 | 74404d30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 74defe30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 75569f30-e030-11e3-a50d-8b2f9bfbfa10
>  some_key_1 | 75bf9a30-e030-11e3-a50d-8b2f9bfbfa10
> 
> 
> 
> These queries would all work fine from the DS Java Driver. Note that only the 
> cells that are needed are pulled into memory:
> 
> 
> ./bin/nodetool cfstats my_keyspace my_widerow
>...
>Column Family: my_widerow
>...
>Average live cells per slice (last five minutes): 6.0
>...
> 
> 
> This shows that we are slicing across 6 rows on average for the last couple 
> of select statements. 
> 
> Hope that helps.
> 
> 
> 
> -- 
> -
> Nate McCall
> Austin, TX
> @zznate
> 
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com



Re: CQL 3 and wide rows

2014-05-20 Thread Maciej Miklas
Hi Aaron,

Thanks for the answer!


Let's consider the following CLI-style code:

// pseudocode: generate one storage row with many dynamic columns
for (int i = 0; i < 1_000_000; i++) {
  set['rowKey1']['myCol' + i] = UUID.randomUUID();
}


The code above will create a single row that contains 10^6 columns sorted by 
'i'. This will work fine, and this is the wide row as I understand it - a row 
that holds many columns AND lets me read only a part of it with the right slice 
query. On the other hand, I can iterate over all columns without extra latency 
because the data is stored on a single node. I've been using similar structures 
as a replacement for secondary indexes - it's a well-known pattern.

How would I model it in CQL 3?

1) I could create a Map, but Maps are fully loaded into memory, and a Map 
containing 10^6 elements is definitely a problem. It is also a big waste of RAM 
considering that I only need to read a small subset.

2) I could alter the table for each new column, which would create a structure 
similar to the one from my CLI example. But it looks to me like all column 
names are loaded into RAM, which is still a large limitation. I hope that I am 
wrong here - I am not sure.

3) I could redesign my model and divide the data into many rows, but why would 
I do that if I can use wide rows?

My idea of a wide row is a row that can hold a large number of key-value pairs (in 
any form), where I can filter on those keys to efficiently load only the part 
I currently need.
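
Roughly, what I am after would look something like this in CQL 3 (just a 
sketch - table and column names are made up):

CREATE TABLE wide_row_emulation (
    row_key  text,
    col_name text,                      -- plays the role of the generated CLI column name
    value    uuid,
    PRIMARY KEY (row_key, col_name)
);

-- read only a slice of one "wide row":
SELECT col_name, value FROM wide_row_emulation
 WHERE row_key = 'rowKey1'
   AND col_name >= 'myCol1000' AND col_name < 'myCol2000';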


Regards,
Maciej 


On 20 May 2014, at 09:06, Aaron Morton  wrote:

> In a CQL 3 table the only **column** names are the ones defined in the table, 
> in the example below there are three column names. 
> 
> 
>>> CREATE TABLE keyspace.widerow (
>>> row_key text,
>>> wide_row_column text,
>>> data_column text,
>>> PRIMARY KEY (row_key, wide_row_column));
>>> 
>>> Check out, for example, 
>>> http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.​
> 
> Internally there may be more **cells** (as we now call the internal 
> columns). In the example above each value for row_key will create a single 
> partition (as we now call internal storage engine rows). In each of those 
> partitions there will be cells for each CQL 3 row that has the same row_key; 
> those cells will use a Composite for the name. The first part of the 
> composite will be the value of the wide_row_column and the second will be the 
> literal name of the non-primary-key columns. 
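
A minimal sketch of that mapping, using the widerow table quoted above (values 
made up): two CQL 3 rows written into the same partition

INSERT INTO widerow (row_key, wide_row_column, data_column) VALUES ('k1', 'a', 'v1');
INSERT INTO widerow (row_key, wide_row_column, data_column) VALUES ('k1', 'b', 'v2');

are stored internally as cells of partition 'k1' with composite names, roughly:

  ('a', 'data_column') -> 'v1'
  ('b', 'data_column') -> 'v2'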
> 
> IMHO wide partitions (storage engine rows) are more prevalent in CQL 3 than 
> in Thrift models. 
> 
>> But still - I do not see Iteration, so it looks to me that CQL 3 is limited 
>> when compared to CLI/Hector.
> Nowadays you can do pretty much everything you could do in the CLI. Provide an example 
> and we may be able to help. 
> 
> Cheers
> Aaron
> 
> -
> Aaron Morton
> New Zealand
> @aaronmorton
> 
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
> 
> On 20/05/2014, at 8:18 am, Maciej Miklas  wrote:
> 
>> Hi James,
>> 
>> Clustering is based on rows. I think that you meant not clustering columns, 
>> but compound columns. Still, all columns belong to a single table and are 
>> stored within a single folder on one computer. And it looks to me (but I'm 
>> not sure) like the CQL 3 driver loads all column names into memory - which is 
>> confusing to me. On one side we have a wide row, but we load the whole thing 
>> into RAM...
>> 
>> My understanding of a wide row is a row that supports millions of columns, or 
>> similar things like a map or set. In the CLI you would generate column names (or 
>> use compound columns) to simulate a set or map; in CQL 3 you would use some 
>> static names plus Map or Set structures, or you could still alter the table and 
>> have a large number of columns. But still - I do not see iteration, so it 
>> looks to me like CQL 3 is limited when compared to CLI/Hector.
>> 
>> 
>> Regards,
>> Maciej
>> 
>> On 19 May 2014, at 17:30, James Campbell  
>> wrote:
>> 
>>> Maciej,
>>> 
>>> In CQL3 "wide rows" are expected to be created using clustering columns.  
>>> So while the schema will have a relatively smaller number of named columns, 
>>> the effect is a wide row.  For example:
>>> 
>>> CREATE TABLE keyspace.widerow (
>>> row_key text,
>>> wide_row_column text,
>>> data_column text,
>>> PRIMARY KEY (row_key, wide_row_column));
>>> 
>>> Check out, for example, 
>>> http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.​
>>> 
>>> James
>>> From: Maciej Miklas 
>>> Sent: Monday, May 19, 2014 11:20 AM
>>> To: user@cassandra.apache.org
>>> Subject: CQL 3 and wide rows
>>>  
>>> Hi *,
>>> 
>>> I've checked the DataStax driver code for CQL 3, and it looks like the column 
>>> names for a particular table are fully loaded into memory - is this true?
>>> 
>>> Cassandra should support wide rows, meaning tables with millions of 
>>> columns. Knowing that, I would expect some kind of iterator for column names. Am 
>>> I missing something here? 
>>> 
>>> 
>>> Regards,
>>> Maciej Miklas
>> 
> 



Re: CQL 3 and wide rows

2014-05-20 Thread Maciej Miklas
yes :)

On 20 May 2014, at 14:24, Jack Krupansky  wrote:

> To keep the terminology clear, your “row_key” is actually the “partition 
> key”, and “wide_row_column” is actually a “clustering column”, and the 
> combination of your row_key and wide_row_column is a “compound primary key”.
>  
> -- Jack Krupansky
>  
> From: Aaron Morton
> Sent: Tuesday, May 20, 2014 3:06 AM
> To: Cassandra User
> Subject: Re: CQL 3 and wide rows
>  
> In a CQL 3 table the only **column** names are the ones defined in the table, 
> in the example below there are three column names. 
>  
>  
>>> CREATE TABLE keyspace.widerow (
>>> row_key text,
>>> wide_row_column text,
>>> data_column text,
>>> PRIMARY KEY (row_key, wide_row_column));
>>>  
>>> Check out, for example, 
>>> http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.​
>  
> Internally there may be more **cells** (as we now call the internal 
> columns). In the example above each value for row_key will create a single 
> partition (as we now call internal storage engine rows). In each of those 
> partitions there will be cells for each CQL 3 row that has the same row_key; 
> those cells will use a Composite for the name. The first part of the 
> composite will be the value of the wide_row_column and the second will be the 
> literal name of the non-primary-key columns.
>  
> IMHO wide partitions (storage engine rows) are more prevalent in CQL 3 than 
> in Thrift models.
>  
>> But still - I do not see Iteration, so it looks to me that CQL 3 is limited 
>> when compared to CLI/Hector.
> Nowadays you can do pretty much everything you could do in the CLI. Provide an example 
> and we may be able to help.
>  
> Cheers
> Aaron
>  
> -----
> Aaron Morton
> New Zealand
> @aaronmorton
>  
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>  
> On 20/05/2014, at 8:18 am, Maciej Miklas  wrote:
> 
>> Hi James,
>>  
>> Clustering is based on rows. I think that you meant not clustering columns, 
>> but compound columns. Still, all columns belong to a single table and are 
>> stored within a single folder on one computer. And it looks to me (but I'm 
>> not sure) like the CQL 3 driver loads all column names into memory - which is 
>> confusing to me. On one side we have a wide row, but we load the whole thing 
>> into RAM...
>>  
>> My understanding of a wide row is a row that supports millions of columns, or 
>> similar things like a map or set. In the CLI you would generate column names (or 
>> use compound columns) to simulate a set or map; in CQL 3 you would use some 
>> static names plus Map or Set structures, or you could still alter the table and 
>> have a large number of columns. But still - I do not see iteration, so it 
>> looks to me like CQL 3 is limited when compared to CLI/Hector.
>>  
>>  
>> Regards,
>> Maciej
>>  
>> On 19 May 2014, at 17:30, James Campbell  
>> wrote:
>> 
>>> Maciej,
>>>  
>>> In CQL3 "wide rows" are expected to be created using clustering columns.  
>>> So while the schema will have a relatively smaller number of named columns, 
>>> the effect is a wide row.  For example:
>>>  
>>> CREATE TABLE keyspace.widerow (
>>> row_key text,
>>> wide_row_column text,
>>> data_column text,
>>> PRIMARY KEY (row_key, wide_row_column));
>>>  
>>> Check out, for example, 
>>> http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.​
>>>  
>>> James
>>> From: Maciej Miklas 
>>> Sent: Monday, May 19, 2014 11:20 AM
>>> To: user@cassandra.apache.org
>>> Subject: CQL 3 and wide rows
>>>  
>>> Hi *,
>>>  
>>> I've checked the DataStax driver code for CQL 3, and it looks like the column 
>>> names for a particular table are fully loaded into memory - is this true?
>>>  
>>> Cassandra should support wide rows, meaning tables with millions of 
>>> columns. Knowing that, I would expect some kind of iterator for column names. Am 
>>> I missing something here?
>>>  
>>>  
>>> Regards,
>>> Maciej Miklas
>> 
>>  
> 
>  



Re: CQL 3 and wide rows

2014-05-19 Thread Maciej Miklas
Hi James,

Clustering is based on rows. I think that you meant not clustering columns, but 
compound columns. Still, all columns belong to a single table and are stored 
within a single folder on one computer. And it looks to me (but I'm not sure) 
like the CQL 3 driver loads all column names into memory - which is confusing to 
me. On one side we have a wide row, but we load the whole thing into RAM...

My understanding of a wide row is a row that supports millions of columns, or 
similar things like a map or set. In the CLI you would generate column names (or use 
compound columns) to simulate a set or map; in CQL 3 you would use some static 
names plus Map or Set structures, or you could still alter the table and have a large 
number of columns. But still - I do not see iteration, so it looks to me like 
CQL 3 is limited when compared to CLI/Hector.
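
For example, the two CQL 3 alternatives I can see would look roughly like this 
(just a sketch, table and column names are made up):

-- static columns plus a collection (the whole map is read at once):
CREATE TABLE user_data (
    id    text PRIMARY KEY,
    name  text,
    attrs map<text, text>
);

-- or a clustering column, giving one CQL row per "dynamic column":
CREATE TABLE user_attrs (
    id    text,
    attr  text,
    value text,
    PRIMARY KEY (id, attr)
);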


Regards,
Maciej

On 19 May 2014, at 17:30, James Campbell  wrote:

> Maciej,
> 
> In CQL3 "wide rows" are expected to be created using clustering columns.  So 
> while the schema will have a relatively smaller number of named columns, the 
> effect is a wide row.  For example:
> 
> CREATE TABLE keyspace.widerow (
> row_key text,
> wide_row_column text,
> data_column text,
> PRIMARY KEY (row_key, wide_row_column));
> 
> Check out, for example, 
> http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.​
> 
> James
> From: Maciej Miklas 
> Sent: Monday, May 19, 2014 11:20 AM
> To: user@cassandra.apache.org
> Subject: CQL 3 and wide rows
>  
> Hi *,
> 
> I've checked the DataStax driver code for CQL 3, and it looks like the column 
> names for a particular table are fully loaded into memory - is this true?
> 
> Cassandra should support wide rows, meaning tables with millions of columns. 
> Knowing that, I would expect some kind of iterator for column names. Am I missing 
> something here? 
> 
> 
> Regards,
> Maciej Miklas



Re: CQL 3 and wide rows

2014-05-19 Thread Maciej Miklas
Hello Jack,

You have given a perfect example of a wide row. Each reading from a sensor 
creates a new column within the row. It was also possible with Hector/CLI to have 
millions of columns within a single row. According to this page 
http://wiki.apache.org/cassandra/CassandraLimitations a single row can have 2 
billion columns.

How does this relate to CQL 3 and tables? 

I still do not understand it, because:
- it looks like the driver loads all column names into memory - and it looks to me 
like the 2 billion column limitation from the CLI is not valid anymore
- Map and Set values do not support iteration (a sketch of the kind of access I mean is below)
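
To make that last point concrete, this is roughly how I would want to walk a 
sensor's readings in slices (a sketch only - table and column names are made up):

CREATE TABLE sensor_readings (
    sensor_id text,
    read_at   timeuuid,
    value     double,
    PRIMARY KEY (sensor_id, read_at)
);

-- page through one sensor's partition, one slice at a time:
SELECT read_at, value FROM sensor_readings
 WHERE sensor_id = 's1'
   AND read_at > 7aa061b0-e030-11e3-a50d-8b2f9bfbfa10   -- example value: last timeuuid from the previous page
 LIMIT 100;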


Regards,
Maciej


On 19 May 2014, at 17:31, Jack Krupansky  wrote:

> You might want to review this blog post on supporting dynamic columns in 
> CQL3, which points out that “the way to model dynamic cells in CQL is with a 
> compound primary key.”
>  
> See:
> http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
>  
> -- Jack Krupansky
>  
> From: Maciej Miklas
> Sent: Monday, May 19, 2014 11:20 AM
> To: user@cassandra.apache.org
> Subject: CQL 3 and wide rows
>  
> Hi *,
>  
> I've checked the DataStax driver code for CQL 3, and it looks like the column 
> names for a particular table are fully loaded into memory - is this true?
>  
> Cassandra should support wide rows, meaning tables with millions of columns. 
> Knowing that, I would expect some kind of iterator for column names. Am I missing 
> something here?
>  
>  
> Regards,
> Maciej Miklas



CQL 3 and wide rows

2014-05-19 Thread Maciej Miklas
Hi *,

I've checked the DataStax driver code for CQL 3, and it looks like the column
names for a particular table are fully loaded into memory - is this true?

Cassandra should support wide rows, meaning tables with millions of
columns. Knowing that, I would expect some kind of iterator for column names. Am
I missing something here?


Regards,
Maciej Miklas


Re: Cyclop - CQL web based editor has been released!

2014-05-19 Thread Maciej Miklas
thanks - I've fixed it.

Regards,
Maciej


On Mon, May 12, 2014 at 2:50 AM, graham sanderson  wrote:

> Looks cool - giving it a try now (note FYI when building,
> TestDataConverter.java line 46 assumes a specific time zone)
>
> On May 11, 2014, at 12:41 AM, Maciej Miklas  wrote:
>
> Hi everybody,
>
> I am aware that this mailing list is meant for Cassandra users, but I've
> developed something that is strictly related to Cassandra, so I thought that
> it might be interesting for some of you.
> I've already sent one email several months ago, but since then a lot of
> things have changed!
>
> Cyclop is a web-based CQL editor - you can deploy it in a web container and
> use its web interface to execute CQL queries or to import/export data.
> There is also a live deployment, so you can try it out immediately. Of
> course the whole thing is open source.
>
> Here is the project link containing all details:
> https://github.com/maciejmiklas/cyclop
>
> Regards,
> Maciej
>
>
>


subscription test - please ignore

2014-05-13 Thread Maciej Miklas


Cyclop - CQL web based editor has been released!

2014-05-10 Thread Maciej Miklas
Hi everybody,

I am aware that this mailing list is meant for Cassandra users, but I've 
developed something that is strictly related to Cassandra, so I thought that it 
might be interesting for some of you. 
I've already sent one email several months ago, but since then a lot of things 
have changed!

Cyclop is a web-based CQL editor - you can deploy it in a web container and use 
its web interface to execute CQL queries or to import/export data. 
There is also a live deployment, so you can try it out immediately. Of course the 
whole thing is open source.

Here is the project link containing all details: 
https://github.com/maciejmiklas/cyclop

Regards,
Maciej

Re: Cassandra 1.2 : OutOfMemoryError: unable to create new native thread

2013-12-16 Thread Maciej Miklas
cassandra-env.sh has the option

JVM_OPTS="$JVM_OPTS -Xss180k"

It will give this error if you start Cassandra with Java 7. So increase the
value, or remove the option.

Regards,
Maciej


On Mon, Dec 16, 2013 at 2:37 PM, srmore  wrote:

> What is your thread stack size (xss) ? try increasing that, that could
> help. Sometimes the limitation is imposed by the host provider (e.g.
> amazon ec2 etc.)
>
> Thanks,
> Sandeep
>
>
> On Mon, Dec 16, 2013 at 6:53 AM, Oleg Dulin  wrote:
>
>> Hi guys!
>>
>> I believe my limits settings are correct. Here is the output of "ulimit
>> -a":
>>
>> core file size  (blocks, -c) 0
>> data seg size   (kbytes, -d) unlimited
>> scheduling priority (-e) 0
>> file size   (blocks, -f) unlimited
>> pending signals (-i) 1547135
>> max locked memory   (kbytes, -l) unlimited
>> max memory size (kbytes, -m) unlimited
>> open files  (-n) 10
>> pipe size(512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> real-time priority  (-r) 0
>> stack size  (kbytes, -s) 8192
>> cpu time   (seconds, -t) unlimited
>> max user processes  (-u) 32768
>> virtual memory  (kbytes, -v) unlimited
>> file locks  (-x) unlimited
>>
>> However,  I just had a couple of cassandra nodes go down over the weekend
>> for no apparent reason with the following error:
>>
>> java.lang.OutOfMemoryError: unable to create new native thread
>>at java.lang.Thread.start0(Native Method)
>>at java.lang.Thread.start(Thread.java:691)
>>at 
>> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
>>
>>at 
>> java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1017)
>>
>>at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1163)
>>
>>at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>>at java.lang.Thread.run(Thread.java:722)
>>
>> Any input is greatly appreciated.
>> --
>> Regards,
>> Oleg Dulin
>> http://www.olegdulin.com
>>
>>
>>
>


Cyclop - CQL3 web based editor

2013-12-11 Thread Maciej Miklas
Hi all,

This is the Cassandra mailing list, but I've developed something that is
strictly related to Cassandra, and some of you might find it useful, so
I've decided to send email to this group.

This is a web-based CQL3 editor. The idea is to deploy it once and have a
simple and comfortable CQL3 interface over the web - without the need to install
anything.

The editor itself supports code completion, based not only on CQL syntax,
but also on database content - so for example the select statement will
suggest tables from the active keyspace, and the where clause only columns from
the table provided after "select from".

The results are displayed in a reversed table - rows horizontally and columns
vertically. This seems more natural for a column-oriented database.

You can also export query results to CSV, or add a query as a browser bookmark.

The whole application is based on Wicket + Bootstrap + Spring and can be
deployed in any Servlet 3.0 container.

Here is the project (open source): https://github.com/maciejmiklas/cyclop


Have a fun!
 Maciej


Re: Cassandra 1.1.5 - SerializingCacheProvider - possible memory leak?

2012-12-03 Thread Maciej Miklas
Size and Capacity are in bytes. The RAM is consumed right after Cassandra
starts (3 GB heap) - the reason for this could be the 400,000,000 rows on a single
node; the serialized bloom filters take 1.2 GB of HDD space.


On Mon, Dec 3, 2012 at 10:14 AM, Maciej Miklas  wrote:

> Hi,
>
> I have following Cassandra setup on server with 24GB RAM:
>
> *cassandra-env.sh*
> MAX_HEAP_SIZE="6G"
> HEAP_NEWSIZE="500M"
>
> *cassandra.yaml*
> key_cache_save_period: 0
> row_cache_save_period: 0
> key_cache_size_in_mb: 512
> row_cache_size_in_mb: 10240
> row_cache_provider: SerializingCacheProvider
>
>
> I'm getting OutOfMemory errors, and VisualVM shows that Old Gen takes
> nearly the whole heap.
>
> Those are the Cassandra log messages:
>
> INFO CLibrary JNA mlockall successful
> INFO DatabaseDescriptor DiskAccessMode 'auto' determined to be mmap,
> indexAccessMode is mmap
> INFO DatabaseDescriptor Global memtable threshold is enabled at 1981 MB
> INFO CacheService Initializing key cache with capacity of 512 MBs.
> INFO CacheService Scheduling key cache save to each 0 seconds (going to
> save all keys).
> INFO CacheService Initializing row cache with capacity of 10240 MBs and
> provider org.apache.cassandra.cache.SerializingCacheProvider
> INFO CacheService Scheduling row cache save to each 0 seconds (going to
> save all keys).
> .
> INFO GCInspector GC for ConcurrentMarkSweep: 1106 ms for 1 collections,
> 5445489440 used; max is 6232735744
> 
> INFO StatusLogger Cache Type Size
> Capacity   KeysToSave
>   Provider
> INFO StatusLogger KeyCache 831 782
> 831 782 all
>
> INFO StatusLogger RowCache  196404489
>  196404688  all
>  org.apache.cassandra.cache.SerializingCacheProvider
> .
> INFO StatusLogger ColumnFamily   Memtable ops, data
> INFO StatusLogger MyCF1 192828,66056113
> INFO StatusLogger MyCF2 59913,19535021
> INFO StatusLogger MyCF3 124953,59082091
> 
> WARN [ScheduledTasks:1] GCInspector.java Heap is 0.8623632454134093 full.
>  You may need to reduce memtable and/or cache sizes.  Cassandra will now
> flush up to the two largest memtables to free up memory.  Adjust
> flush_largest_memtables_at threshold in cassandra.yaml if you don't want
> Cassandra to do this automatically
>
>
>
> 1) I've set the row cache size to 10 GB. A single row needs between 300-500
> bytes in serialized form, which would allow a maximum of 20 million row key
> entries.
> SerializingCacheProvider reports a size of 196 million - how can I
> interpret this number?
>
> 2) I am using default settings besides the changes described above. Since the
> key cache is small and the off-heap cache is active, what is taking space in Old
> Gen?
>
>
> Thanks,
> Maciej
>
>
>
>
>
>


Re: What is the ideal server-side technology stack to use with Cassandra?

2012-08-18 Thread Maciej Miklas
I'm using Java + Tomcat + Spring + Hector on Linux - it works just great, as
always.

It is also not a bad idea to mix databases - Cassandra is not the solution
for every problem; Cassandra + Mongo could be ;)

On Fri, Aug 17, 2012 at 7:54 PM, Aaron Turner  wrote:

> My stack:
>
> Java + JRuby + Rails + Torquebox
>
> I'm using the Hector client (arguably the most mature out there) and
> JRuby+RoR+Torquebox gives me a great development platform which really
> scales (full native thread support for example) and is extremely
> powerful.  Honestly I expect, all my future RoR apps will be built on
> JRuby/Torquebox because I've been so happy with it even if I don't
> have a specific need to utilize Java libraries from inside the app.
>
> And the best part is that I've yet to have to write a single line of Java!
> :)
>
>
>
> On Fri, Aug 17, 2012 at 6:53 AM, Edward Capriolo 
> wrote:
> > The best stack is the THC stack. :)
> >
> > Tomcat Hadoop Cassandra :)
> >
> > On Fri, Aug 17, 2012 at 6:09 AM, Andy Ballingall TF
> >  wrote:
> >> Hi,
> >>
> >> I've been running a number of tests with Cassandra using a couple of
> >> PHP drivers (namely PHPCassa (https://github.com/thobbs/phpcassa/) and
> >> PDO-cassandra (
> http://code.google.com/a/apache-extras.org/p/cassandra-pdo/),
> >> and the experience hasn't been great, mainly because I can't try out
> >> the CQL3.
> >>
> >> Aaron Morton (aa...@thelastpickle.com) advised:
> >>
> >> "If possible i would avoid using PHP. The PHP story with cassandra has
> >> not been great in the past. There is little love for it, so it takes a
> >> while for work changes to get in the client drivers.
> >>
> >> AFAIK it lacks server side states which makes connection pooling
> >> impossible. You should not pool cassandra connections in something
> >> like HAProxy."
> >>
> >> So my question is - if you were to build a new scalable project from
> >> scratch tomorrow sitting on top of Cassandra, which technologies would
> >> you select to serve HTTP requests to ensure you get:
> >>
> >> a) The best support from the cassandra community (e.g. timely updates
> >> of drivers, better stability)
> >> b) Optimal efficiency between webservers and cassandra cluster, in
> >> terms of the performance of individual requests and in the volumes of
> >> connections handled per second
> >> c) Ease of development and and deployment.
> >>
> >> What worked for you, and why? What didn't work for you?
>
> --
> Aaron Turner
> http://synfin.net/ Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix &
> Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
> -- Benjamin Franklin
> "carpe diem quam minimum credula postero"
>


Re: Understanding UnavailableException

2012-08-17 Thread Maciej Miklas
UnavailableException is a bit tricky. It means that not all replicas
required by the CL received the update. You do not actually know whether the
update was stored or not, or what exactly went wrong.

This is why writing with CL.ALL can get problematic. It is enough for
only one replica to be offline and you will get the exception.
Remember also that CL.ALL means all replicas in all data centers - not
only the local DC. Writing with LOCAL_QUORUM could be a better idea.

There is only one CL where an exception guarantees that the data was really not
stored: CL.ANY with hinted handoff enabled.

One more thing: a write always goes to all replicas, independent of the provided
CL. The client request blocks only until the required replicas respond - the
remaining responses arrive asynchronously. This means that when you write with a
lower CL, the replicas still get the data at the same speed; your client just
does not wait for acknowledgment from all of them.

Ciao,
Maciej


On Fri, Aug 17, 2012 at 11:07 AM, Mohit Agarwal wrote:

> Hi guys,
>
> I am trying to understand what happens when an UnavailableException is
> thrown.
>
> a) Suppose we are doing a ConsistencyLevel.ALL write on a 3 node cluster.
> My understanding is that if one of the nodes is down and the coordinator
> node is aware of that(through gossip), then it will respond to the request
> with an UnavailableException. Is this correct?
>
> b) What happens if the coordinator isn't aware of a node being down and
> sends the request to all the nodes and never hears back from one of the
> node. Would this result in a TimedOutException or a UnavailableException?
>
> c) I am trying to understand the cases where the client receives an error,
> but data could have been inserted into Cassandra. One such case is the
> TimedOutException. Are there any other situations like these?
>
> Thanks,
> Mohit
>


Re: SSTable Index and Metadata - are they cached in RAM?

2012-08-17 Thread Maciej Miklas
Great articles, I did not find those before!

SSTable Index - yes, I mean the column index.

I would like to understand how many disk seeks might be required to find a
column in a single SSTable.

I am assuming a positive bloom filter on the row key. Now Cassandra needs to find
out whether the given SSTable contains the column name, and this might require a few
disk seeks:
1) Check the key cache; if found, go to 5)
2) Read all row keys from disk, in order to find ours (binary search)
3) The found row key contains the disk offset to its column index
4) Read the column index for our row key from disk. The index also contains a bloom
filter on column names
5) Use the bloom filter on the column name, to find out whether this SSTable might
contain our column
6) Read the column to finally make sure that it exists

As I understand it, in the worst case we can have three disk seeks (2, 4, 6)
per SSTable in order to check whether it contains a given column - is that
correct?

I would have expected that the sorted row keys (from point 2) already contain the bloom
filter for their columns. But the bloom filter is stored together with the column
index, is that correct?


Cheers,
Maciej

On Fri, Aug 17, 2012 at 12:06 AM, aaron morton wrote:

> What about SSTable index,
>
> Not sure what you are referring to there. Each row has a in a SStable has
> a bloom filter and may have an index of columns. This is not cached.
>
> See http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ or
> http://www.slideshare.net/aaronmorton/cassandra-sf-2012-technical-deep-dive-query-performance
>
>  and Metadata?
>
> This is the meta data we hold in memory for every open sstable
>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java
>
> Cheers
>
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 16/08/2012, at 7:34 PM, Maciej Miklas  wrote:
>
> Hi all,
>
> The bloom filter for row keys is always in RAM. What about the SSTable index and
> metadata?
>
> Are they cached by Cassandra, or does it rely on memory-mapped files?
>
>
> Thanks,
> Maciej
>
>
>


SSTable Index and Metadata - are they cached in RAM?

2012-08-16 Thread Maciej Miklas
Hi all,

The bloom filter for row keys is always in RAM. What about the SSTable index and
metadata?

Are they cached by Cassandra, or does it rely on memory-mapped files?


Thanks,
Maciej


Cassandra 1.0 - is disk seek required to access SSTable metadata

2012-08-08 Thread Maciej Miklas
Hi all,

Older Cassandra versions had to read columns from each SSTable with a
positive bloom filter in order to find the most recent value.
This was optimized with "Improve read performance in update-intensive
workload".
Now each SSTable has metadata - SSTableMetadata.

The bloom filter is stored in RAM, but what about the metadata?
Is a disk seek required to access it?

Thanks,
Maciej


CQL 3.0 - UPDATE Statement - how it works?

2012-04-25 Thread Maciej Miklas
CQL will have an UPDATE feature; I am trying to understand how this could work.

Every write is an append to an SSTable. UPDATE would need to change data, but
only if it exists, and this is problematic, since we have a distributed
system.

Is UPDATE a special kind of insert, which changes the given data only if it
already exists? Will UPDATE be resolved first during the read operation
(SSTable merge)?
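
For example, given a hypothetical table users (id bigint PRIMARY KEY, name text),
what exactly happens here if no row with id 1122 exists yet:

UPDATE users SET name = 'Alfred' WHERE id = 1122;

Is a new row created, or is the statement silently dropped?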


Thanks,
Maciej


Cassandra 1.1 - conflict resolution - any changes ?

2012-04-25 Thread Maciej Miklas
Hi,

I've seen this blog entry:
http://www.datastax.com/dev/blog/schema-in-cassandra-1-1 and I am trying to
understand how Cassandra could support PRIMARY KEY.

Cassandra has silent conflict resolution, where each insert overwrites the previous
one, and there are only inserts and deletes - no updates. The latest data
version is resolved at read time - as the latest entry from all
corresponding SSTables.
Is this still correct in Cassandra 1.1?


Thanks,
Maciej


Re: Schema advice/help

2012-03-28 Thread Maciej Miklas
Correct - I also see no other solution for this problem.

On Thu, Mar 29, 2012 at 1:46 AM, Guy Incognito  wrote:

>  well, no.  my assumption is that he knows what the 5 itemTypes (or
> appropriate corresponding ids) are, so he can do a known 5-rowkey lookup.
> if he does not know, then agreed, my proposal is not a great fit.
>
> could do (as originally suggested)
>
> userId -> itemType:activityId
>
> if you want to keep everything in the same row (again assumes that you
> know what the itemTypes are).  but then you can't really do a multiget, you
> have to do 5 separate slice queries, one for each item type.
>
> can also do some wacky stuff around maintaining a row that explicitly only
> holds the last 10 items by itemType (meaning you have to delete the oldest
> one everytime you insert a new one), but that prolly requires read-on-write
> etc and is a lot messier.  and you will prolly need to worry about the case
> where you (transiently) have more than 10 'latest' items for a single
> itemType.
>
> On 28/03/2012 09:49, Maciej Miklas wrote:
>
> yes - but anyway, in your example you need a "key range query" and that
> requires OPP, right?
>
> On Tue, Mar 27, 2012 at 5:13 PM, Guy Incognito  wrote:
>
>>  multiget does not require OPP.
>>
>> On 27/03/2012 09:51, Maciej Miklas wrote:
>>
>> multiget would require Order Preserving Partitioner, and this can lead to
>> unbalanced ring and hot spots.
>>
>> Maybe you can use a secondary index on "itemtype" - it must have small
>> cardinality:
>> http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/
>>
>>
>>
>> On Tue, Mar 27, 2012 at 10:10 AM, Guy Incognito wrote:
>>
>>> without the ability to do disjoint column slices, i would probably use 5
>>> different rows.
>>>
>>> userId:itemType -> activityId
>>>
>>> then it's a multiget slice of 10 items from each of your 5 rows.
>>>
>>>
>>> On 26/03/2012 22:16, Ertio Lew wrote:
>>>
>>>> I need to store activities by each user, on 5 items types. I always
>>>> want to read last 10 activities on each item type, by a user (ie, total
>>>> activities to read at a time =50).
>>>>
>>>> I am wanting to store these activities in a single row for each user so
>>>> that they can be retrieved in single row query, since I want to read all
>>>> the last 10 activities on each item.. I am thinking of creating composite
>>>> names appending "itemtype" : "activityId"(activityId is just timestamp
>>>> value) but then, I don't see about how to read the last 10 activities from
>>>> all itemtypes.
>>>>
>>>> Any ideas about schema to do this better way ?
>>>>
>>>
>>>
>>
>>
>
>


Re: Schema advice/help

2012-03-28 Thread Maciej Miklas
yes - but anyway, in your example you need a "key range query" and that
requires OPP, right?

On Tue, Mar 27, 2012 at 5:13 PM, Guy Incognito  wrote:

>  multiget does not require OPP.
>
> On 27/03/2012 09:51, Maciej Miklas wrote:
>
> multiget would require Order Preserving Partitioner, and this can lead to
> unbalanced ring and hot spots.
>
> Maybe you can use a secondary index on "itemtype" - it must have small
> cardinality:
> http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/
>
>
>
> On Tue, Mar 27, 2012 at 10:10 AM, Guy Incognito  wrote:
>
>> without the ability to do disjoint column slices, i would probably use 5
>> different rows.
>>
>> userId:itemType -> activityId
>>
>> then it's a multiget slice of 10 items from each of your 5 rows.
>>
>>
>> On 26/03/2012 22:16, Ertio Lew wrote:
>>
>>> I need to store activities by each user, on 5 items types. I always want
>>> to read last 10 activities on each item type, by a user (ie, total
>>> activities to read at a time =50).
>>>
>>> I am wanting to store these activities in a single row for each user so
>>> that they can be retrieved in single row query, since I want to read all
>>> the last 10 activities on each item.. I am thinking of creating composite
>>> names appending "itemtype" : "activityId"(activityId is just timestamp
>>> value) but then, I don't see about how to read the last 10 activities from
>>> all itemtypes.
>>>
>>> Any ideas about schema to do this better way ?
>>>
>>
>>
>
>


Re: Schema advice/help

2012-03-27 Thread Maciej Miklas
multiget would require the Order Preserving Partitioner, and this can lead to
an unbalanced ring and hot spots.

Maybe you can use a secondary index on "itemtype" - it must have small
cardinality:
http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/
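
Something like this (just a sketch - it assumes an "activities" table/CF with an
itemtype column):

CREATE INDEX ON activities (itemtype);
SELECT * FROM activities WHERE itemtype = 'news' LIMIT 10;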



On Tue, Mar 27, 2012 at 10:10 AM, Guy Incognito  wrote:

> without the ability to do disjoint column slices, i would probably use 5
> different rows.
>
> userId:itemType -> activityId
>
> then it's a multiget slice of 10 items from each of your 5 rows.
>
>
> On 26/03/2012 22:16, Ertio Lew wrote:
>
>> I need to store activities by each user, on 5 items types. I always want
>> to read last 10 activities on each item type, by a user (ie, total
>> activities to read at a time =50).
>>
>> I am wanting to store these activities in a single row for each user so
>> that they can be retrieved in single row query, since I want to read all
>> the last 10 activities on each item.. I am thinking of creating composite
>> names appending "itemtype" : "activityId"(activityId is just timestamp
>> value) but then, I don't see about how to read the last 10 activities from
>> all itemtypes.
>>
>> Any ideas about schema to do this better way ?
>>
>
>


Re: Cassandra - crash with “free() invalid pointer”

2012-03-26 Thread Maciej Miklas
I have a row cache - it's about 20 GB in this case.
The problem can be reproduced with our load test - we are using 20 reader
threads on a single Cassandra node.

I will retest it with Java 6 - it still looks to me like a JNA problem and the
JDK should not matter in this case, but we will see.


On Thu, Mar 22, 2012 at 8:27 PM, Benoit Perroud  wrote:

> Sounds like a race condition in the off heap caching while calling
> Unsafe.free().
>
> Do you use cache ? What is your use case when you encounter this error
> ? Are you able to reproduce it ?
>
>
> 2012/3/22 Maciej Miklas :
> > Hi *,
> >
> > My Cassandra installation runs on flowing system:
> >
> > Linux with Kernel 2.6.32.22
> > jna-3.3.0
> > Java 1.7.0-b147
> >
> > Sometimes we are getting following error:
> >
> > *** glibc detected *** /var/opt/java1.7/bin/java: free(): invalid
> pointer:
> > 0x7f66088a6000 ***
> > === Backtrace: =
> > /lib/libc.so.6[0x7f661d7099a8]
> > /lib/libc.so.6(cfree+0x76)[0x7f661d70bab6]
> > /lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x59)[0x7f661e02f349]
> > /lib/libpthread.so.0[0x7f661de09237]
> > /lib/libpthread.so.0[0x7f661de0931a]
> > /lib/libpthread.so.0[0x7f661de0a0bd]
> > /lib/libc.so.6(clone+0x6d)[0x7f661d76564d]
> > === Memory map: 
> > 0040-00401000 r-xp  68:07 537448203
> > /var/opt/jdk1.7.0/bin/java
> > 0060-00601000 rw-p  68:07 537448203
> > /var/opt/jdk1.7.0/bin/java
> > 01bae000-01fd rw-p  00:00 0
> > [heap]
> > 01fd-15798000 rw-p  00:00 0
> > [heap]
> > 40002000-40005000 ---p  00:00 0
> > 40005000-40023000 rw-p  00:00 0
> > 4003-40033000 ---p  00:00 0
> > 40033000-40051000 rw-p  00:00 0
> >
> > Does anyone have similar problems? or maybe some hints?
> >
> > Thanks,
> > Maciej
>
>
>
> --
> sent from my Nokia 3210
>


Re: Cassandra - crash with “free() invalid pointer”

2012-03-26 Thread Maciej Miklas
thanks - I will try it

On Thu, Mar 22, 2012 at 10:15 PM, Ben Coverston
wrote:

> Use a version of the Java 6 runtime, Cassandra hasn't been tested at all
> with the Java 7 runtime.
>
>
> On Thu, Mar 22, 2012 at 1:27 PM, Benoit Perroud wrote:
>
>> Sounds like a race condition in the off heap caching while calling
>> Unsafe.free().
>>
>> Do you use cache ? What is your use case when you encounter this error
>> ? Are you able to reproduce it ?
>>
>>
>> 2012/3/22 Maciej Miklas :
>> > Hi *,
>> >
>> > My Cassandra installation runs on flowing system:
>> >
>> > Linux with Kernel 2.6.32.22
>> > jna-3.3.0
>> > Java 1.7.0-b147
>> >
>> > Sometimes we are getting following error:
>> >
>> > *** glibc detected *** /var/opt/java1.7/bin/java: free(): invalid
>> pointer:
>> > 0x7f66088a6000 ***
>> > === Backtrace: =
>> > /lib/libc.so.6[0x7f661d7099a8]
>> > /lib/libc.so.6(cfree+0x76)[0x7f661d70bab6]
>> > /lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x59)[0x7f661e02f349]
>> > /lib/libpthread.so.0[0x7f661de09237]
>> > /lib/libpthread.so.0[0x7f661de0931a]
>> > /lib/libpthread.so.0[0x7f661de0a0bd]
>> > /lib/libc.so.6(clone+0x6d)[0x7f661d76564d]
>> > === Memory map: 
>> > 0040-00401000 r-xp  68:07 537448203
>> > /var/opt/jdk1.7.0/bin/java
>> > 0060-00601000 rw-p  68:07 537448203
>> > /var/opt/jdk1.7.0/bin/java
>> > 01bae000-01fd rw-p  00:00 0
>> > [heap]
>> > 01fd-15798000 rw-p  00:00 0
>> > [heap]
>> > 40002000-40005000 ---p  00:00 0
>> > 40005000-40023000 rw-p  00:00 0
>> > 4003-40033000 ---p  00:00 0
>> > 40033000-40051000 rw-p  00:00 0
>> >
>> > Does anyone have similar problems? or maybe some hints?
>> >
>> > Thanks,
>> > Maciej
>>
>>
>>
>> --
>> sent from my Nokia 3210
>>
>
>
>
> --
> Ben Coverston
> DataStax -- The Apache Cassandra Company
>
>


Cassandra - crash with “free() invalid pointer”

2012-03-22 Thread Maciej Miklas
Hi *,

My Cassandra installation runs on the following system:

   - Linux with Kernel 2.6.32.22
   - jna-3.3.0
   - Java 1.7.0-b147

Sometimes we are getting following error:

*** glibc detected *** /var/opt/java1.7/bin/java: free(): invalid
pointer: 0x7f66088a6000 ***
=== Backtrace: =
/lib/libc.so.6[0x7f661d7099a8]
/lib/libc.so.6(cfree+0x76)[0x7f661d70bab6]
/lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x59)[0x7f661e02f349]
/lib/libpthread.so.0[0x7f661de09237]
/lib/libpthread.so.0[0x7f661de0931a]
/lib/libpthread.so.0[0x7f661de0a0bd]
/lib/libc.so.6(clone+0x6d)[0x7f661d76564d]
=== Memory map: 
0040-00401000 r-xp  68:07 537448203
  /var/opt/jdk1.7.0/bin/java
0060-00601000 rw-p  68:07 537448203
  /var/opt/jdk1.7.0/bin/java
01bae000-01fd rw-p  00:00 0  [heap]
01fd-15798000 rw-p  00:00 0  [heap]
40002000-40005000 ---p  00:00 0
40005000-40023000 rw-p  00:00 0
4003-40033000 ---p  00:00 0
40033000-40051000 rw-p  00:00 0
Does anyone have similar problems? or maybe some hints?

Thanks,
Maciej


Cassandra as Database for Role Based Access Control System

2012-03-19 Thread Maciej Miklas
Hi *,

I would like to know your opinion about using Cassandra to implement a
RBAC-like authentication & authorization model. We have simplified the
central relationship of the general model (
http://en.wikipedia.org/wiki/Role-based_access_control) to:

user ---n:m--- role ---n:m--- resource

user(s) and resource(s) are indexed with externally visible identifiers.
These identifiers need to be "re-ownable" (think: mail aliases), too.

The main reason to consider Cassandra is the availability, scalability and
(global) geo-redundancy. This is hard to achieve with an RDBMS.

On the other hand, RBAC has many m:n relations. While some inconsistencies
may be acceptable, resource ownership (i.e. role=owner) must never ever be
mixed up.

What do you think? Is such a relational model an antipattern for Cassandra
usage? Do you know of similar solutions based on Cassandra?
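
The straightforward denormalized mapping I can think of would be something like
this (just a sketch in CQL-style notation, table names are made up - one table
per lookup direction):

CREATE TABLE roles_by_user     (user_id text, role_id text, PRIMARY KEY (user_id, role_id));
CREATE TABLE users_by_role     (role_id text, user_id text, PRIMARY KEY (role_id, user_id));
CREATE TABLE resources_by_role (role_id text, resource_id text, PRIMARY KEY (role_id, resource_id));
CREATE TABLE roles_by_resource (resource_id text, role_id text, PRIMARY KEY (resource_id, role_id));

Every role assignment would then have to be written in both directions, which is
where the consistency worry comes from.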


Regards,

Maciej


PS: I've also posted this question on Stack Overflow, but I would like to
get feedback from the Cassandra community as well.


Re: hector connection pool

2012-03-05 Thread Maciej Miklas
Have you tried to change:
me.prettyprint.cassandra.service.CassandraHostConfigurator#retryDownedHostsDelayInSeconds
?

Hector will ping downed hosts every xx seconds and recover the connection.

Regards,
Maciej

On Mon, Mar 5, 2012 at 8:13 PM, Daning Wang  wrote:

> I just got this error ": All host pools marked down. Retry burden pushed
> out to client." in a few clients recently, client could not  recover, we
> have to restart client application.  we are using 0.8.0.3 hector.
>
> At that time we did compaction  for a CF, it takes several hours, server
> was busy. But I think client should recover after server load was down.
>
> Any bug reported about this? I did search but could not find one.
>
> Thanks,
>
> Daning
>
>


Cassandra cache patterns with thiny and wide rows

2012-03-05 Thread Maciej Miklas
I've already asked this question on Stack Overflow but got no answer - I will
try again:


My use case expects a heavy read load - there are two possible model design
strategies:

   1. Tiny rows with row cache: in this case the row is small enough to fit into
   RAM and all columns are cached. Read access should be fast.

   2. Wide rows with key cache: wide rows with a large number of columns are too big
   for the row cache. Access to a column subset requires an HDD seek.

As I understand it, using wide rows is a good design pattern. But we would need
to disable the row cache - so what is the benefit of such a wide row (at
least for read access)?

Which approach is better 1 or 2?
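
In CQL-style notation the two options would look roughly like this (just a
sketch, table and column names are made up):

-- 1) tiny rows, whole row cached:
CREATE TABLE user_profile (
    id    text PRIMARY KEY,
    name  text,
    email text
) WITH caching = 'ALL';

-- 2) wide rows, key cache only:
CREATE TABLE user_events (
    id         text,
    event_time timeuuid,
    payload    text,
    PRIMARY KEY (id, event_time)
) WITH caching = 'KEYS_ONLY';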


Cassandra - row range and column slice

2012-02-17 Thread Maciej Miklas
Hello,

assuming an ordered partitioner, I would like to have the possibility to find
records by a row key range and columns by slice - for example:

give me all rows between 2001 and 2003 and all columns between A and C.

For such data:
{
  2001: {A:"v1", Z:"v2"},
  2002: {R:"v2", Z:"v3"},
  2003: {C:"v4", Z:"v5"},
  2004: {A:"v1",B:"v33", Z:"v2"}
}
Result would be:
  2001: {A:"v1""},
  2003: {C:"v4"}

Is such a multi-slice query possible with Cassandra? Are there any
performance issues (besides an unbalanced cluster)?

Thanks,
Maciej


Re: Data Model Design for Login Service

2011-11-20 Thread Maciej Miklas
I will follow exactly this solution - thanks :)

On Fri, Nov 18, 2011 at 9:53 PM, David Jeske  wrote:

> On Thu, Nov 17, 2011 at 1:08 PM, Maciej Miklas 
> wrote:
>
>> A) Skinny rows
>>  - row key contains login name - this is the main search criteria
>>  - login data is replicated - each possible login is stored as single row
>> which contains all user data - 10 logins for
>
> single customer create 10 rows, where each row has different key and the
>> same content
>>
>
> To me this seems reasonable. Remember, because of your replication of the
> datavalues you will want a quick way to find all the logins for a given ID,
> so you will also want to store a separate dataset like:
>
> 1122 {
>  alfred.tes...@xyz.de =1(where the login is a column key)
>  alf...@aad.de =1
> }
>
> When you do an update, you'll need to fetch the entire row for the
> user-id, and then update all copies of the data. THis can create problems,
> if the data is out of sync (which it will be at certain times because of
> eventual consistency, and might be if something bad happens).
>
> ...the other option, of course, is to make a login-name indirection. You
> would have only one copy of the user-data stored by ID, and then you would
> store a separate mapping from login-name-to-ID. Of course this would
> require two roundtrips to get the user information from login-id, which is
> something I know you said you didn't want to do.
>
>
>


Re: Data Model Design for Login Service

2011-11-17 Thread Maciej Miklas
but secondary index is limited only to repeating values like enums. In my
case I would have performance issue. right?

On 18.11.2011, at 02:08, Maxim Potekhin  wrote:

 1122: {
  gender: MALE
  birthdate: 1987.11.09
  name: Alfred Tester
  pwd: e72c504dc16c8fcd2fe8c74bb492affa
  alias1: alfred.tes...@xyz.de
  alias2: alf...@aad.de
  alias3: a...@dd.de
 }

...and you can use secondary indexes to query on anything.

Maxim


On 11/17/2011 4:08 PM, Maciej Miklas wrote:

Hello all,

I need your help to design structure for simple login service. It contains
about 100.000.000 customers and each one can have about 10 different logins
- this results 1.000.000.000 different logins.

Each customer contains following data:
- one to many login names as string, max 20 UTF-8 characters long
- ID as long - one customer has only one ID
- gender
- birth date
- name
- password as MD5

Login process needs to find user by login name.
Data in Cassandra is replicated - this is necessary to obtain all required
login data in single call. Also usually we expect low write traffic and
heavy read traffic - round trips for reading data should be avoided.
Below I've described two possible cassandra data models based on example:
we have two users, first user has two logins and second user has three
logins

A) Skinny rows
 - row key contains login name - this is the main search criteria
 - login data is replicated - each possible login is stored as single row
which contains all user data - 10 logins for single customer create 10
rows, where each row has different key and the same content

// first 3 rows has different key and the same replicated data
alfred.tes...@xyz.de {
  id: 1122
  gender: MALE
  birthdate: 1987.11.09
  name: Alfred Tester
  pwd: e72c504dc16c8fcd2fe8c74bb492affa
},
alf...@aad.de {
  id: 1122
  gender: MALE
  birthdate: 1987.11.09
  name: Alfred Tester
  pwd: e72c504dc16c8fcd2fe8c74bb492affa
},
a...@dd.de {
  id: 1122
  gender: MALE
  birthdate: 1987.11.09
  name: Alfred Tester
  pwd: e72c504dc16c8fcd2fe8c74bb492affa
},

// two following rows has again the same data for second customer
manf...@xyz.de {
  id: 1133
  gender: MALE
  birthdate: 1997.02.01
  name: Manfredus Maximus
  pwd: e44c504ff16c8fcd2fe8c74bb492adda
},
rober...@xyz.de {
  id: 1133
  gender: MALE
  birthdate: 1997.02.01
  name: Manfredus Maximus
  pwd: e44c504ff16c8fcd2fe8c74bb492adda
}

B) Rows grouped by alphabetical prefix
- The number of rows is limited - for example to the first letter of the login name
- Each row contains all logins which begin with the row key - the row with key 'a'
contains all logins which begin with 'a'
- Data might be unbalanced, but we avoid skinny rows - this might have
positive performance impact (??)
- to avoid super columns each row contains directly columns, where column
name is the user login and column value is corresponding data in kind of
serialized form (I would like to have is human readable)

a {
alfred.tes...@xyz.de:"1122;MALE;1987.11.09;
 Alfred
Tester;e72c504dc16c8fcd2fe8c74bb492affa",

alf...@aad.de@xyz.de:"1122;MALE;1987.11.09;
 Alfred
Tester;e72c504dc16c8fcd2fe8c74bb492affa",

a...@dd.de@xyz.de:"1122;MALE;1987.11.09;
 Alfred
Tester;e72c504dc16c8fcd2fe8c74bb492affa"
  },

m {
manf...@xyz.de:"1133;MALE;1997.02.01;
  Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
  },

r {
rober...@xyz.de:"1133;MALE;1997.02.01;
  Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"

  }

Which solution is better, especially for better read performance? Do you
have better idea?

Thanks,
Maciej


Data Model Design for Login Service

2011-11-17 Thread Maciej Miklas
Hello all,

I need your help to design the structure for a simple login service. It contains
about 100,000,000 customers and each one can have about 10 different logins
- this results in 1,000,000,000 different logins.

Each customer contains following data:
- one to many login names as string, max 20 UTF-8 characters long
- ID as long - one customer has only one ID
- gender
- birth date
- name
- password as MD5

The login process needs to find a user by login name.
Data in Cassandra is replicated - this is necessary to obtain all required
login data in a single call. Also, we usually expect low write traffic and
heavy read traffic - round trips for reading data should be avoided.
Below I've described two possible Cassandra data models based on an example:
we have two users, the first user has three logins and the second user has two
logins

A) Skinny rows
 - the row key contains the login name - this is the main search criterion
 - login data is replicated - each possible login is stored as a single row
which contains all user data - 10 logins for a single customer create 10
rows, where each row has a different key and the same content

// first 3 rows has different key and the same replicated data
alfred.tes...@xyz.de {
  id: 1122
  gender: MALE
  birthdate: 1987.11.09
  name: Alfred Tester
  pwd: e72c504dc16c8fcd2fe8c74bb492affa
},
alf...@aad.de {
  id: 1122
  gender: MALE
  birthdate: 1987.11.09
  name: Alfred Tester
  pwd: e72c504dc16c8fcd2fe8c74bb492affa
},
a...@dd.de {
  id: 1122
  gender: MALE
  birthdate: 1987.11.09
  name: Alfred Tester
  pwd: e72c504dc16c8fcd2fe8c74bb492affa
},

// two following rows has again the same data for second customer
manf...@xyz.de {
  id: 1133
  gender: MALE
  birthdate: 1997.02.01
  name: Manfredus Maximus
  pwd: e44c504ff16c8fcd2fe8c74bb492adda
},
rober...@xyz.de {
  id: 1133
  gender: MALE
  birthdate: 1997.02.01
  name: Manfredus Maximus
  pwd: e44c504ff16c8fcd2fe8c74bb492adda
}

B) Rows grouped by alphabetical prefix
- The number of rows is limited - for example to the first letter of the login name
- Each row contains all logins which begin with the row key - the row with key 'a'
contains all logins which begin with 'a'
- Data might be unbalanced, but we avoid skinny rows - this might have a
positive performance impact (??)
- To avoid super columns, each row contains columns directly, where the column
name is the user login and the column value is the corresponding data in a kind of
serialized form (I would like it to be human readable)

a {
alfred.tes...@xyz.de:"1122;MALE;1987.11.09;
 Alfred
Tester;e72c504dc16c8fcd2fe8c74bb492affa",

alf...@aad.de@xyz.de:"1122;MALE;1987.11.09;
 Alfred
Tester;e72c504dc16c8fcd2fe8c74bb492affa",

a...@dd.de@xyz.de:"1122;MALE;1987.11.09;
 Alfred
Tester;e72c504dc16c8fcd2fe8c74bb492affa"
  },

m {
manf...@xyz.de:"1133;MALE;1997.02.01;
  Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
  },

r {
rober...@xyz.de:"1133;MALE;1997.02.01;
  Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"

  }

Which solution is better, especially for read performance? Do you have a
better idea?
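
For reference, option A would correspond roughly to this kind of lookup table in
CQL-style notation (just a sketch, names are illustrative):

CREATE TABLE logins_by_name (
    login     text PRIMARY KEY,
    id        bigint,
    gender    text,
    birthdate text,
    name      text,
    pwd       text
);

-- the login check is then a single-partition read:
SELECT id, name, pwd FROM logins_by_name WHERE login = 'alfred.tes...@xyz.de';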

Thanks,
Maciej


Re: Off-heap caching through ByteBuffer.allocateDirect when JNA not available ?

2011-11-10 Thread Maciej Miklas
I would like to know that also - actually it should be similar, plus there
are no dependencies on sun.misc packages.

Regards,
Maciej

On Thu, Nov 10, 2011 at 1:46 PM, Benoit Perroud  wrote:

> Thanks for the answer.
> I saw the move to sun.misc.
> In what sense allocateDirect is broken ?
>
> Thanks,
>
> Benoit.
>
>
> 2011/11/9 Jonathan Ellis :
> > allocateDirect is broken for this purpose, but we removed the JNA
> > dependency using sun.misc.Unsafe instead:
> > https://issues.apache.org/jira/browse/CASSANDRA-3271
> >
> > On Wed, Nov 9, 2011 at 5:54 AM, Benoit Perroud 
> wrote:
> >> Hi,
> >>
> >> I wonder if you have already discussed about ByteBuffer.allocateDirect
> >> alternative to JNA memory allocation ?
> >>
> >> If so, do someone mind send me a pointer ?
> >>
> >> Thanks !
> >>
> >> Benoit.
> >>
> >
> >
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of DataStax, the source for professional Cassandra support
> > http://www.datastax.com
> >
>
>
>
> --
> sent from my Nokia 3210
>


Re: Cassandra 1.x and proper JNA setup

2011-11-04 Thread Maciej Miklas
Super - thank you for help :)

On Thu, Nov 3, 2011 at 6:55 PM, Jonathan Ellis  wrote:

> Relying on that was always a terrible idea because you could easily
> OOM before it could help.  There's no substitute for "don't make the
> caches too large" in the first place.
>
> We're working on https://issues.apache.org/jira/browse/CASSANDRA-3143
> to make cache sizing easier.
>
> On Thu, Nov 3, 2011 at 3:16 AM, Maciej Miklas 
> wrote:
> > According to source code, JNA is being used to call malloc and free. In
> this
> > case each cached row will be serialized into RAM.
> > We must be really careful when defining cache size - to large size would
> > cause out of memory. Previous Cassandra releases has logic that would
> > decrease cache size if heap is low.
> > Currently each row will be serialized without any memory limit checks -
> > assuming that I understood it right.
> >
> > Those properties:
> >reduce_cache_sizes_at: 0.85
> >reduce_cache_capacity_to: 0.6
> > are not used anymore - at least not when JNA is enabled, witch is default
> > from Cassandra 1.0
> >
> >
> > On Wed, Nov 2, 2011 at 1:53 PM, Maciej Miklas  >
> > wrote:
> >>
> >> I've just found that JNA will not be used from the 1.1 release -
> >> https://issues.apache.org/jira/browse/CASSANDRA-3271
> >> It would also be nice to know the reason for this decision.
> >>
> >> Regards,
> >> Maciej
> >>
> >> On Wed, Nov 2, 2011 at 1:34 PM, Viktor Jevdokimov
> >>  wrote:
> >>>
> >>> Up, also interested in answers to questions below.
> >>>
> >>>
> >>> Best regards/ Pagarbiai
> >>>
> >>> Viktor Jevdokimov
> >>> Senior Developer
> >>>
> >>> Email: viktor.jevdoki...@adform.com
> >>> Phone: +370 5 212 3063
> >>> Fax: +370 5 261 0453
> >>>
> >>> J. Jasinskio 16C,
> >>> LT-01112 Vilnius,
> >>> Lithuania
> >>>
> >>>
> >>>
> >>> Disclaimer: The information contained in this message and attachments
> is
> >>> intended solely for the attention and use of the named addressee and
> may be
> >>> confidential. If you are not the intended recipient, you are reminded
> that
> >>> the information remains the property of the sender. You must not use,
> >>> disclose, distribute, copy, print or rely on this e-mail. If you have
> >>> received this message in error, please contact the sender immediately
> and
> >>> irrevocably delete this message and any copies.-Original
> Message-
> >>> From: Maciej Miklas [mailto:mac.mik...@googlemail.com]
> >>> Sent: Tuesday, November 01, 2011 11:15
> >>> To: user@cassandra.apache.org
> >>> Subject: Cassandra 1.x and proper JNA setup
> >>>
> >>> Hi all,
> >>>
> >>> is there any documentation about proper JNA configuration?
> >>>
> >>> I do not understand a few things:
> >>>
> >>> 1) Does JNA use JVM heap settings?
> >>>
> >>> 2) Do I need to decrease max heap size while using JNA?
> >>>
> >>> 3) How do I limit RAM allocated by JNA?
> >>>
> >>> 4) Where can I see / monitor row cache size?
> >>>
> >>> 5) I've configured JNA just for a test on my dev computer and so far I've
> >>> noticed serious performance issues (high cpu usage on heavy write load),
> >>> so I must be doing something wrong. I've just copied JNA jars into
> >>> Cassandra/lib, without installing any native libs. This should not work
> >>> at all, right?
> >>>
> >>> Thanks,
> >>> Maciej
> >>>
> >>
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Re: Cassandra 1.x and proper JNA setup

2011-11-03 Thread Maciej Miklas
According to source code, JNA is being used to call malloc and free. In
this case each cached row will be serialized into RAM.
We must be really careful when defining cache size - too large a size would
cause out of memory. Previous Cassandra releases had logic that would
decrease the cache size if the heap was low.
Currently each row will be serialized without any memory limit checks -
assuming that I understood it right.

Those properties:
   reduce_cache_sizes_at: 0.85
   reduce_cache_capacity_to: 0.6
are not used anymore - at least not when JNA is enabled, which is the default
from Cassandra 1.0


On Wed, Nov 2, 2011 at 1:53 PM, Maciej Miklas wrote:

> I've just found that JNA will not be used from the 1.1 release -
> https://issues.apache.org/jira/browse/CASSANDRA-3271
> It would also be nice to know the reason for this decision.
>
> Regards,
> Maciej
>
>
> On Wed, Nov 2, 2011 at 1:34 PM, Viktor Jevdokimov <
> viktor.jevdoki...@adform.com> wrote:
>
>> Up, also interested in answers to questions below.
>>
>>
>> Best regards/ Pagarbiai
>>
>> Viktor Jevdokimov
>> Senior Developer
>>
>> Email: viktor.jevdoki...@adform.com
>> Phone: +370 5 212 3063
>> Fax: +370 5 261 0453
>>
>> J. Jasinskio 16C,
>> LT-01112 Vilnius,
>> Lithuania
>>
>>
>>
>> Disclaimer: The information contained in this message and attachments is
>> intended solely for the attention and use of the named addressee and may be
>> confidential. If you are not the intended recipient, you are reminded that
>> the information remains the property of the sender. You must not use,
>> disclose, distribute, copy, print or rely on this e-mail. If you have
>> received this message in error, please contact the sender immediately and
>> irrevocably delete this message and any copies.-Original Message-
>> From: Maciej Miklas [mailto:mac.mik...@googlemail.com]
>> Sent: Tuesday, November 01, 2011 11:15
>> To: user@cassandra.apache.org
>> Subject: Cassandra 1.x and proper JNA setup
>>
>> Hi all,
>>
>> is there any documentation about proper JNA configuration?
>>
>> I do not understand a few things:
>>
>> 1) Does JNA use JVM heap settings?
>>
>> 2) Do I need to decrease max heap size while using JNA?
>>
>> 3) How do I limit RAM allocated by JNA?
>>
>> 4) Where can I see / monitor row cache size?
>>
>> 5) I've configured JNA just for a test on my dev computer and so far I've
>> noticed serious performance issues (high cpu usage on heavy write load), so
>> I must be doing something wrong. I've just copied JNA jars into
>> Cassandra/lib, without installing any native libs. This should not work at
>> all, right?
>>
>> Thanks,
>> Maciej
>>
>>
>


Re: Cassandra 1.x and proper JNA setup

2011-11-02 Thread Maciej Miklas
I've just found that JNA will not be used from the 1.1 release -
https://issues.apache.org/jira/browse/CASSANDRA-3271
It would also be nice to know the reason for this decision.

Regards,
Maciej

On Wed, Nov 2, 2011 at 1:34 PM, Viktor Jevdokimov <
viktor.jevdoki...@adform.com> wrote:

> Up, also interested in answers to questions below.
>
>
> Best regards/ Pagarbiai
>
> Viktor Jevdokimov
> Senior Developer
>
> Email: viktor.jevdoki...@adform.com
> Phone: +370 5 212 3063
> Fax: +370 5 261 0453
>
> J. Jasinskio 16C,
> LT-01112 Vilnius,
> Lithuania
>
>
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.-Original Message-
> From: Maciej Miklas [mailto:mac.mik...@googlemail.com]
> Sent: Tuesday, November 01, 2011 11:15
> To: user@cassandra.apache.org
> Subject: Cassandra 1.x and proper JNA setup
>
> Hi all,
>
> is there any documentation about proper JNA configuration?
>
> I do not understand a few things:
>
> 1) Does JNA use JVM heap settings?
>
> 2) Do I need to decrease max heap size while using JNA?
>
> 3) How do I limit RAM allocated by JNA?
>
> 4) Where can I see / monitor row cache size?
>
> 5) I've configured JNA just for a test on my dev computer and so far I've
> noticed serious performance issues (high cpu usage on heavy write load), so
> I must be doing something wrong. I've just copied JNA jars into
> Cassandra/lib, without installing any native libs. This should not work at
> all, right?
>
> Thanks,
> Maciej
>
>


Cassandra 1.x and proper JNA setup

2011-11-01 Thread Maciej Miklas
Hi all,

is there any documentation about proper JNA configuration?

I do not understand a few things:

1) Does JNA use JVM heap settings?

2) Do I need to decrease max heap size while using JNA?

3) How do I limit RAM allocated by JNA?

4) Where can I see / monitor row cache size?

5) I've configured JNA just for a test on my dev computer and so far
I've noticed serious performance issues (high cpu usage on heavy write
load), so I must be doing something wrong. I've just copied JNA
jars into Cassandra/lib, without installing any native libs. This
should not work at all, right?

Thanks,
Maciej
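
On question 4, a rough way to check on a running node (which cache figures appear, and whether they are per column family or global, differs between 1.0 and 1.1, so the comment below is an assumption):

# dump per-column-family statistics and look for the row cache
# size / capacity / hit rate lines of the CF in question
./bin/nodetool cfstats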


Re: Row Cache Heap Requirements (Cassandra 1.0)

2011-10-28 Thread Maciej Miklas
this is how I tested it:

1) load the cache with 1.500.000 entries
2) execute a full gc
3) measure the heap size (using VisualVM)
4) flush the row cache over the cli
5) execute a full gc
6) measure the heap usage again

The difference between 6) and 3) is the heap size used by the cache

On Fri, Oct 28, 2011 at 3:26 PM, Peter Schuller  wrote:

> > Is it possible that a single row (8 columns) can allocate about 2KB of heap?
>
> It sounds a bit much, though not extremely so (depending on how much
> overhead there is per-column relative to per-row). Are you definitely
> looking at the live size of the heap (for example, trigger a full GC
> and look at results) and not just how much data there happens to be on
> the heap after your insertion?
>
> In any case, if you are looking for better memory efficiency I advise
> looking at the off-heap row cache
> (
> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management
> ).
> It is supposed to be enabled by default. If it's not, do you have JNA
> installed?
>
> The reason I say that is that the off-heap cache stores serialized
> information off the heap rather than a full tree of Java objects on it. If
> off-heap caching is enabled, 2 kb/row key would be far far more than
> expected (unless I'm missing something, I've yet to actually
> measure it myself ;)).
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>


Row Cache Heap Requirements (Cassandra 1.0)

2011-10-28 Thread Maciej Miklas
Hi all,

I've tested the row cache and found out that it requires a large amount of heap -
I would like to verify this theory.

This is my test key space:
{
 TestCF: {

row_key_1: {
{ clientKey: "MyTestCluientKey" },
{ tokenSecret: "kd94hf93k423kf44" },
{ verifier: "hfdp7dh39dks9884" },
{ callbackUrl: "http%3A%2F%2Fprinter.test.com%2Fready" },
{ accountId: "234567876545"},
{ mytestResourceId: "ADB112"},
{ dataTimestamp: "1308903420400" },
{ dataType: "ACCESS_PERMANENT"}
},
row_key_2: {
{ clientKey: "MyTestCluientKey" },
{ tokenSecret: "qdqergvhetyhvetyh" },
{ verifier: "wtrgvebyjnrnuiucewrqxcc" },
{ callbackUrl: "http%3A%2F%2Fprinter.test.com%2Fready" },
{ accountId: "23456789746534"},
{ mytestResourceId: "DQERGCWRTHB"},
{ dataTimestamp: "130890342333200" },
{ dataType: "ACCESS_LIMITED"}
},

...
row_key_x: {

},

}
}

Each row in CF TestCF contains 8 columns. Row cache is enabled, key cache
is disabled. Row hit rate is 0.99 - this is a read-only test.
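
In CQL 3 terms the same column family, with the row cache on and the key cache off, could be declared roughly like this (a sketch only; the column types and the exact caching value are assumptions):

CREATE TABLE "TestCF" (
    row_key            text PRIMARY KEY,
    "clientKey"        text,
    "tokenSecret"      text,
    verifier           text,
    "callbackUrl"      text,
    "accountId"        text,
    "mytestResourceId" text,
    "dataTimestamp"    text,
    "dataType"         text
) WITH caching = 'ROWS_ONLY';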

My test loads 1.500.000 rows into the cache - and this allocates about 3.5GB of
heap - this is about 2KB per single row - this is a lot.

Is it possible that a single row (8 columns) can allocate about 2KB of heap?


Thank you,
Maciej


Re: Cassandra as session store under heavy load

2011-10-13 Thread Maciej Miklas
durable_writes sounds great - thank you! I really do not need the commit log
here.

Another question: is it possible to configure the lifetime of tombstones?


Regards,
Maciej
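
For reference, both settings mentioned above can be expressed in CQL 3; a minimal sketch, assuming a keyspace named sessions_ks and a table named sessions - durable_writes = false skips the commit log for that keyspace, and gc_grace_seconds controls how long tombstones are kept before compaction may remove them:

CREATE KEYSPACE sessions_ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
  AND durable_writes = false;   -- writes to this keyspace bypass the commit log

-- tombstones older than one hour become eligible for removal at compaction
ALTER TABLE sessions_ks.sessions WITH gc_grace_seconds = 3600;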


Re: Cassandra as session store under heavy load

2011-10-11 Thread Maciej Miklas
- RF is 1. We have a few KeySpaces; only this one is not replicated - this
data is not that important. In case of an error the customer will have to
execute the process again. But again, I would like to persist it.
- Serializing the data is not an option, because I would like to have the
possibility to access the data using the console
- I will keep the row cache - you are right, there is no guarantee that my data
is still in the Memtable

I will get my hardware soon (3 servers) and we will see ;) In the worst
case I will switch my session storage to memcached and leave all other data
in Cassandra (no TTL, or a very long one)

A couple more questions:
- Is using Cassandra to build something like an "HTTP session store" with a short
TTL not an anti-pattern?
- Is there really no way to tell Cassandra that a particular Key Space should
be stored "mostly" in RAM, with only asynchronous backup to HDD (JMS has
something like that)?


Thanks,
Maciej


Cassandra as session store under heavy load

2011-10-11 Thread Maciej Miklas
Hi *,

I would like to use Cassandra to store session-related information. I do
not have a real HTTP session - it's a different protocol, but the same concept.

Memcached would be fine, but I would like to additionally persist the data.

Cassandra setup:

   - non replicated Key Space
   - single Column Family, where the key is the session ID and each column within
   the row stores a single key/value pair - effectively a map of key/value maps per session
   - column TTL = 10 minutes
   - write CL = ONE
   - read CL = ONE
   - 2.000 writes/s
   - 5.000 reads/s

Data example:

session1:{ // CF row key
   {prop1:val1, TTL:10 min},
   {prop2:val2, TTL:10 min},
.
   {propXXX:val3, TTL:10 min}
},
session2:{ // CF row key
   {prop1:val1, TTL:10 min},
   {prop2:val2, TTL:10 min},
},
..
session:{ // CF row key
   {prop1:val1, TTL:10 min},
   {prop2:val2, TTL:10 min},
}
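
A rough CQL 3 equivalent of this layout (a sketch, table and column names assumed), where each session is one partition, each property is a clustering column, and the 10-minute TTL is applied per write:

CREATE TABLE sessions (
    session_id text,
    prop_name  text,
    prop_value text,
    PRIMARY KEY (session_id, prop_name)
);

-- every property expires 10 minutes after it was written
INSERT INTO sessions (session_id, prop_name, prop_value)
VALUES ('session1', 'prop1', 'val1') USING TTL 600;

-- read back the whole session (a single-partition slice)
SELECT prop_name, prop_value FROM sessions WHERE session_id = 'session1';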

In this case consistency is not a problem, but the performance could be,
especially disk IO.

Since the data in my session lives for a short time, I would like to avoid
storing it on the hard drive - except for the commit log.

I have some questions:

   1. If a column expires in the Memtable before it is flushed to an SSTable, will
   Cassandra still store such a column in the SSTable (flush it to HDD)?
   2. Replication is disabled for my Key Space; in this case storing such an
   expired column in an SSTable would not be necessary, right?
   3. Each CF has max 10 columns. In such a case I would enable the row cache and
   disable the key cache. But I expect my data to still be available in the
   Memtable, in which case I could disable the whole cache, right?
   4. Any Cassandra configuration hints for such a session-store use case
   would be really appreciated :)

Thank you,

Maciej