Re: Using the commit log for external synchronization

2012-09-21 Thread Ben Hood
Brian,

On Sep 22, 2012, at 1:46, "Brian O'Neill"  wrote:

>> IMHO it's a better design to multiplex the data stream at the application
>> level.
> +1, agreed.
> 
> That is where we ended up. (and Storm is proving to be a solid
> framework for that)

Thanks for the heads up, I'll check it out.

Cheers,

Ben


Re: Using the commit log for external synchronization

2012-09-21 Thread Ben Hood
Rob,

On Sep 22, 2012, at 0:39, Rob Coli  wrote:

> The above gets you most of the way there, but Aaron's point about the
> commitlog not reflecting whether the app met its CL remains true. The
> possibility that Cassandra might coalesce to a value that the
> application does not know was successfully written is one of its known
> edge cases...

Thanks for pointing out the possibility of using the replay facility, though I 
think I'll take on board your observation that the commit log is not guaranteed 
to give me the data I want (aside from the fact that you would be building a 
dependency on an internal API).

Cheers,

Ben

Re: Using Cassandra BulkOuputFormat With newer versions of Hadoop (.23+)

2012-09-21 Thread Dave Brosius
I swapped in hadoop-core-1.0.3.jar and rebuilt cassandra without 
issues. What problems were you having?



On 09/21/2012 07:40 PM, Juan Valencia wrote:


I can't seem to get bulk loading to work in newer versions of Hadoop.
Since they switched JobContext from a class to an interface,
you lose binary backward compatibility:
Exception in thread "main" java.lang.IncompatibleClassChangeError: 
Found interface org.apache.hadoop.mapreduce.JobContext, but class was 
expected
at 
org.apache.cassandra.hadoop.BulkOutputFormat.checkOutputSpecs(BulkOutputFormat.java:42)


I tried recompiling against the newer Hadoop, but things got messy 
fast.  Has anyone done this? 
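
For anyone hitting the same error, a rough sketch of why a rebuild is needed 
(the class below is illustrative, not actual Cassandra code):

import org.apache.hadoop.mapreduce.JobContext;

// Compiled against Hadoop 1.x, where JobContext is a concrete class, the call
// below is emitted as an INVOKEVIRTUAL instruction. Hadoop 0.23+ ships
// JobContext as an interface, which requires INVOKEINTERFACE, so the JVM
// throws IncompatibleClassChangeError at runtime even though the source
// compiles cleanly against either version. Recompiling against the newer
// Hadoop regenerates the bytecode, which is why swapping jars alone is not
// enough.
public class JobContextCompatSketch {
    public void checkOutputSpecs(JobContext context) {
        context.getConfiguration();
    }
}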




Re: Kundera 2.1 released

2012-09-21 Thread Brian O'Neill

Well done, Vivek and team!!  This release was much anticipated.

I'll give this a test with Spring Data JPA when I return from vacation.

thanks,
-brian


On Sep 21, 2012, at 9:15 PM, Vivek Mishra wrote:

> Hi All,
> 
> We are happy to announce release of Kundera 2.0.7.
> 
> Kundera is a JPA 2.0 based, object-datastore mapping library for NoSQL 
> datastores. The idea behind Kundera is to make working with NoSQL Databases
> drop-dead simple and fun. It currently supports Cassandra, HBase, MongoDB and 
> relational databases.
> 
> Major Changes in this release:
> ---
> * Allow user to set specific CQL versioning.
> 
> * Batch insert/update for Cassandra/MongoDB/HBase.
> 
> * Extended JPA Metamodel/TypedQuery/ProviderUtil implementation.
> 
> * Another Thrift client implementation for Cassandra.
> 
> * Deprecated support for properties with XML based Column family/Table/server 
> specific property configuration for Cassandra, MongoDB and HBase.
> 
> * Stronger query support:
>  a) JPQL support over all data types and associations.
>  b) JPQL support to query using primary key along with other columns.
> 
>  * Fixed github issues:
> 
>https://github.com/impetus-opensource/Kundera/issues/90
>https://github.com/impetus-opensource/Kundera/issues/91
>https://github.com/impetus-opensource/Kundera/issues/92
>https://github.com/impetus-opensource/Kundera/issues/93
>https://github.com/impetus-opensource/Kundera/issues/94
>https://github.com/impetus-opensource/Kundera/issues/96
>https://github.com/impetus-opensource/Kundera/issues/98
>https://github.com/impetus-opensource/Kundera/issues/99
>https://github.com/impetus-opensource/Kundera/issues/100
>https://github.com/impetus-opensource/Kundera/issues/101
>https://github.com/impetus-opensource/Kundera/issues/102
>https://github.com/impetus-opensource/Kundera/issues/104
>https://github.com/impetus-opensource/Kundera/issues/106
>https://github.com/impetus-opensource/Kundera/issues/107 
>https://github.com/impetus-opensource/Kundera/issues/108
>https://github.com/impetus-opensource/Kundera/issues/109
>https://github.com/impetus-opensource/Kundera/issues/111
>https://github.com/impetus-opensource/Kundera/issues/112   
>https://github.com/impetus-opensource/Kundera/issues/116
> 
> 
> To download, use or contribute to Kundera, visit:
> http://github.com/impetus-opensource/Kundera
> 
> Latest released tag version is 2.1. Kundera maven libraries are now available 
> at: https://oss.sonatype.org/content/repositories/releases/com/impetus and 
> http://kundera.googlecode.com/svn/maven2/maven-missing-resources.
> 
> Sample codes and examples for using Kundera can be found here:
> http://github.com/impetus-opensource/Kundera-Examples
> and 
> https://github.com/impetus-opensource/Kundera/tree/trunk/kundera-tests
> 
> Thank you all for your contributions!
> 
> Regards,
> Kundera Team.

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Re: Data Modeling - JSON vs Composite columns

2012-09-21 Thread Bill

> How does approach B work in CQL? Can we read/write a JSON
> easily in CQL? Can we extract a field from a JSON in CQL
> or would that need to be done via the client code?

Via client code. Support for this is much the same as support for JSON 
CLOBs in an RDBMS.
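
For example, a minimal client-side sketch (assuming Jackson is on the 
classpath and that the JSON blob for an item has already been read as a 
column value; the field name mirrors approach B in the quoted mail below):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonColumnReader {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // CQL only sees an opaque string, so pulling a single field ("Qty")
    // out of the stored JSON has to happen on the client.
    public static int readQty(String jsonColumnValue) throws Exception {
        JsonNode item = MAPPER.readTree(jsonColumnValue);
        return item.get("Qty").asInt();
    }
}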


Approach A is better if you are going to update attributes as it avoids 
reads before writes, which can damage throughput.


A consideration for A is read time when the user has a high 
cardinality of items, resulting in a wide row - in that case reads for 
an item should be tested. Writes tend to be cheap regardless of row size.


Bill

On 19/09/12 13:00, Roshni Rajagopal wrote:

Hi,

There was a conversation on this some time earlier, and to continue it

Suppose I want to associate a user to an item, and I want to also store
3 commonly used attributes without needing to go to an entity item
column family, I have 2 options:

A) use composite columns
UserId1 : {
  <itemid1>:name = Betty Crocker,
  <itemid1>:descr = Cake,
  <itemid1>:Qty = 5,
  <itemid2>:name = Nutella,
  <itemid2>:descr = Choc spread,
  <itemid2>:Qty = 15
}

B) use a json with the data
UserId1 : {
  <itemid1> = {name: Betty Crocker, descr: Cake, Qty: 5},
  <itemid2> = {name: Nutella, descr: Choc spread, Qty: 15}
}

Essentially A is better if one wants to update individual fields, while
B is better if one wants easier paging, reading multiple items at once
in one read, etc. The details are in this discussion thread:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-Modeling-another-question-td7581967.html

I had an additional question,
as it's being said that CQL is the direction in which Cassandra is
moving, and there's a lot of effort in making CQL the standard:

How does approach B work in CQL? Can we read/write a JSON easily in CQL?
Can we extract a field from a JSON in CQL or would that need to be done
via the client code?

Regards,
Roshni




Re: Kundera 2.1 released

2012-09-21 Thread Vivek Mishra
Sorry for the typo, this is the 2.1 release.

-Vivek

On Sat, Sep 22, 2012 at 6:45 AM, Vivek Mishra  wrote:

> Hi All,
>
> We are happy to announce release of Kundera 2.0.7.
>
> Kundera is a JPA 2.0 based, object-datastore mapping library for NoSQL
> datastores. The idea behind Kundera is to make working with NoSQL Databases
> drop-dead simple and fun. It currently supports Cassandra, HBase, MongoDB
> and relational databases.
>
> Major Changes in this release:
> ---
> * Allow user to set specific CQL versioning.
>
> * Batch insert/update for Cassandra/MongoDB/HBase.
>
> * Extended JPA Metamodel/TypedQuery/ProviderUtil implementation.
>
> * Another Thrift client implementation for Cassandra.
>
> * Deprecated support for properties with XML based Column
> family/Table/server specific property configuration for Cassandra, MongoDB
> and HBase.
>
> * Stronger query support:
>  a) JPQL support over all data types and associations.
>  b) JPQL support to query using primary key along with other columns.
>
>  * Fixed github issues:
>
>https://github.com/impetus-opensource/Kundera/issues/90
>https://github.com/impetus-opensource/Kundera/issues/91
>https://github.com/impetus-opensource/Kundera/issues/92
>https://github.com/impetus-opensource/Kundera/issues/93
>https://github.com/impetus-opensource/Kundera/issues/94
>https://github.com/impetus-opensource/Kundera/issues/96
>https://github.com/impetus-opensource/Kundera/issues/98
>https://github.com/impetus-opensource/Kundera/issues/99
>https://github.com/impetus-opensource/Kundera/issues/100
>https://github.com/impetus-opensource/Kundera/issues/101
>https://github.com/impetus-opensource/Kundera/issues/102
>https://github.com/impetus-opensource/Kundera/issues/104
>https://github.com/impetus-opensource/Kundera/issues/106
>https://github.com/impetus-opensource/Kundera/issues/107
>https://github.com/impetus-opensource/Kundera/issues/108
>https://github.com/impetus-opensource/Kundera/issues/109
>https://github.com/impetus-opensource/Kundera/issues/111
>https://github.com/impetus-opensource/Kundera/issues/112
>https://github.com/impetus-opensource/Kundera/issues/116
>
>
> To download, use or contribute to Kundera, visit:
> http://github.com/impetus-opensource/Kundera
>
> Latest released tag version is 2.1. Kundera maven libraries are now
> available at:
> https://oss.sonatype.org/content/repositories/releases/com/impetus and
> http://kundera.googlecode.com/svn/maven2/maven-missing-resources.
>
> Sample codes and examples for using Kundera can be found here:
> http://github.com/impetus-opensource/Kundera-Examples
> and
> https://github.com/impetus-opensource/Kundera/tree/trunk/kundera-tests
>
> Thank you all for your contributions!
>
> Regards,
> Kundera Team.
>
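
For readers new to Kundera, a minimal JPA-style sketch of the usage pattern it 
provides (the entity and the "cassandra_pu" persistence-unit name are 
illustrative assumptions; the Cassandra connection settings live in 
persistence.xml and are omitted here):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;
import javax.persistence.Table;

@Entity
@Table(name = "users")
public class User {
    @Id
    private String userId;

    @Column(name = "name")
    private String name;

    public User() {}

    public User(String userId, String name) {
        this.userId = userId;
        this.name = name;
    }

    public static void main(String[] args) {
        // Standard JPA bootstrap; Kundera is picked up via the persistence unit.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("cassandra_pu");
        EntityManager em = emf.createEntityManager();
        em.persist(new User("u1", "Betty Crocker"));
        User loaded = em.find(User.class, "u1");
        System.out.println(loaded.name);
        em.close();
        emf.close();
    }
}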


Re: Using the commit log for external synchronization

2012-09-21 Thread Brian O'Neill
> IMHO it's a better design to multiplex the data stream at the application
> level.
+1, agreed.

That is where we ended up. (and Storm is proving to be a solid
framework for that)

-brian

On Fri, Sep 21, 2012 at 4:56 AM, aaron morton  wrote:
> The commit log is essentially internal implementation. The total size of the
> commit log is restricted, and the multiple files used to represent segments
> are recycled. So once all the memtables have been flushed for segment it may
> be overwritten.
>
> To archive the segments see the conf/commitlog_archiving.properties file.
>
> Large rows will bypass the commit log.
>
> A write commited to the commit log may still be considered a failure if CL
> nodes do not succeed.
>
> IMHO it's a better design to multiplex the data stream at the application
> level.
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 21/09/2012, at 11:51 AM, Brian O'Neill  wrote:
>
>
> Along those lines...
>
> We sought to use triggers for external synchronization.   If you read
> through this issue:
> https://issues.apache.org/jira/browse/CASSANDRA-1311
>
> You'll see the idea of leveraging a commit log for synchronization, via
> triggers.
>
> We went ahead and implemented this concept in:
> https://github.com/hmsonline/cassandra-triggers
>
> With that, via AOP, you get handed the mutation as things change.  We used
> it for synchronizing SOLR.
>
> fwiw,
> -brian
>
>
>
> On Sep 20, 2012, at 7:18 PM, Michael Kjellman wrote:
>
> +1. Would be a pretty cool feature
>
> Right now I write once to cassandra and once to kafka.
>
> On 9/20/12 4:13 PM, "Data Craftsman 木匠" 
> wrote:
>
> This will be a good new feature. I guess the development team don't
>
> have time on this yet.  ;)
>
>
>
> On Thu, Sep 20, 2012 at 1:29 PM, Ben Hood <0x6e6...@gmail.com> wrote:
>
> Hi,
>
>
> I'd like to incrementally synchronize data written to Cassandra into
>
> an external store without having to maintain an index to do this, so I
>
> was wondering whether anybody is using the commit log to establish
>
> what updates have taken place since a given point in time?
>
>
> Cheers,
>
>
> Ben
>
>
>
>
> --
>
> Thanks,
>
>
> Charlie (@mujiang) 木匠
>
> ===
>
> Data Architect Developer 汉唐 田园牧歌DBA
>
> http://mujiang.blogspot.com
>
>
>
>
>
>
> --
> Brian ONeill
> Lead Architect, Health Market Science (http://healthmarketscience.com)
> mobile:215.588.6024
> blog: http://weblogs.java.net/blog/boneill42/
> blog: http://brianoneill.blogspot.com/
>
>



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
Apache Cassandra MVP
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Using Cassandra BulkOuputFormat With newer versions of Hadoop (.23+)

2012-09-21 Thread Juan Valencia
I can't seem to get bulk loading to work in newer versions of Hadoop.
Since they switched JobContext from a class to an interface,
you lose binary backward compatibility:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found
interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at
org.apache.cassandra.hadoop.BulkOutputFormat.checkOutputSpecs(BulkOutputFormat.java:42)

I tried recompiling against the newer Hadoop, but things got messy fast.
 Has anyone done this?


Re: Using the commit log for external synchronization

2012-09-21 Thread Rob Coli
On Fri, Sep 21, 2012 at 4:31 AM, Ben Hood <0x6e6...@gmail.com> wrote:
> So if I understand you correctly, one shouldn't code against what is
> essentially an internal artefact that could be subject to change as
> the Cassandra code base evolves and furthermore may not contain the
> information an application thinks it should contain.

Pretty much.

> So in summary, given that there is no out of the box way of saying to
> Cassandra "give me all mutations since timestamp X", I would either
> have to go for an event driven approach or reconsider the layout of
> the Cassandra store such that I could reconcile it in an efficient
> fashion.

With :

https://issues.apache.org/jira/browse/CASSANDRA-3690 - "Streaming
CommitLog backup"

You can stream your commitlog off-node as you write it. You can then
restore this commitlog and tell cassandra to replay the commit log
"until" a certain time by using "restore_point_in_time". But...
without :

https://issues.apache.org/jira/browse/CASSANDRA-4392 - "Create a tool
that will convert a commit log into a series of readable CQL
statements"

You are unable to skip bad transactions, so if you want to
roll-forward but skip a TRUNCATE, you are out of luck.

The above gets you most of the way there, but Aaron's point about the
commitlog not reflecting whether the app met its CL remains true. The
possibility that Cassandra might coalesce to a value that the
application does not know was successfully written is one of its known
edge cases...

=Rob

-- 
=Robert Coli
AIM&GTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Is Cassandra right for me?

2012-09-21 Thread Marcelo Elias Del Valle
Thanks a lot! Things are much more clear now.

2012/9/21 Michael Kjellman 

> Brisk is no longer actively developed by the original author or Datastax.
> It was left up for the community.
>
> https://github.com/steeve/brisk
>
> Has a fork that is supposedly compatible with 1.0 API
>
> You're more than welcome to fork that and make it work with 1.1 :)
>
> DSE != (Cassandra + Brisk)
>
> From: Marcelo Elias Del Valle <mvall...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Friday, September 21, 2012 10:27 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Is Cassandra right for me?
>
>
>
> 2012/9/20 aaron morton <aa...@thelastpickle.com>
> Actually, if I use community edition for now, I wouldn't be able to use
> hadoop against data stored in CFS?
> AFAIK DSC is a packaged deployment of Apache Cassandra. You should be able
> to use Hadoop against it, in the same way you can use hadoop against Apache
> Cassandra.
>
> You "can do" anything with computers if you have enough time and patience.
> DSE reduces the amount of time and patience needed to run Hadoop over
> Cassandra. Specifically it helps by providing a HDFS and Hive Meta Store
> that run on Cassandra. This reduces the number of moving parts you need to
> provision.
>
> Can I use BRISK with Apache Cassandra, without changing Brisk or
> Cassandra's code? To the best of my knowledge, DSE uses Brisk, so I am
> afraid of writing hadoop process now and have to change them when I hire
> DSE support.
>
> I am not an expert in the Apache 2.0 license, but in my understanding Data
> Stax modified Apache Cassandra and included modifications to it in the
> version they sell. At the same time I am interested in hiring their
> support, I wanna keep compatibility with the open source version
> distributed in the mainstream, just in case I want to stop hiring their
> support at any time.
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>
>
>
>
>
>


-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: Is Cassandra right for me?

2012-09-21 Thread Michael Kjellman
Brisk is no longer actively developed by the original author or Datastax. It 
was left up for the community.

https://github.com/steeve/brisk

Has a fork that is supposedly compatible with 1.0 API

You're more than welcome to fork that and make it work with 1.1 :)

DSE != (Cassandra + Brisk)

From: Marcelo Elias Del Valle <mvall...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Friday, September 21, 2012 10:27 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Is Cassandra right for me?



2012/9/20 aaron morton <aa...@thelastpickle.com>
Actually, if I use community edition for now, I wouldn't be able to use hadoop 
against data stored in CFS?
AFAIK DSC is a packaged deployment of Apache Cassandra. You should be able to 
use Hadoop against it, in the same way you can use hadoop against Apache 
Cassandra.

You "can do" anything with computers if you have enough time and patience. DSE 
reduces the amount of time and patience needed to run Hadoop over Cassandra. 
Specifically it helps by providing a HDFS and Hive Meta Store that run on 
Cassandra. This reduces the number of moving parts you need to provision.

Can I use BRISK with Apache Cassandra, without changing Brisk or Cassandra's 
code? To the best of my knowledge, DSE uses Brisk, so I am afraid of writing 
hadoop process now and have to change them when I hire DSE support.

I am not an expert in the Apache 2.0 license, but in my understanding Data Stax 
modified Apache Cassandra and included modifications to it in the version they 
sell. At the same time I am interested in hiring their support, I wanna keep 
compatibility with the open source version distributed in the mainstream, just 
in case I want to stop hiring their support at any time.


--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr





Re: Is Cassandra right for me?

2012-09-21 Thread Marcelo Elias Del Valle
2012/9/20 aaron morton 

> Actually, if I use community edition for now, I wouldn't be able to use
> hadoop against data stored in CFS?
>
> AFAIK DSC is a packaged deployment of Apache Cassandra. You should be able
> to use Hadoop against it, in the same way you can use hadoop against Apache
> Cassandra.
>
> You "can do" anything with computers if you have enough time and patience.
> DSE reduces the amount of time and patience needed to run Hadoop over
> Cassandra. Specifically it helps by providing a HDFS and Hive Meta Store
> that run on Cassandra. This reduces the number of moving parts you need to
> provision.
>

Can I use BRISK with Apache Cassandra, without changing Brisk or
Cassandra's code? To the best of my knowledge, DSE uses Brisk, so I am
afraid of writing hadoop process now and have to change them when I hire
DSE support.

I am not an expert in the Apache 2.0 license, but in my understanding Data
Stax modified Apache Cassandra and included modifications to it in the
version they sell. At the same time I am interested in hiring their
support, I wanna keep compatibility with the open source version
distributed in the mainstream, just in case I want to stop hiring their
support at any time.


-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: [problem with OOM in nodes]

2012-09-21 Thread Denis Gabaydulin
And some stuff from log:


/var/log/cassandra$ cat system.log | grep "Compacting large" | grep -E
"[0-9]+ bytes" -o | cut -d " " -f 1 |  awk '{ foo = $1 / 1024 / 1024 ;
print foo "MB" }'  | sort -nr | head -n 50
3821.55MB
3337.85MB
1221.64MB
1128.67MB
930.666MB
916.4MB
861.114MB
843.325MB
711.813MB
706.992MB
674.282MB
673.861MB
658.305MB
557.756MB
531.577MB
493.112MB
492.513MB
492.291MB
484.484MB
479.908MB
465.742MB
464.015MB
459.95MB
454.472MB
441.248MB
428.763MB
424.028MB
416.663MB
416.191MB
409.341MB
406.895MB
397.314MB
388.27MB
376.714MB
371.298MB
368.819MB
366.92MB
361.371MB
360.509MB
356.168MB
355.012MB
354.897MB
354.759MB
347.986MB
344.109MB
335.546MB
329.529MB
326.857MB
326.252MB
326.237MB

Is this a bad sign?

On Fri, Sep 21, 2012 at 8:22 PM, Denis Gabaydulin  wrote:
> Found one more interesting fact.
> As I can see in cfstats, compacted row maximum size: 386857368 !
>
> On Fri, Sep 21, 2012 at 12:50 PM, Denis Gabaydulin  wrote:
>> Reports is a SuperColumnFamily.
>>
>> Each report has a unique identifier (report_id). This is the key of the
>> SuperColumnFamily.
>> And each report is saved in a separate row.
>>
>> A report consists of report rows (may vary between 1 and 50,
>> but most are small).
>>
>> Each report row is saved in separate super column. Hector based code:
>>
>> superCfMutator.addInsertion(
>>   report_id,
>>   "Reports",
>>   HFactory.createSuperColumn(
>> report_row_id,
>> mapper.convertObject(object),
>> columnDefinition.getTopSerializer(),
>> columnDefinition.getSubSerializer(),
>> inferringSerializer
>>   )
>> );
>>
>> We have two frequent operations:
>>
>> 1. count report rows by report_id (calculate number of super columns
>> in the row).
>> 2. get report rows by report_id and range predicate (get super columns
>> from the row with range predicate).
>>
>> I can't see big super columns here :-(
>>
>> On Fri, Sep 21, 2012 at 3:10 AM, Tyler Hobbs  wrote:
>>> I'm not 100% that I understand your data model and read patterns correctly,
>>> but it sounds like you have large supercolumns and are requesting some of
>>> the subcolumns from individual super columns.  If that's the case, the issue
>>> is that Cassandra must deserialize the entire supercolumn in memory whenever
>>> you read *any* of the subcolumns.  This is one of the reasons why composite
>>> columns are recommended over supercolumns.
>>>
>>>
>>> On Thu, Sep 20, 2012 at 6:45 AM, Denis Gabaydulin  wrote:

 p.s. Cassandra 1.1.4

 On Thu, Sep 20, 2012 at 3:27 PM, Denis Gabaydulin 
 wrote:
 > Hi, all!
 >
 > We have a cluster with 7 virtual nodes (disk storage is connected to
 > nodes with iSCSI). The storage schema is:
 >
 > Reports:{
 > 1:{
 > 1:{"value1":"some val", "value2":"some val"},
 > 2:{"value1":"some val", "value2":"some val"}
 > ...
 > },
 > 2:{
 > 1:{"value1":"some val", "value2":"some val"},
 > 2:{"value1":"some val", "value2":"some val"}
 > ...
 > }
 > ...
 > }
 >
 > create keyspace osmp_reports
 >   with placement_strategy = 'SimpleStrategy'
 >   and strategy_options = {replication_factor : 4}
 >   and durable_writes = true;
 >
 > use osmp_reports;
 >
 > create column family QueryReportResult
 >   with column_type = 'Super'
 >   and comparator = 'BytesType'
 >   and subcomparator = 'BytesType'
 >   and default_validation_class = 'BytesType'
 >   and key_validation_class = 'BytesType'
 >   and read_repair_chance = 1.0
 >   and dclocal_read_repair_chance = 0.0
 >   and gc_grace = 432000
 >   and min_compaction_threshold = 4
 >   and max_compaction_threshold = 32
 >   and replicate_on_write = true
 >   and compaction_strategy =
 > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
 >   and caching = 'KEYS_ONLY';
 >
 > =
 >
 > Read/Write CL: 2
 >
 > Most of the reports are small, but some of them could have half a
 > million rows (xml). Typical operations on this dataset are:
 >
 > count report rows by report_id (top level id of super column);
 > get columns (report_rows) by range predicate and limit for given
 > report_id.
 >
 > Data is written once and has never been updated.
 >
 > So, from time to time a couple of nodes crash with an OOM exception. The
 > heap dump says that we have a lot of super columns in memory.
 > For example, I see one of the reports is in memory entirely. How could that
 > be possible? If we don't load the whole report, could cassandra do this for
 > some internal reason?
 >
 > What should we do to avoid OOMs?
>>>
>>>
>>>
>>>
>>> --
>>> Tyler Hobbs
>>> DataStax
>>>
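
For reference, a minimal Hector sketch of the composite-column layout Tyler 
suggests in place of the super column insert above (the column family name and 
component types are illustrative assumptions, not a drop-in replacement for the 
code in this thread):

import me.prettyprint.cassandra.serializers.CompositeSerializer;
import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Composite;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CompositeReportWriter {

    // Each report row field becomes a regular column whose name is a
    // (report_row_id, field) composite, so a slice query can read individual
    // fields without deserializing an entire super column.
    public static void writeField(Keyspace keyspace, long reportId,
                                  long reportRowId, String field, String value) {
        Mutator<Long> mutator = HFactory.createMutator(keyspace, LongSerializer.get());

        Composite name = new Composite();
        name.addComponent(reportRowId, LongSerializer.get());
        name.addComponent(field, StringSerializer.get());

        mutator.addInsertion(reportId, "ReportsComposite",
                HFactory.createColumn(name, value,
                        CompositeSerializer.get(), StringSerializer.get()));
        mutator.execute();
    }
}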


Re: [problem with OOM in nodes]

2012-09-21 Thread Denis Gabaydulin
Found one more interesting fact.
As I can see in cfstats, compacted row maximum size: 386857368 !

On Fri, Sep 21, 2012 at 12:50 PM, Denis Gabaydulin  wrote:
> Reports is a SuperColumnFamily.
>
> Each report has a unique identifier (report_id). This is the key of the
> SuperColumnFamily.
> And each report is saved in a separate row.
>
> A report consists of report rows (may vary between 1 and 50,
> but most are small).
>
> Each report row is saved in separate super column. Hector based code:
>
> superCfMutator.addInsertion(
>   report_id,
>   "Reports",
>   HFactory.createSuperColumn(
> report_row_id,
> mapper.convertObject(object),
> columnDefinition.getTopSerializer(),
> columnDefinition.getSubSerializer(),
> inferringSerializer
>   )
> );
>
> We have two frequent operations:
>
> 1. count report rows by report_id (calculate number of super columns
> in the row).
> 2. get report rows by report_id and range predicate (get super columns
> from the row with range predicate).
>
> I can't see big super columns here :-(
>
> On Fri, Sep 21, 2012 at 3:10 AM, Tyler Hobbs  wrote:
>> I'm not 100% sure that I understand your data model and read patterns correctly,
>> but it sounds like you have large supercolumns and are requesting some of
>> the subcolumns from individual super columns.  If that's the case, the issue
>> is that Cassandra must deserialize the entire supercolumn in memory whenever
>> you read *any* of the subcolumns.  This is one of the reasons why composite
>> columns are recommended over supercolumns.
>>
>>
>> On Thu, Sep 20, 2012 at 6:45 AM, Denis Gabaydulin  wrote:
>>>
>>> p.s. Cassandra 1.1.4
>>>
>>> On Thu, Sep 20, 2012 at 3:27 PM, Denis Gabaydulin 
>>> wrote:
>>> > Hi, all!
>>> >
>>> > We have a cluster with 7 virtual nodes (disk storage is connected to
>>> > nodes with iSCSI). The storage schema is:
>>> >
>>> > Reports:{
>>> > 1:{
>>> > 1:{"value1":"some val", "value2":"some val"},
>>> > 2:{"value1":"some val", "value2":"some val"}
>>> > ...
>>> > },
>>> > 2:{
>>> > 1:{"value1":"some val", "value2":"some val"},
>>> > 2:{"value1":"some val", "value2":"some val"}
>>> > ...
>>> > }
>>> > ...
>>> > }
>>> >
>>> > create keyspace osmp_reports
>>> >   with placement_strategy = 'SimpleStrategy'
>>> >   and strategy_options = {replication_factor : 4}
>>> >   and durable_writes = true;
>>> >
>>> > use osmp_reports;
>>> >
>>> > create column family QueryReportResult
>>> >   with column_type = 'Super'
>>> >   and comparator = 'BytesType'
>>> >   and subcomparator = 'BytesType'
>>> >   and default_validation_class = 'BytesType'
>>> >   and key_validation_class = 'BytesType'
>>> >   and read_repair_chance = 1.0
>>> >   and dclocal_read_repair_chance = 0.0
>>> >   and gc_grace = 432000
>>> >   and min_compaction_threshold = 4
>>> >   and max_compaction_threshold = 32
>>> >   and replicate_on_write = true
>>> >   and compaction_strategy =
>>> > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>>> >   and caching = 'KEYS_ONLY';
>>> >
>>> > =
>>> >
>>> > Read/Write CL: 2
>>> >
>>> > Most of the reports are small, but some of them could have half a
>>> > million rows (xml). Typical operations on this dataset are:
>>> >
>>> > count report rows by report_id (top level id of super column);
>>> > get columns (report_rows) by range predicate and limit for given
>>> > report_id.
>>> >
>>> > Data is written once and has never been updated.
>>> >
>>> > So, from time to time a couple of nodes crash with an OOM exception. The
>>> > heap dump says that we have a lot of super columns in memory.
>>> > For example, I see one of the reports is in memory entirely. How could that
>>> > be possible? If we don't load the whole report, could cassandra do this for
>>> > some internal reason?
>>> >
>>> > What should we do to avoid OOMs?
>>
>>
>>
>>
>> --
>> Tyler Hobbs
>> DataStax
>>


Re: Row caches

2012-09-21 Thread Tyler Hobbs
The cache metrics for Cassandra 1.1 are currently broken in OpsCenter, but
it's something we should be able to fix soon.  You can also use nodetool
cfstats to check the cache hit rates.

On Fri, Sep 21, 2012 at 5:34 AM, rohit reddy wrote:

> Hi,
>
> I have enabled the row caches by using
> nodetool setcachecapacity <keyspace> <cfname> <keycachecapacity> <rowcachecapacity>.
>
> But when i look at the cfstats.. i'm not getting any cache stats like
>
> Row cache capacity:
> Row cache size:
>
> These properties are not reflected, nor am I getting any cache hit rates in 
> OpsCenter.
>
> Do i need to restart the server or am i missing anything?
>
> Thanks
> Rohit
>
> On Fri, Sep 21, 2012 at 11:29 AM, rohit reddy 
> wrote:
>
>> Got it. Thanks for the replies
>>
>>
>> On Fri, Sep 21, 2012 at 6:30 AM, aaron morton wrote:
>>
>>> Set the caching attribute for the CF. It defaults to keys_only, other
>>> values are both or rows_only.
>>>
>>> See http://www.datastax.com/dev/blog/caching-in-cassandra-1-1
>>>
>>> Cheers
>>>
>>>   -
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 19/09/2012, at 1:34 PM, Jason Wee  wrote:
>>>
>> which version is that? In version 1.1.2, nodetool does take the column
>> family.
>>>
>> setcachecapacity <keyspace> <cfname> <keycachecapacity> <rowcachecapacity>
>>  - Set the key and row cache capacities of a given column
>> family
>>>
>>> On Wed, Sep 19, 2012 at 2:15 AM, rohit reddy >> > wrote:
>>>
 Hi,

 Is it possible to enable row cache per column family after the column
 family is created.

 *nodetool setcachecapacity* does not take the column family as input.

 Thanks
 Rohit

>>>
>>>
>>>
>>
>


-- 
Tyler Hobbs
DataStax 


Re: Using the commit log for external synchronization

2012-09-21 Thread Ben Hood
Hi Aaron,

Thanks for your input.

On Fri, Sep 21, 2012 at 9:56 AM, aaron morton  wrote:
> The commit log is essentially internal implementation. The total size of the
> commit log is restricted, and the multiple files used to represent segments
> are recycled. So once all the memtables have been flushed for segment it may
> be overwritten.
>
> To archive the segments see the conf/commitlog_archiving.properties file.
>
> Large rows will bypass the commit log.
>
> A write commited to the commit log may still be considered a failure if CL
> nodes do not succeed.

So if I understand you correctly, one shouldn't code against what is
essentially an internal artefact that could be subject to change as
the Cassandra code base evolves and furthermore may not contain the
information an application thinks it should contain.

> IMHO it's a better design to multiplex the data stream at the application
> level.

That's a fair point, and I could multicast the data at that level. The
reason why I was considering querying the commit log was because I
would prefer to implement a state based synchronization as opposed to
an event driven synchronization (which is what the app layer multicast
and the AOP solution Brian suggested would be). This is because I'd
rather know from Cassandra what Cassandra thinks it has got, rather
than trusting an event stream that can only infer what information
Cassandra should theoretically hold. The use case I am looking at
should be reconcilable and hence I'm trying to avoid placing trust in
the fact that all of the events were actually sent correctly, arrived
correctly and were written to the target storage without any bugs. I
also want to detect the scenario that portions of the data that was
written to the target system gets accidentally updated or nuked via a
back door.

So in summary, given that there is no out of the box way of saying to
Cassandra "give me all mutations since timestamp X", I would either
have to go for an event driven approach or reconsider the layout of
the Cassandra store such that I could reconcile it in an efficient
fashion.

Thanks for your help,

Cheers,

Ben


Re: Row caches

2012-09-21 Thread rohit reddy
Hi,

I have enabled the row caches by using
nodetool setcachecapacity <keyspace> <cfname> <keycachecapacity> <rowcachecapacity>.

But when i look at the cfstats.. i'm not getting any cache stats like

Row cache capacity:
Row cache size:

These properties are not reflected, nor am I getting any cache hit rates
in OpsCenter.

Do i need to restart the server or am i missing anything?

Thanks
Rohit

On Fri, Sep 21, 2012 at 11:29 AM, rohit reddy wrote:

> Got it. Thanks for the replies
>
>
> On Fri, Sep 21, 2012 at 6:30 AM, aaron morton wrote:
>
>> Set the caching attribute for the CF. It defaults to keys_only, other
>> values are both or rows_only.
>>
>> See http://www.datastax.com/dev/blog/caching-in-cassandra-1-1
>>
>> Cheers
>>
>>   -
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 19/09/2012, at 1:34 PM, Jason Wee  wrote:
>>
>> which version is that? In version 1.1.2, nodetool does take the column
>> family.
>>
>> setcachecapacity <keyspace> <cfname> <keycachecapacity> <rowcachecapacity>
>>  - Set the key and row cache capacities of a given column
>> family
>>
>> On Wed, Sep 19, 2012 at 2:15 AM, rohit reddy 
>> wrote:
>>
>>> Hi,
>>>
>>> Is it possible to enable row cache per column family after the column
>>> family is created.
>>>
>>> *nodetool setcachecapacity* does not take the column family as input.
>>>
>>> Thanks
>>> Rohit
>>>
>>
>>
>>
>


Re: Losing keyspace on cassandra upgrade

2012-09-21 Thread Thomas Stets
On Fri, Sep 21, 2012 at 10:39 AM, aaron morton wrote:

> Have you tried nodetool resetlocalschema on the 1.1.5 ?
>

Yes, I tried a resetlocalschema, and a repair. This didn't change anything.

BTW I could find no documentation on what resetlocalschema actually does...


  regards, Thomas


Re: Using the commit log for external synchronization

2012-09-21 Thread aaron morton
The commit log is essentially internal implementation. The total size of the 
commit log is restricted, and the multiple files used to represent segments are 
recycled. So once all the memtables have been flushed for segment it may be 
overwritten. 

To archive the segments see the conf/commitlog_archiving.properties file. 

Large rows will bypass the commit log. 

A write commited to the commit log may still be considered a failure if CL 
nodes do not succeed. 

IMHO it's a better design to multiplex the data stream at the application 
level.   
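
A rough sketch of that pattern, Hector-flavoured (the MutationSink interface 
and column family name are made-up placeholders, not an existing API):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class MultiplexingWriter {

    // Stand-in for Kafka, SOLR, or any other downstream consumer.
    public interface MutationSink {
        void publish(String rowKey, String column, String value);
    }

    private final Keyspace keyspace;
    private final MutationSink sink;

    public MultiplexingWriter(Keyspace keyspace, MutationSink sink) {
        this.keyspace = keyspace;
        this.sink = sink;
    }

    public void write(String rowKey, String column, String value) {
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.addInsertion(rowKey, "MyColumnFamily",
                HFactory.createStringColumn(column, value));
        mutator.execute(); // throws if the write fails at the requested consistency level

        // Only forward mutations Cassandra has acknowledged.
        sink.publish(rowKey, column, value);
    }
}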
 
Hope that helps.

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 21/09/2012, at 11:51 AM, Brian O'Neill  wrote:

> 
> Along those lines...
> 
> We sought to use triggers for external synchronization.   If you read through 
> this issue:
> https://issues.apache.org/jira/browse/CASSANDRA-1311
> 
> You'll see the idea of leveraging a commit log for synchronization, via 
> triggers.
> 
> We went ahead and implemented this concept in:
> https://github.com/hmsonline/cassandra-triggers
> 
> With that, via AOP, you get handed the mutation as things change.  We used it 
> for synchronizing SOLR.  
> 
> fwiw,
> -brian
> 
> 
> 
> On Sep 20, 2012, at 7:18 PM, Michael Kjellman wrote:
> 
>> +1. Would be a pretty cool feature
>> 
>> Right now I write once to cassandra and once to kafka.
>> 
>> On 9/20/12 4:13 PM, "Data Craftsman 木匠" 
>> wrote:
>> 
>>> This will be a good new feature. I guess the development team don't
>>> have time on this yet.  ;)
>>> 
>>> 
>>> On Thu, Sep 20, 2012 at 1:29 PM, Ben Hood <0x6e6...@gmail.com> wrote:
 Hi,
 
 I'd like to incrementally synchronize data written to Cassandra into
 an external store without having to maintain an index to do this, so I
 was wondering whether anybody is using the commit log to establish
 what updates have taken place since a given point in time?
 
 Cheers,
 
 Ben
>>> 
>>> 
>>> 
>>> -- 
>>> Thanks,
>>> 
>>> Charlie (@mujiang) 木匠
>>> ===
>>> Data Architect Developer 汉唐 田园牧歌DBA
>>> http://mujiang.blogspot.com
>> 
>> 
>> 
>> 
> 
> -- 
> Brian ONeill
> Lead Architect, Health Market Science (http://healthmarketscience.com)
> mobile:215.588.6024
> blog: http://weblogs.java.net/blog/boneill42/
> blog: http://brianoneill.blogspot.com/
> 



Re: [problem with OOM in nodes]

2012-09-21 Thread Denis Gabaydulin
Reports is a SuperColumnFamily.

Each report has a unique identifier (report_id). This is the key of the
SuperColumnFamily.
And each report is saved in a separate row.

A report consists of report rows (may vary between 1 and 50,
but most are small).

Each report row is saved in separate super column. Hector based code:

superCfMutator.addInsertion(
  report_id,
  "Reports",
  HFactory.createSuperColumn(
report_row_id,
mapper.convertObject(object),
columnDefinition.getTopSerializer(),
columnDefinition.getSubSerializer(),
inferringSerializer
  )
);

We have two frequent operations:

1. count report rows by report_id (calculate number of super columns
in the row).
2. get report rows by report_id and range predicate (get super columns
from the row with range predicate).

I can't see big super columns here :-(

On Fri, Sep 21, 2012 at 3:10 AM, Tyler Hobbs  wrote:
> I'm not 100% sure that I understand your data model and read patterns correctly,
> but it sounds like you have large supercolumns and are requesting some of
> the subcolumns from individual super columns.  If that's the case, the issue
> is that Cassandra must deserialize the entire supercolumn in memory whenever
> you read *any* of the subcolumns.  This is one of the reasons why composite
> columns are recommended over supercolumns.
>
>
> On Thu, Sep 20, 2012 at 6:45 AM, Denis Gabaydulin  wrote:
>>
>> p.s. Cassandra 1.1.4
>>
>> On Thu, Sep 20, 2012 at 3:27 PM, Denis Gabaydulin 
>> wrote:
>> > Hi, all!
>> >
>> > We have a cluster with 7 virtual nodes (disk storage is connected to
>> > nodes with iSCSI). The storage schema is:
>> >
>> > Reports:{
>> > 1:{
>> > 1:{"value1":"some val", "value2":"some val"},
>> > 2:{"value1":"some val", "value2":"some val"}
>> > ...
>> > },
>> > 2:{
>> > 1:{"value1":"some val", "value2":"some val"},
>> > 2:{"value1":"some val", "value2":"some val"}
>> > ...
>> > }
>> > ...
>> > }
>> >
>> > create keyspace osmp_reports
>> >   with placement_strategy = 'SimpleStrategy'
>> >   and strategy_options = {replication_factor : 4}
>> >   and durable_writes = true;
>> >
>> > use osmp_reports;
>> >
>> > create column family QueryReportResult
>> >   with column_type = 'Super'
>> >   and comparator = 'BytesType'
>> >   and subcomparator = 'BytesType'
>> >   and default_validation_class = 'BytesType'
>> >   and key_validation_class = 'BytesType'
>> >   and read_repair_chance = 1.0
>> >   and dclocal_read_repair_chance = 0.0
>> >   and gc_grace = 432000
>> >   and min_compaction_threshold = 4
>> >   and max_compaction_threshold = 32
>> >   and replicate_on_write = true
>> >   and compaction_strategy =
>> > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>> >   and caching = 'KEYS_ONLY';
>> >
>> > =
>> >
>> > Read/Write CL: 2
>> >
>> > Most of the reports are small, but some of them could have half a
>> > million rows (xml). Typical operations on this dataset are:
>> >
>> > count report rows by report_id (top level id of super column);
>> > get columns (report_rows) by range predicate and limit for given
>> > report_id.
>> >
>> > Data is written once and has never been updated.
>> >
>> > So, from time to time a couple of nodes crash with an OOM exception. The
>> > heap dump says that we have a lot of super columns in memory.
>> > For example, I see one of the reports is in memory entirely. How could that
>> > be possible? If we don't load the whole report, could cassandra do this for
>> > some internal reason?
>> >
>> > What should we do to avoid OOMs?
>
>
>
>
> --
> Tyler Hobbs
> DataStax
>


Re: Losing keyspace on cassandra upgrade

2012-09-21 Thread aaron morton
Have you tried nodetool resetlocalschema on the 1.1.5 ?

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/09/2012, at 11:41 PM, Thomas Stets  wrote:

> A follow-up:
> 
> Currently I'm back on version 1.1.1.
> 
> I tried - unsuccessfully - the following things:
> 
> 1. Create the missing keyspace on the 1.1.5 node, then copy the files back 
> into the data directory.
> This failed, since the keyspace was already known on the other node in the 
> cluster.
> 
> 2. shut down the 1.1.1 node, that still has the keyspace. Then create the 
> keyspace on the 1.1.5 node.
> This failed since the node could not distribute the information through the 
> cluster.
> 
> 3. Restore the system keyspace from the snapshot I made before the upgrade.
> The restore seemed to work, but the node behaved just like after the update: 
> it just forgot my keyspace.
> 
> Right now I'm at a loss on how to proceed. Any ideas? I'm pretty sure I can 
> reproduce the problem,
> so if anyone has an idea on what to try, or where to look, I can do some 
> tests (within limits)
> 
> 
> On Wed, Sep 19, 2012 at 4:43 PM, Thomas Stets  wrote:
> I consistently keep losing my keyspace on upgrading from cassandra 1.1.1 to 
> 1.1.5
> 
> I have the same cassandra keyspace on all our staging systems:
> 
> development:  a 3-node cluster
> integration: a 3-node cluster
> QS: a 2-node cluster
> (productive will be a 4-node cluster, which is as yet not active)
> 
> All clusters were running cassandra 1.1.1. Before going productive I wanted 
> to upgrade to the
> latest productive version of cassandra.
> 
> In all cases my keyspace disappeared when I started the cluster with 
> cassandra 1.1.5.
> On the development system I didn't realize at first what was happening. I 
> just wondered why nodetool 
> showed a very low amount of data. On integration I saw the problem quickly, 
> but could not recover the
> data. I re-installed the cassandra cluster from scratch, and populated it 
> with our test data, so our
> developers could work.
>  ...  
> 
> 
>   TIA, Thomas
>