Re: Sharing Cassandra with Solandra

2011-06-29 Thread AJ

On 6/27/2011 3:39 PM, David Strauss wrote:

On Mon, 2011-06-27 at 15:06 -0600, AJ wrote:

Would anyone care to talk about their experiences with using Solandra
alongside another application that uses Cassandra (also on the same
node)?  I'm curious about any resource contention issues or
compatibility between C* versions and Sol.  Also, I read somewhere that
the developer said you have to run Solandra on every C* node in the
ring.  I'm not sure if I interpreted that correctly.  Also, what's the
index size to data size ratio to expect (ballpark)?  How does it
perform?  Any caveats?

We're currently keeping the clusters separate at Pantheon Systems
because our core API (which runs on standard Cassandra) is often ready
for the next Cassandra version at a different time than Solandra.
Solandra recently gained dual 0.7/0.8 support, but we're still opting to
use the version of Cassandra that Solandra is primarily being built and
tested on (which is currently 0.8).


Thanks.  But, I'm finally cluing in that Solandra is also developed by 
DataStax, so I feel safer about future compatibility.


Ec2 snitch with network topology strategy

2011-06-29 Thread pankajsoni0126
I was thinking of leveraging the EC2 snitch. But my question is then: how do I
give replica placement options?

Or can I set the snitch to ec2snitch, list the nodes in
cassandra-topology.prop, and give network topology strategy as the locator
strategy when creating the keyspace? Will that work?

And for those who are struggling to deploy Cassandra across EC2 regions:

1. One approach is to use Milind's patch. It works but has some limitations.
https://issues.apache.org/jira/browse/CASSANDRA-2362
2. OpenVPN is a good option, but it is nevertheless redundant given the
encryption available in Cassandra 0.8.0.
3. Vijay has come up with a patch, and in my testing so far I have not seen
any hiccups.
https://issues.apache.org/jira/browse/CASSANDRA-2452 - it's marked for the
0.8.2 release.


-pankaj




--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Ec2-snitch-with-network-topology-strategy-tp6528188p6528188.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


question on capacity planning

2011-06-29 Thread Jacob, Arun
If I'm planning to store 20TB of new data per week, and expire all data every 2 
weeks, with a replication factor of 3, do I only need approximately 120 TB of 
disk? I'm going to use TTLs on my column values to automatically expire data. Or 
would I need more capacity to handle sstable merges? Given this amount of data, 
would you recommend node storage at 2TB per node or more? This application will 
have a heavy-write/moderate-read usage profile.

-- Arun


Re: Ec2 snitch with network topology strategy

2011-06-29 Thread pankaj soni
Hmm... Just tested the config. It works, got confused with the options, my
bad.

On Wed, Jun 29, 2011 at 2:26 PM, pankajsoni0126 pankajsoni0...@gmail.com wrote:

 I was thinking of leveraging the EC2 snitch. But my question is then: how do I
 give replica placement options?

 Or can I set the snitch to ec2snitch, list the nodes in
 cassandra-topology.prop, and give network topology strategy as the locator
 strategy when creating the keyspace? Will that work?

 And for those who are struggling to deploy Cassandra across EC2 regions:

 1. One approach is to use Milind's patch. It works but has some limitations.
 https://issues.apache.org/jira/browse/CASSANDRA-2362
 2. OpenVPN is a good option, but it is nevertheless redundant given the
 encryption available in Cassandra 0.8.0.
 3. Vijay has come up with a patch, and in my testing so far I have not seen
 any hiccups.
 https://issues.apache.org/jira/browse/CASSANDRA-2452 - it's marked for the
 0.8.2 release.


 -pankaj







Cannot set column value to zero

2011-06-29 Thread dnallsopp
I had a strange problem recently where I was unable to set the value of a column
to '0' (it always returned '1') but setting it to other values worked fine:

[default@Test] set Urls['rowkey']['status']='1';
Value inserted.
[default@Test] get Urls['rowkey'];
= (column=status, value=1, timestamp=1309189541891000)
Returned 1 results.

[default@Test] set Urls['rowkey']['status']='0';
Value inserted.
[default@Test] get Urls['rowkey'];
= (column=status, value=1, timestamp=1309189551407616)
Returned 1 results.

This was on a one-node test cluster (v0.7.6) with no other clients; setting
other values (e.g. '9') worked fine. However, attempting to set the value back
to '0' always resulted in a value of '1'.

I noticed this shortly after truncating the CF.

The column family definition is shown below. One thing that looks odd is that
on other test clusters the Column Name is followed by a reference to
the index, e.g. Column Name: status (737461747573) - but here it isn't.

I was wondering if there was some interaction between truncating the CF and the
use of a KEYS index? (Presumably it would be safer to delete all data
directories in order to wipe the cluster during experimentation, rather than
truncating?)

Unfortunately I'm not sure how to recreate the situation as this was a test
machine on which I played around with various configurations - but maybe
someone has seen a similar problem elsewhere? In the end I had to wipe the data
and start again, and all seemed fine, although the index reference is still
absent as mentioned above.

[default@Test] describe keyspace;
Keyspace: Test:
...
ColumnFamily: Foo
  default_validation_class: org.apache.cassandra.db.marshal.BytesType
  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period in seconds: 0.0/0
  Key cache size / save period in seconds: 0.0/14400
  Memtable thresholds: 0.5/128/60 (millions of ops/minutes/MB)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Built indexes: [Foo.737461747573]
  Column Metadata:
    Column Name: status
      Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Index Type: KEYS
...


This message was sent using IMP, the Internet Messaging Program.

This email and any attachments to it may be confidential and are
intended solely for the use of the individual to whom it is addressed.
If you are not the intended recipient of this email, you must neither
take any action based upon its contents, nor copy or show it to anyone.
Please contact the sender if you believe you have received this email in
error. QinetiQ may monitor email traffic data and also the content of
email for the purposes of security. QinetiQ Limited (Registered in
England & Wales: Company Number: 3796233) Registered office: Cody Technology 
Park, Ively Road, Farnborough, Hampshire, GU14 0LX http://www.qinetiq.com.


Re: question on capacity planning

2011-06-29 Thread Ryan King
On Wed, Jun 29, 2011 at 5:36 AM, Jacob, Arun arun.ja...@disney.com wrote:
 if I'm planning to store 20TB of new data per week, and expire all data
 every 2 weeks, with a replication factor of 3, do I only need approximately
 120 TB of disk? I'm going to use ttl in my column values to automatically
 expire data. Or would I need more capacity to handle sstable merges? Given
 this amount of data, would you recommend node storage at 2TB per node or
 more? This application will have a heavy write /moderate read use profile.

You'll need extra space for both compaction and the overhead in the
storage format.

As to the amount of storage per node, that depends on your latency and
throughput requirements.

-ryan
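
A back-of-the-envelope sketch of the arithmetic in this thread, in Python. The 20TB/week, 2-week expiry and RF=3 figures come from the question; the `overhead_factor` and `compaction_headroom` values are purely illustrative assumptions (the thread gives no numbers for storage-format overhead or compaction space), so treat the padded result as a placeholder to be replaced with measured values.

```python
def required_disk_tb(weekly_ingest_tb, retention_weeks, replication_factor,
                     overhead_factor=1.0, compaction_headroom=0.0):
    """Estimate cluster-wide disk in TB.

    overhead_factor: multiplier for storage-format overhead (assumed value).
    compaction_headroom: fraction of disk kept free for compaction (assumed value).
    """
    live_tb = weekly_ingest_tb * retention_weeks * replication_factor
    on_disk_tb = live_tb * overhead_factor
    # If a fraction h of the disk must stay free, usable space is (1 - h).
    return on_disk_tb / (1.0 - compaction_headroom)


# Figures from the question: 20TB/week, expire after 2 weeks, RF=3.
baseline = required_disk_tb(20, 2, 3)
print(baseline)  # 120.0

# Hypothetical padding: 20% format overhead, half of each disk kept free.
padded = required_disk_tb(20, 2, 3, overhead_factor=1.2, compaction_headroom=0.5)
print(round(padded))
```

The baseline reproduces the 120 TB in the question; the padded figure only shows how quickly the requirement grows once overhead and compaction space are counted.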


Data storage security

2011-06-29 Thread A J
Are there any options to encrypt the column families when they are
stored in the database? Say in a given keyspace some CF has sensitive
info and I don't want a 'select *' of that CF to lay out the data in
plain text.

Thanks.


Re: Data storage security

2011-06-29 Thread Eric tamme
On Wed, Jun 29, 2011 at 12:37 PM, A J s5a...@gmail.com wrote:
 Are there any options to encrypt the column families when they are
 stored in the database. Say in a given keyspace some CF has sensitive
 info and I don't want a 'select *' of that CF to layout the data in
 plain text.

 Thanks.


I think this is an application layer issue - just encrypt/decrypt
there.  The data stored within the column value can be any arbitrary
bytes, and since column data is not indexed it won't affect how you can
access the data with Cassandra in any way.

-Eric


Re: custom reconciling columns? (improve performance of long rows )

2011-06-29 Thread Yang
I hacked around the code. At first I thought that the cost of the map put and
get was due to synchronization, so I tried replacing ConcurrentSkipListMap
with TreeMap. I created a subclass of ColumnFamily and used the subclass only
in the pure read path: interestingly, on the read path no more than one thread
accesses the returned CF at any time, so we can remove the concurrency control.
But it did not offer any significant change in speed.

Then I tried changing TreeMap to HashMap; this time, it uses only half the
time. But the problem is how to keep the output sorted. Doing a sort on
every return is going to be even slower...




On Tue, Jun 28, 2011 at 10:07 PM, Yang tedd...@gmail.com wrote:

 btw I use only one box now just because I'm running it on dev junit test,
 not that it's going to be that way in production


 On Tue, Jun 28, 2011 at 10:06 PM, Yang tedd...@gmail.com wrote:

 ok, here is the profiling result. I think this is consistent (having been
 trying to recover how to effectively use yourkit ...)  see attached picture

 since I actually do not use the thrift interface, but just directly use
 the thrift.CassandraServer and run my code in the same JVM as cassandra,
 and was running the whole thing on a single box, there is no message
 serialization/deserialization cost. but more columns did add on to more
 time.

 the time was spent in the ConcurrentSkipListMap operations that implement
 the memtable.


 Regarding breaking up the row, I'm not sure it would reduce my run time,
 since our requirement is to read the entire rolling window history (we
 already have the TTL enabled, so the history is limited to a certain length,
 but it is quite long: over 1000, and in some cases 5000 or more).  I think
 accessing roughly 1000 items is not an uncommon requirement for many
 applications. In our case, each column has about 30 bytes of data, besides
 the metadata such as TTL and timestamp.
 At a history length of 3000, the read takes about 12ms (remember this is
 completely in-memory, no disk access).

 I just took a look at the expiring column logic. It looks like the
 expiration does not come into play until
 CassandraServer.internal_get()===thriftifyColumns() gets called, so the
 above memtable access time is still spent. Yes, then breaking up the row is
 going to be helpful, but only to the degree of preventing access to
 expired columns (btw, it would be nicer if this were actually built into the
 Cassandra code, so that instead of spending multiple key lookups, I locate
 the row once, and then within the row there are different generation buckets,
 so the old generation buckets that are beyond expiration are not read).
 Currently just accessing the 3000 live columns is already quite slow.

 I'm trying to see whether there are some easy magic bullets for a drop-in
 replacement for concurrentSkipListMap...

 Yang




 On Tue, Jun 28, 2011 at 4:18 PM, Nate McCall n...@datastax.com wrote:

 I agree with Aaron's suggestion on data model and query here. Since
 there is a time component, you can split the row on a fixed duration
 for a given user, so the row key would become userId_[timestamp
 rounded to day].

 This provides you an easy way to roll up the information for the date
 ranges you need since the key suffix can be created without a read.
 This also benefits from spreading the read load over the cluster
 instead of just the replicas since you have 30 rows in this case
 instead of one.
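
 A minimal Python sketch of the day-bucketed row key idea described above. The
 key shape userId_[day] follows the message; the YYYYMMDD date format and the
 helper names are assumptions made for illustration only.

```python
from datetime import date, timedelta


def row_key(user_id, day):
    """Row key of the form userId_<day>; the YYYYMMDD format is an assumption."""
    return f"{user_id}_{day.strftime('%Y%m%d')}"


def row_keys_for_window(user_id, end_day, days):
    """Keys for the `days` most recent daily buckets ending at end_day, oldest first.

    The keys are computed from the date alone, so no read is needed to find them.
    """
    return [row_key(user_id, end_day - timedelta(days=offset))
            for offset in range(days - 1, -1, -1)]


keys = row_keys_for_window("user42", date(2011, 6, 29), 30)
print(keys[0], keys[-1], len(keys))  # user42_20110531 user42_20110629 30
```

 A month of history then becomes 30 keys that can be fetched with a multiget,
 and buckets older than the window are simply never requested.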

 On Tue, Jun 28, 2011 at 5:55 PM, aaron morton aa...@thelastpickle.com
 wrote:
  Can you provide some more info:
  - how big are the rows, e.g. number of columns and column size  ?
  - how much data are you asking for ?
  - what sort of read query are you using ?
  - what sort of numbers are you seeing ?
  - are you deleting columns or using TTL ?
  I would consider issues with the data churn, data model and query before
  looking at serialisation.
  Cheers
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 29 Jun 2011, at 10:37, Yang wrote:
 
  I can see that as my user history grows, the read time grows proportionally
  (or faster than linearly).
  If my business requirements ask me to keep a month's history for each user,
  it could become too slow. I was suspecting that it's actually the
  serializing and deserializing that's taking time (I can definitely see that
  it's CPU bound).
 
 
  On Tue, Jun 28, 2011 at 3:04 PM, aaron morton aa...@thelastpickle.com
  wrote:
 
  There is no facility to do custom reconciliation for a column. An append
  style operation would run into many of the same problems as the Counter
  type, e.g. not every node may get an append and there is a chance for lost
  appends unless you go to all the trouble Counters do.
 
  I would go with using a row for the user and columns for each item. Then
  you can have fast no look writes.
 
  What problems are you seeing with the reads ?
 
  

hadoop results

2011-06-29 Thread William Oberman
I'll start with my question: given a CF with comparator TimeUUIDType, what
is the most efficient way to get the greatest column's value?

Context: I've been running cassandra for a couple of months now, so
obviously it's time to start layering more on top :-)  In my test
environment, I managed to get pig/hadoop running, and developed a few
scripts to collect metrics I've been missing since I switched from MySQL to
cassandra (including the ever useful select count(*) from table
equivalent).

I was hoping to dump the results of this processing back into cassandra for
use in other tools/processes.  My initial thought was: new CF called stats
with comparator TimeUUIDType.  The basic idea being I'd store:
stat_name -> time stat was computed (as UUID) -> value
That way I can also see a historical perspective of any given stat for
auditing (and for cumulative stats to see trends).  The stat_name itself is
a URI that is composed of what and any constraints on the what
(including an optional time range, if the stat supports it).  E.g.
ClassOfSomething/ID/MetricName/OptionalTimeRange (or something, still
deciding on the format of the URI).  But, right now, the only way I know to
get the current stat value would be to iterate over all columns (the
TimeUUIDs) and then return the last one.

Thanks for any tips,

will


CQL injection attacks?

2011-06-29 Thread dnallsopp

Someone asked a while ago whether Cassandra was vulnerable to injection attacks:

http://stackoverflow.com/questions/5998838/nosql-injection-php-phpcassa-cassandra

With Thrift, the answer was 'no'.

With CQL, presumably the situation is different, at least until prepared
statements are possible (CASSANDRA-2475) ?

Has there been any discussion on this already that someone could point me to,
please? I couldn't see anything on JIRA (searching for CQL AND injection, CQL
AND security, etc).

Thanks.




RE: RAID or no RAID

2011-06-29 Thread Jeremiah Jordan
With multiple data dirs you are still limited by the space free on any
one drive.  So if you have two data dirs with 40GB free on each, and you
have 50GB to be compacted, it won't work, but if you had a RAID, you
would have 80GB free and could compact... 
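
The arithmetic in that example can be sketched in a few lines of Python. This is only a model of the reasoning in the message: it assumes, as stated there, that the compaction output has to fit on a single volume, and the function name is made up for illustration.

```python
def can_compact(free_gb_per_volume, compaction_gb):
    """True if any single volume has room for the compaction output."""
    return any(free >= compaction_gb for free in free_gb_per_volume)


separate_dirs = [40, 40]         # two data dirs, 40GB free on each
striped = [sum(separate_dirs)]   # one RAID volume pooling the same free space

print(can_compact(separate_dirs, 50))  # False
print(can_compact(striped, 50))        # True
```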

-Original Message-
From: mcasandra [mailto:mohitanch...@gmail.com] 
Sent: Tuesday, June 28, 2011 7:55 PM
To: cassandra-u...@incubator.apache.org
Subject: Re: RAID or no RAID


aaron morton wrote:
 
 Not sure what the intended purpose is, but we've mostly used it as an

 emergency disk-capacity-increase option
 
 Thats what I've used it for.  
 
 Cheers
 

How does compaction work in terms of utilizing multiple data dirs? Also,
is there a reference on wiki somewhere that says not to use multiple
data dirs?


--
View this message in context:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/RAID-or-no-RAID-tp6522904p6527219.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive
at Nabble.com.


Chunking if size > 64MB

2011-06-29 Thread A J
From what I read, Cassandra allows a single column value to be up to
2GB but would chunk the data if greater than 64MB.
Is the chunking transparent to the application, or does the app need to
know if/how/when the chunking happened for a specific column value
that happened to be > 64MB?

Thank you.


api to extract gossiper results

2011-06-29 Thread A J
Cassandra uses an accrual failure detector to interpret gossip.
Is it somehow possible to extract these (gossip values and results of
the failure detector) into an external system?

Thanks


Cassandra client loses connectivity to cluster

2011-06-29 Thread Jim Ancona
In reviewing client logs as part of our Cassandra testing, I noticed
several Hector All host pools marked down exceptions in the logs.
Further investigation showed a consistent pattern of
java.net.SocketException: Broken pipe and java.net.SocketException:
Connection reset messages. These errors occur for all 36 hosts in the
cluster over a period of seconds, as Hector tries to find a working
host to connect to. Failing to find a host results in the All host
pools marked down messages. These messages recur for a period ranging
from several seconds up to almost 15 minutes, clustering around two to
three minutes. Then connectivity returns and when Hector tries to
reconnect it succeeds.

The clients are instances of a JBoss 5 web application. We use Hector
0.7.0-29 (plus a patch that was pulled in advance of -30). The
Cassandra cluster has 72 nodes split between two datacenters. It's
running 0.7.5 plus a couple of bug fixes pulled in advance of 0.7.6.
The keyspace uses NetworkTopologyStrategy and RF=6 (3 in each
datacenter). The clients are reading and writing at LOCAL_QUORUM to
the 36 nodes in their own data center. Right now the second datacenter
is for failover only, so there are no clients actually writing there.

There's nothing else obvious in the JBoss logs at around the same
time, e.g. other application errors or GC events. The Cassandra
system.log files at INFO level show nothing out of the ordinary. I
have a capture of one of the incidents at DEBUG level where again I
see nothing abnormal looking, but there's so much data that it would
be easy to miss something.

Other observations:
* It only happens on weekdays (Our weekends are much lower load)
* It has occurred every weekday for the last month except for Monday
May 30, the Memorial Day holiday in the US.
* Most days it occurs only once, but six times it has occurred twice,
never more often than that.
* It generally happens in the late afternoon, but there have been
occurrences earlier in the afternoon and twice in the late morning.
Earliest occurrence is 11:19, latest is 18:11. Our peak loads
are between 10:00 and 14:00, so most occurrences do *not* correspond
with peak load times.
* It only happens on a single client JBoss instance at a time.
* Generally, it affects a different host each day, but the same host
was affected on consecutive days once.
* Out of 40 clients, one has been affected three times, seven have
been affected twice, 11 have been affected once and 21 have not been
affected.
* The cluster is lightly loaded.

Given that the problem affects a single client machine at a time and
that machine loses the ability to connect to the entire cluster, it
seems unlikely that the problem is on the C* server side. Even a
network problem seems hard to explain: given that the clients are on
the same subnet, I would expect all of them to fail if it were a
network issue.

I'm hoping that perhaps someone has seen a similar issue or can
suggest things to try.

Thanks in advance for any help!

Jim


Re: api to extract gossiper results

2011-06-29 Thread Edward Capriolo
A simple solution is to set up log4j at DEBUG level for Gossip events.

You can also use the StorageProxy/Fat client and then participate in gossip.
Each system has its own converging view of the ring, thus what your local
gossip thinks is the topology may not be the same across the cluster.

Edward

On Wed, Jun 29, 2011 at 5:20 PM, A J s5a...@gmail.com wrote:

 Cassandra uses accrual failure detector to interpret the gossips.
 Is it somehow possible to extract these (gossip values and results of
 the failure detector) in an external system ?

 Thanks



Re: No Transactions: An Example

2011-06-29 Thread AJ


On 6/22/2011 9:18 AM, Trevor Smith wrote:
Right -- that's the part that I am more interested in fleshing out in 
this post.




Here is one way.  Use MVCC 
(http://en.wikipedia.org/wiki/Multiversion_concurrency_control).  A 
single global clean-up process would be acceptable since it's not a 
single point of failure, only a single point of accumulating back-logged 
work.  It will not affect availability as long as you are notified if 
that process terminates and restart it in a reasonable amount of time, 
and it will not affect the validity of subsequent reads.


So, you would have a "balance" column.  And each update will create a 
"balance_timestamp" column with a positive or negative value indicating a 
credit or debit.  Subsequent clients will read the latest value by doing 
a slice from "balance" to "balance_~" (i.e. all balance* columns).  
(You would have to work out your column naming conventions so that your 
slices return only the pertinent columns.)  Then, the clients would have 
to apply all the credits and debits to the balance to get the current 
balance.


This handles the lost update problem.

For the dirty read and incorrect summary problems, where others read data 
that is in the middle of a transaction that hasn't committed yet, I 
would add a final transaction column to a Transactions CF.  The key 
would be cf.key.column, e.g., Accounts.1234.balance, 1234 being 
the account # and Accounts being the CF owning the balance column.  
Then, a new column would be added for each successful transaction (e.g., 
after debiting and crediting the two accounts) using the same timestamp 
used in balance_timestamp.  So, now, a client wanting the current 
balance would have to do a slice for all of the transactions for that 
column and only apply the balance updates up to the latest transaction.  
Note, you might have to do something else with the transaction naming 
schemes to make sure they are guaranteed to be unique, but you get the 
idea.  If the transaction fails, the client simply does not add a 
transaction column to Transactions and deletes any balance_timestamp 
columns it added to the Accounts CF (or lets the clean-up process do 
it... carefully).
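
A rough in-memory Python sketch of the read side of this scheme, to make the bookkeeping concrete. Plain dicts and sets stand in for the Accounts row and the Transactions row; the zero-padded timestamp suffix, the integer amounts and the function name are assumptions for illustration, and no Cassandra client API is used. As described above, deltas are applied only up to the latest committed transaction timestamp.

```python
def current_balance(account_row, committed_timestamps):
    """Apply balance_<timestamp> deltas up to the latest committed transaction.

    account_row: dict of column name -> value for one Accounts row.
    committed_timestamps: timestamps recorded in the Transactions CF for this column.
    """
    base = account_row.get("balance", 0)
    if not committed_timestamps:
        return base
    latest = max(committed_timestamps)
    prefix = "balance_"
    total = base
    # Sorted iteration mimics a column slice from "balance_" to "balance_~".
    for name in sorted(account_row):
        if not name.startswith(prefix):
            continue
        ts = int(name[len(prefix):])
        if ts <= latest:
            total += account_row[name]
    return total


accounts_1234 = {
    "balance": 100,
    "balance_0000000001": -30,   # committed debit
    "balance_0000000002": 50,    # committed credit
    "balance_0000000003": -500,  # in flight: no Transactions column yet
}
transactions_1234_balance = {1, 2}  # timestamps present in the Transactions CF

print(current_balance(accounts_1234, transactions_1234_balance))  # 120
```

The in-flight debit is ignored until its transaction column appears, which is the dirty-read protection the message is after.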


This should avoid the need for locks and as long as each account doesn't 
have a crazy amount of updates, the slices shouldn't be so large as to 
be a significant perf hit.


A note about the updates.  You have to make sure the clean-up process 
processes the updates in order and only 1 time.  If you can't guarantee 
these, then you'll have to make sure your updates are idempotent and 
commutative.


Oh yeah, and you must use QUORUM read/writes, of course.

Any critiques?

aj


Re: Cannot set column value to zero

2011-06-29 Thread aaron morton
The extra () in the describe keyspace output is only there if the column 
comparator is BytesType; in that case the client tries to format the data as UTF8. 

Don't forget truncate takes snapshots, so check the snapshots dir and delete 
things if you are using it a lot for testing. 

The 0 == 1 thing does not ring any bells. Let us know if it happens again. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 30 Jun 2011, at 02:13, dnalls...@taz.qinetiq.com wrote:

 I had a strange problem recently where I was unable to set the value of a 
 column
 to '0' (it always returned '1') but setting it to other values worked fine:
 
 [default@Test] set Urls['rowkey']['status']='1';
 Value inserted.
 [default@Test] get Urls['rowkey'];
 = (column=status, value=1, timestamp=1309189541891000)
 Returned 1 results.
 
 [default@Test] set Urls['rowkey']['status']='0';
 Value inserted.
 [default@Test] get Urls['rowkey'];
 = (column=status, value=1, timestamp=1309189551407616)
 Returned 1 results.
 
 This was on a one-node test cluster (v0.7.6) with no other clients; setting
 other values (e.g. '9') worked fine. However, attempting to set the value back
 to '0' always resulted in a value of '1'.
 
 I noticed this shortly after truncating the CF.
 
 The column family was shown as follows below. One thing that looks odd is that
 on other test clusters the Column Name is followed by a reference to
 the index, e.g. Column Name: status (737461747573) - but here it isn't.
 
 I was wondering if there was some interaction between truncating the CF and 
 the
 use of a KEYS index? (Presumably it would be safer to delete all data
 directories in order to wipe the cluster during experimentation, rather than
 truncating?)
 
 Unfortunately I'm not sure how to recreate the situation as this was a test
 machine on which I played around with various configurations - but maybe
 someone has seen a similar problem elsewhere? In the end I had to wipe the 
 data
 and start again, and all seemed fine, although the index reference is still
 absent as mentioned above.
 
 [default@Test] describe keyspace;
 Keyspace: Test:
 ...
ColumnFamily: Foo
  default_validation_class: org.apache.cassandra.db.marshal.BytesType
  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period in seconds: 0.0/0
  Key cache size / save period in seconds: 0.0/14400
  Memtable thresholds: 0.5/128/60 (millions of ops/minutes/MB)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Built indexes: [Foo.737461747573]
  Column Metadata:
Column Name: status
  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
  Index Type: KEYS
 ...
 
 



Re: hadoop results

2011-06-29 Thread aaron morton
How about get_slice() with reversed = true and count = 1 to get the highest 
time UUID? 

Or you can also store a column with a magic name that has the value of the 
timeuuid that is the current metric to use. 
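
To illustrate why a reversed slice with count = 1 returns the newest stat, here is a small in-memory Python model. It is not the Thrift API: the `get_slice` below is a hypothetical stand-in that orders columns by the UUID timestamp only (a simplification of what TimeUUIDType does), and `timeuuid` is a helper written for this sketch.

```python
import uuid


def timeuuid(ticks, node=0x0123456789AB, clock_seq=0):
    """Build a version-1 UUID from a 60-bit timestamp in 100ns ticks."""
    return uuid.UUID(
        fields=(
            ticks & 0xFFFFFFFF,        # time_low
            (ticks >> 32) & 0xFFFF,    # time_mid
            (ticks >> 48) & 0x0FFF,    # time_hi (version bits are set below)
            (clock_seq >> 8) & 0x3F,   # clock_seq_hi (variant bits are set below)
            clock_seq & 0xFF,          # clock_seq_low
            node,
        ),
        version=1,
    )


def get_slice(columns, reversed_=False, count=100):
    """Return up to `count` (name, value) columns in comparator order.

    A real server keeps columns sorted already; sorting here just models that.
    """
    ordered = sorted(columns, key=lambda col: col[0].time, reverse=reversed_)
    return ordered[:count]


stats_row = [
    (timeuuid(1000), "41"),
    (timeuuid(3000), "43"),
    (timeuuid(2000), "42"),
]
latest = get_slice(stats_row, reversed_=True, count=1)
print(latest[0][1])  # 43
```

Because the columns are kept in time order, the reversed slice touches only the last column instead of iterating over the whole row.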

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 30 Jun 2011, at 06:35, William Oberman wrote:

 I'll start with my question: given a CF with comparator TimeUUIDType, what is 
 the most efficient way to get the greatest column's value?
 
 Context: I've been running cassandra for a couple of months now, so obviously 
 it's time to start layering more on top :-)  In my test environment, I 
 managed to get pig/hadoop running, and developed a few scripts to collect 
 metrics I've been missing since I switched from MySQL to cassandra (including 
 the ever useful select count(*) from table equivalent).  
 
 I was hoping to dump the results of this processing back into cassandra for 
 use in other tools/processes.  My initial thought was: new CF called stats 
 with comparator TimeUUIDType.  The basic idea being I'd store:
 stat_name - time stat was computed (as UUID) - value
 That way I can also see a historical perspective of any given stat for 
 auditing (and for cumulative stats to see trends).  The stat_name itself is a 
 URI that is composed of what and any constraints on the what (including 
 an optional time range, if the stat supports it).  E.g. 
 ClassOfSomething/ID/MetricName/OptionalTimeRange (or something, still 
 deciding on the format of the URI).  But, right now, the only way I know to 
 get the current stat value would be to iterate over all columns (the 
 TimeUUIDs) and then return the last one.
 
 Thanks for any tips,
 
 will



Re: Chunking if size > 64MB

2011-06-29 Thread aaron morton
AFAIK there is no server side chunking of column values.

This link http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage is 
just suggesting that your app should not store more than 64MB per column. 
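
Since any chunking has to happen in the application, a minimal Python sketch of one way to do it follows. The chunk_NNNNNNNN column naming and the helper names are assumptions for illustration; the example uses a tiny chunk size so it runs quickly, where a real application would pick a limit at or below the 64MB mentioned above.

```python
def split_into_columns(blob, chunk_size):
    """Map zero-padded chunk column names to consecutive slices of blob."""
    return {
        f"chunk_{start // chunk_size:08d}": blob[start:start + chunk_size]
        for start in range(0, len(blob), chunk_size)
    }


def reassemble(columns):
    """Concatenate chunks in column-name order to rebuild the original bytes."""
    return b"".join(columns[name] for name in sorted(columns))


blob = bytes(range(256)) * 10                 # 2560 bytes of sample data
columns = split_into_columns(blob, chunk_size=1000)

print(sorted(columns))              # ['chunk_00000000', 'chunk_00000001', 'chunk_00000002']
print(reassemble(columns) == blob)  # True
```

Zero-padding the index keeps the chunks in the right order under a string comparator, so a single slice over the row returns them ready to concatenate.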

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 30 Jun 2011, at 07:25, A J wrote:

 From what I read, Cassandra allows a single column value to be up-to
 2GB but would chunk the data if greater than 64MB.
 Is the chunking transparent to the application or does the app need to
 know if/how/when the chunking happened for a specific column value
 that happened to be > 64MB.
 
 Thank you.